Last year, we set ourselves the goal of building the largest secrets detection library, helping developers catch all sorts of secrets, one commit at a time: API keys, SSH credentials, database connection strings, generic passwords, and much more.
If you use GitGuardian or regularly follow our product updates, you are already familiar with the pace at which our R&D team is delivering. But we just didn’t expect our engineers to be THIS fast. Earlier this week, they released their 300th detector 🎉
Our library has now grown large enough for us to have a go at some basic spreadsheet formulae such as SUM() and COUNTIF(). So we crunched the numbers and thought we could share some of the findings with you!
Update 2021-11-03: After the publication of this article, our engineering team introduced a minor change in how secrets detectors are categorized. Previously, each detector corresponded to an algorithm that detects one type of secret. From now on, different detectors (i.e., algorithms) looking for the same secret, a GitLab personal access token for example, are grouped together. The GitGuardian library still boasts more than 300 secrets detectors, now organized into 260 groups.
Show me the numbers
First, how did we get to 300 detectors?
Through consistent and regular releases. With an R&D engineering team focused on building the best algorithms in the industry to detect secrets and credentials, we have released an average of 20 detectors every quarter.
Given the cumulative total of detectors released, sustaining this pace is nothing short of a challenge, even for a five-strong team of engineers.
To give you a rough idea of what goes into every release, here’s what a typical development lifecycle looks like, from inception to release:
- What is this secret associated with? Does it belong to an identified service?
- What does it give access to? How are developers using it in their code?
Design & build
- How should we detect this secret? Using regex rulesets? Applying advanced high-entropy filtering methods?
- What is the context in the code surrounding the secret?
- How can we harness this context to enhance the capabilities of our detector?
- Which training dataset should we run the detector against?
- What can we learn from the results of this test to optimize for precision and performance in our next iteration?
- Is this new detector going to add a strain on our computing resources?
- How can we ensure its high availability?
- In the weeks and months following the release, how can we further enhance the speed and precision of the detector?
Moreover, our R&D engineering team scrupulously abides by the following principle: “Every detector is entitled to maintenance and performance optimization in the long run — it is part of the lifecycle.”
With an ever-growing library, we were afraid maintenance would put a dent in the rate at which new detectors are released. But with hindsight, the steadiness of the slope (20 new detectors released per quarter) clearly shows this was an overstated fear and that our team has put in place robust processes to scale its operations.
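To make the design questions above more concrete, here is a minimal sketch of how a regex rule, an entropy filter, and surrounding context could be combined. It is an illustration only and not GitGuardian's implementation: the pattern, the `CONTEXT_HINTS` list, the `ENTROPY_THRESHOLD` value, and the function names are all assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Hypothetical rule: an identifier assigned a quoted value of 16+ characters.
ASSIGNMENT = re.compile(
    r"""(?P<name>[A-Za-z_][A-Za-z0-9_]*)\s*[:=]\s*["'](?P<value>[^"']{16,})["']"""
)

# Hypothetical context hints: words in the variable name that suggest a secret.
CONTEXT_HINTS = ("key", "token", "secret", "password", "passwd", "pwd")

# Arbitrary illustrative cut-off, in bits per character.
ENTROPY_THRESHOLD = 3.5


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(source: str) -> list[tuple[str, str]]:
    """Return (variable name, value) pairs that look like hardcoded secrets."""
    hits = []
    for match in ASSIGNMENT.finditer(source):
        name, value = match.group("name"), match.group("value")
        # Context check: does the variable name hint at a credential?
        has_context = any(hint in name.lower() for hint in CONTEXT_HINTS)
        # Entropy check: random-looking strings score higher than words.
        if has_context and shannon_entropy(value) >= ENTROPY_THRESHOLD:
            hits.append((name, value))
    return hits
```

In this sketch, a line such as `api_key = "q8Zr2Lx9Vb4Tn7Wm3Kp6Hd1Js5"` is flagged, while a long but repetitive value or a variable with an unrelated name is ignored. Requiring both signals trades some recall for precision, which is the kind of trade-off the testing questions in the list are meant to measure.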
What does a tech stack look like in 2021?
In the pie chart above, we grouped the secrets detectors by category. With a total of 300 detectors, we’re able to show how modular the modern tech stack is. Developers no longer build web applications from scratch; instead, they rely on a wide range of tools to:
- Host and manage their code with Version Control Systems like GitLab, GitHub or Bitbucket
- Build and run their applications with cloud platforms like AWS or DigitalOcean
- Authenticate users with identity providers like Auth0 or Okta
- Store data with MongoDB or Amazon Redshift
- Communicate with users with Twilio or Vonage APIs
- Accept payments with Stripe or Braintree
- Monitor their applications and services with Datadog or New Relic
- Handle custom business logic with multiple internal APIs and microservices
The list above barely scratches the surface. Developers also run and maintain the tools to manage internal operations, empowering sales and marketing with CRM integrations, product teams with analytics, and everyone else with collaboration and instant-messaging tools.
Let’s also take a look at the categories that lead in terms of leaked secrets. For each detector, we computed its frequency, i.e., the number of occurrences it catches per million commits. For example, our extensive research on public GitHub shows that about 992 Google API keys are leaked for every million commits.
Aggregating these frequencies by category of detectors gives us the weight of each category in real life. These are the types of secrets developers are most likely to slip into their source code.
Over half of these fall into the ‘Other’ category, meaning that they belong to the class of generic secrets. These secrets do not directly reveal what service they are associated with, but they remain sensitive and need to be caught: username and password pairs or generic database assignments, for example.
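The frequency and category-weight calculation described above can be sketched in a few lines. The detector names, category labels, and counts below are hypothetical and chosen only for illustration. The one figure taken from this article is the rate of about 992 Google API keys per million commits, which the sample counts are built to reproduce.

```python
from collections import defaultdict

# Hypothetical sample: number of commits scanned and occurrences per detector.
COMMITS_SCANNED = 2_000_000
DETECTIONS = {
    # detector name: (category, occurrences found)
    "google_api_key": ("cloud_provider", 1_984),
    "stripe_key": ("payment", 100),
    "generic_password": ("other", 3_000),
}


def per_million(occurrences: int, commits: int) -> float:
    """Frequency of a detector: occurrences caught per million commits."""
    return occurrences * 1_000_000 / commits


def category_weights(detections: dict, commits: int) -> dict:
    """Sum detector frequencies per category, then normalize to shares."""
    totals = defaultdict(float)
    for category, count in detections.values():
        totals[category] += per_million(count, commits)
    grand_total = sum(totals.values())
    return {category: freq / grand_total for category, freq in totals.items()}


weights = category_weights(DETECTIONS, COMMITS_SCANNED)
```

Normalizing by the sum of all frequencies turns each category total into a share of all leaked secrets. With these made-up counts, the generic `other` category ends up with more than half of the weight.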
We recently wrote a detailed article on why we believe detecting generic secrets should be part of any application security program and how we solve it at GitGuardian. Read on to find out more — Why detecting generic credentials is a game-changer.
The list of secrets and credentials GitGuardian can catch in your source code is available here: detectors and their supported credentials. If you can't find the detector you are looking for, fill in the form below or drop me a line at firstname.lastname@example.org.