Pierre Lalanne

ISAE-Supaéro aeronautics engineering graduate
specializing in data science

Data Scientist | GitGuardian | Team Secrets

As you probably know, we are, at our core, a company specializing in detecting secrets (if you don’t know what a secret is, please take a moment here and come back).
Very early on, we had to address the question: what would be a good way to categorize secrets?

Take a look at this:

AWS_ACCESS_KEY_ID = AKIAX52MPYOTPRUCRC22
AWS_SECRET_ACCESS_KEY = hjshnk5ex5u34565d4654HJKGjhz545d89sjkjak

and this:

connect_to_db(host="136.12.43.86", port=8130, username="root", password="m42ploz2wd")

You can spot the difference: the first one is tied to a well-identified service, AWS, while for the second, things are a bit blurrier. We immediately understand that it has to do with a database connection but, without further context, it doesn’t tell us what doors it opens.

That's the most basic distinction we can make when scanning source code: some secrets are specific, since they are somehow self-revealing, while others are said to be generic, because we cannot be so sure what they give access to.

In this article, we explore why detecting generic credentials is an absolute must-have for a secrets detection engine. We also explain how we addressed this topic at GitGuardian, and share some insights from our findings.

Specific detection has advantages but is not sufficient

Detecting specific credentials has at least two big advantages over detecting generic ones.

First of all, we are often able to test the validity of specific credentials, which can give us 100% confidence in the secret’s validity. This ensures a very good precision overall.

Second, they are often associated with very well known patterns, sometimes even prefixed ones, or at least a very specific context. This means a very good recall is easy to obtain.
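To make the “prefixed pattern” idea concrete, here is a minimal Python sketch. It assumes an access key ID made of the AKIA prefix followed by 16 uppercase letters or digits, as in the example at the top of this article. The regular expression and the function name find_aws_key_ids are illustrative assumptions, not the actual detector.

```python
import re
from typing import List

# Assumed pattern: "AKIA" prefix followed by 16 uppercase letters or digits.
AWS_KEY_ID_RE = re.compile(r"\b(AKIA[A-Z0-9]{16})\b")


def find_aws_key_ids(text: str) -> List[str]:
    """Return every substring that looks like a prefixed access key ID."""
    return AWS_KEY_ID_RE.findall(text)


line = "AWS_ACCESS_KEY_ID = AKIAX52MPYOTPRUCRC22"
print(find_aws_key_ids(line))
```

A prefix this distinctive leaves very little room for missed matches, which is why recall is comparatively easy to obtain for this kind of credential.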

But don’t be fooled: being “the easy part of the game” doesn’t make them any less valuable. With a specific detector, the user can be given very detailed information about the exposed secret, the risks incurred, and the correct way to revoke the credentials. That’s what we strive to do in our public documentation for the 300+ detectors we cover.

Although we have drastically improved our average time to develop a new specific detector, shrinking it from 2 days to 2 hours, scaling this list is not easy. The reason is straightforward: the number of API providers is growing at a very fast rate.


But what about the other category? As mentioned earlier, some of the credentials we find simply cannot be linked to any particular service: think of contextless passwords, username and password combinations for an internal service, or just an API key with a very generic name. We estimate that almost half of the secrets we find belong to this category.

💡
You may be wondering what happens when the same credentials are detected by both a generic detector and a specific detector. In that case, GitGuardian always gives priority to the specific detector, for the reasons listed above. This overlap is not an issue. It is rather a sign that our generic detection performs well and can act as a failsafe if something goes wrong with the specific detector in question.
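The priority rule described in this note could be sketched as follows. This is only an illustration: the Finding type, its fields, the detector names, and the prioritize function are hypothetical and say nothing about how the engine is really structured.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    value: str         # the matched secret
    detector: str      # hypothetical detector name
    is_specific: bool  # True for a specific detector, False for a generic one


def prioritize(findings: List[Finding]) -> List[Finding]:
    """Keep one finding per secret value, preferring specific detectors."""
    best = {}
    for finding in findings:
        current = best.get(finding.value)
        if current is None or (finding.is_specific and not current.is_specific):
            best[finding.value] = finding
    return list(best.values())


findings = [
    Finding("AKIAX52MPYOTPRUCRC22", "generic_high_entropy", False),
    Finding("AKIAX52MPYOTPRUCRC22", "aws_key", True),
]
print(prioritize(findings))
```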

You get where this is going: if a secrets detection engine wants to achieve the best possible precision AND recall, it needs a tailored and powerful detection for generic credentials.

How it’s done at GitGuardian

Why generic detection is not so easy to do…

As the name suggests, when looking for generic credentials, the contextual information we are looking for is… generic. Narrowing down candidates is therefore a bit more complicated. For instance, targeting all the password keywords is obviously not as effective a filter as targeting files containing aws and client_secret.

The difficulty of detecting generic credentials comes from three factors. First, they vary widely and follow very broad patterns: charset and length can be almost anything. Second, how they are supposed to be used is also unknown. Third, even when the credential is clearly identified, we have no way to check its validity.

By the way, a quick reminder on the importance of having both a good precision and a good recall. Take for example this valid, generic, secret:

# Define variables
apikey = as.NbtuEaorueoFu435n&stau

We could certainly catch this one by filtering for all the random-looking strings in our engine (namely, high entropy strings). But we would also certainly catch a lot of random strings that are not secrets (think UUIDs, hashes...), ruining our precision rate. So entropy alone is not a sufficient criterion if we want to limit noise and save engineers from alert fatigue.
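To see why entropy alone falls short, here is a small Python sketch that computes the Shannon entropy (in bits per character) of a few strings taken from this article. The 3.5 threshold and the helper names are arbitrary assumptions chosen for the illustration.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_high_entropy(value: str, threshold: float = 3.5) -> bool:
    # The threshold is an arbitrary assumption, not a recommended setting.
    return shannon_entropy(value) >= threshold


secret = "as.NbtuEaorueoFu435n&stau"
uuid_like = "afe005ae-e4fa-4ec5-919a-93c32fd8268f"
placeholder = "changeme"

for candidate in (secret, uuid_like, placeholder):
    print(candidate, round(shannon_entropy(candidate), 2), is_high_entropy(candidate))
```

With this threshold, the secret and the UUID are both flagged while the placeholder is not: the filter has no way to tell a real secret from a harmless identifier.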

On the other hand, if the engine only targets very specific assignments like apikey = abc, we would miss a lot of generic credentials that are valid secrets, resulting in low recall. Worse, we never know for sure what proportion of secrets is missed (i.e. the rate of false negatives). For the user, this means a low level of confidence in the tool.
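As a reminder, precision is the share of reported findings that are real secrets, and recall is the share of real secrets that get reported. The Python sketch below applies these standard definitions to two imaginary engines. All the counts are made up for the illustration and are not measurements.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of reported findings that are real secrets."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real secrets that were reported."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Hypothetical entropy-only engine: catches most secrets, drowned in noise.
print(precision(90, 910), recall(90, 10))

# Hypothetical engine limited to strict assignments: clean, but misses most secrets.
print(precision(20, 1), recall(20, 80))
```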

Generic detection is a real challenge that requires techniques of its own. At GitGuardian, our approach is twofold. First, we maximize recall and avoid blind spots by looking for very broad assignments in source code. Second, we use powerful tools to sort the results and discard false positives efficiently, so as to guarantee high precision and avoid alert fatigue.

GitGuardian’s arsenal and tools

As mentioned earlier, the first important part of our approach is to detect a wide variety of assignments in source code. To do so, we came up with a large set of possible assignment forms inspired by many languages. Here are some tricky examples that we can detect:

'password': [mAEapzCoNVpwrCz6ErRvOZm0B7g]
pass -> "mAEapzCoNVpwrCz6ErRvOZm0B7g"
{"passwd": "mAEapzCoNVpwrCz6ErRvOZm0B7g"}
<config name="password"><value>mAEapzCoNVpwrCz6ErRvOZm0B7g</value>
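A very rough way to picture this broad assignment matching is a single permissive regular expression. The Python sketch below is an assumption-laden toy: the keyword list, the operators, and the value charset were picked only so that the four examples above match, and a real engine would need far more than this.

```python
import re
from typing import List, Tuple

# Hypothetical keywords and operators, chosen only to cover the examples above.
ASSIGNMENT_RE = re.compile(
    r"""
    (?P<name>[A-Za-z_]*(?:pass(?:word|wd)?|secret|api_?key|token)[A-Za-z_]*)
    ["']?\s*                            # optional closing quote around the name
    (?:=>|->|:=|=|:|><value>)           # a few assignment operators, plus one XML form
    \s*["'\[(]?                         # optional opening delimiter before the value
    (?P<value>[A-Za-z0-9_\-+/=.&]{8,})  # candidate value: 8+ non-delimiter characters
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_assignments(line: str) -> List[Tuple[str, str]]:
    """Return (variable name, candidate value) pairs found in the line."""
    return [(m.group("name"), m.group("value")) for m in ASSIGNMENT_RE.finditer(line)]


examples = [
    "'password': [mAEapzCoNVpwrCz6ErRvOZm0B7g]",
    'pass -> "mAEapzCoNVpwrCz6ErRvOZm0B7g"',
    '{"passwd": "mAEapzCoNVpwrCz6ErRvOZm0B7g"}',
    '<config name="password"><value>mAEapzCoNVpwrCz6ErRvOZm0B7g</value>',
]
for example in examples:
    print(find_assignments(example))
```

Returning the variable name along with the value matters: later filtering steps can use the name to decide whether the value is worth keeping.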

Having this capability significantly improves our recall. Then an important part of our work is to discard false positives as early as possible in the process. At GitGuardian, we designed a wide arsenal of post validation steps to decide whether a secret should be processed any further or not. Here are details about some of these so-called post-validators.

ContextWindowPostValidator:
This post-validator bans irrelevant matches based on contextual information. For instance, we consider that a match with pubkey in its close context can safely be discarded.

set_pubkey(key="mAEapzCoNVpwrCz6ErRvOZm0B7g")
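The idea can be sketched in a few lines of Python. This is not the real ContextWindowPostValidator: the function name, the 20-character window, and the single banned word are assumptions made for the example.

```python
from typing import Sequence


def passes_context_window(
    text: str,
    start: int,
    end: int,
    banned_words: Sequence[str] = ("pubkey",),
    window: int = 20,  # assumed window size, in characters
) -> bool:
    """Return False when a banned word appears close to the match text[start:end]."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    context = (before + " " + after).lower()
    return not any(word in context for word in banned_words)


line = 'set_pubkey(key="mAEapzCoNVpwrCz6ErRvOZm0B7g")'
value = "mAEapzCoNVpwrCz6ErRvOZm0B7g"
start = line.index(value)
print(passes_context_window(line, start, start + len(value)))
```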

CommonValuesBanlist:
This post-validator leverages dynamic banlists that are computed and adjusted according to the live monitoring of GitHub. More specifically, we are looking for example keys, or patterns that are so common that we consider them invalid secrets. Here are some simple examples:

placeholder
example
passphrase
changeme

And many other common values for passwords, usernames or high entropy values.
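One simple way to picture a dynamic banlist is a frequency count over observed candidate values: anything seen too often is treated as an example or placeholder. The Python sketch below assumes exactly that. The min_occurrences threshold and both function names are hypothetical, and the real banlists are certainly computed differently.

```python
from collections import Counter
from typing import Iterable, Set


def build_banlist(observed_values: Iterable[str], min_occurrences: int = 3) -> Set[str]:
    """Ban every value (case-insensitive) seen at least min_occurrences times."""
    counts = Counter(value.lower() for value in observed_values)
    return {value for value, n in counts.items() if n >= min_occurrences}


def is_common_value(value: str, banlist: Set[str]) -> bool:
    """True when the candidate is a known common value and should be discarded."""
    return value.lower() in banlist


observed = ["changeme"] * 5 + ["example"] * 3 + ["mAEapzCoNVpwrCz6ErRvOZm0B7g"]
banlist = build_banlist(observed)
print(sorted(banlist))
print(is_common_value("ChangeMe", banlist))
```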

AssignmentBanlistPostValidator:
That’s a very powerful and unique feature of GitGuardian’s secrets detection engine. For each language, we are able to identify the variable to which a secret was assigned, if it exists. We can then ban some patterns in the name of that variable. For instance, a variable name containing “uuid” suggests that the matched value is not a secret but an identifier.

ORDER_TOKEN_UUID = 'afe005ae-e4fa-4ec5-919a-93c32fd8268f'
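Once the variable name is known, the check itself is simple. Here is a minimal Python sketch, assuming the name has already been extracted. The function name is hypothetical, and the banlist holds only the “uuid” fragment mentioned above.

```python
from typing import Optional, Sequence

# Only "uuid" comes from the article; any other fragment would be an assumption.
BANNED_NAME_FRAGMENTS = ("uuid",)


def passes_assignment_banlist(
    variable_name: Optional[str],
    banned_fragments: Sequence[str] = BANNED_NAME_FRAGMENTS,
) -> bool:
    """Return False when the variable name contains a banned fragment."""
    if variable_name is None:
        # No assignment variable was identified, so there is nothing to ban.
        return True
    name = variable_name.lower()
    return not any(fragment in name for fragment in banned_fragments)


print(passes_assignment_banlist("ORDER_TOKEN_UUID"))
print(passes_assignment_banlist("DB_PASSWORD"))
```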

Key Figures and Insights

At the end of the day, GitGuardian has developed more than ten generic detectors, scoring between 85% and 95% for precision according to our benchmarks. These are the most common:

Type                            | % of Generic Secrets
Generic Password                | ~20%
Generic Database Assignment     | ~20%
Generic High Entropy Secret     | >50%
Generic Username/Password       | 3%
Generic Company Email/Password  | 1%

Overall, generic detectors account for 45.4% of all the secrets we detect. This means that any secrets detection solution that does not implement generic detection algorithms misses almost half of the secrets out there.

Another interesting metric: close to 25% of the secrets found by specific detectors would also have been found by generic detectors. This indicates that generic detectors are a very good fallback in case a specific detector misbehaves or simply does not exist yet.

Finally, our efforts to improve generic detection brought up interesting side-effects that we were able to exploit:

  • Detecting pattern drift:
    When we detect that a specific detector yields fewer credentials over time while, in the meantime, generic credentials appear with the name of the provider concerned in their context, we can conclude that the pattern for this provider has changed. This has proven very useful for staying up to date with vendors’ changes.
  • Detecting new candidates for specific detectors:
    If, all of a sudden, a lot of generic credentials mention a given word in their context, we can conclude that it corresponds to a new API provider gaining popularity. We even have an internal tool to infer the pattern for the credentials concerned and keep up with developers’ practices as fast as possible.
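The pattern drift signal described in the first point can be pictured as a comparison between two periods. The Python sketch below is purely illustrative: the function name, the ratios, and the counts are assumptions, not the heuristics we actually run.

```python
def looks_like_pattern_drift(
    specific_before: int,
    specific_after: int,
    generic_before: int,
    generic_after: int,
    drop_ratio: float = 0.5,  # assumed: specific findings at most halved
    rise_ratio: float = 2.0,  # assumed: generic findings at least doubled
) -> bool:
    """Flag a provider whose specific findings drop while generic ones
    mentioning the provider in their context rise over the same period."""
    specific_dropped = specific_after <= specific_before * drop_ratio
    generic_rose = generic_after >= max(1, generic_before) * rise_ratio
    return specific_dropped and generic_rose


# Made-up counts for one provider over two consecutive periods.
print(looks_like_pattern_drift(100, 30, 5, 40))
print(looks_like_pattern_drift(100, 95, 5, 6))
```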

Conclusion

Implementing solid generic detection capabilities significantly improves recall while keeping very good precision. It is therefore a huge competitive advantage compared to other tools. What’s more, generic detection offers some peace of mind for our customers: we may not have a specific detector targeting a given kind of secret but, in most cases, our generic detectors have our customers’ backs, and we keep getting better at detecting generic credentials.