Building reliable secrets detection - Secrets in source code (episode 3/3)
In our last two posts, we took a deep dive into how secrets sprawl and why secrets inside git are such a problem. Both of those articles brought up automated secrets detection as part of the solution.
At GitGuardian, we are always eager to transparently share the technical details of what we do. This article explains how our algorithms detect secrets and what we have learnt from scanning, literally, billions of commits.
A recap on secret sprawl
Secrets, like API keys, credentials, and security certificates, are the crown jewels of organizations. They provide access to the most sensitive systems and data. But there is a conundrum we face when dealing with secrets: they need to be tightly controlled and secured, yet they also need to be widely distributed to team members, applications, and infrastructure. This results in secrets sprawl: secrets saved locally, shared through messaging systems or internal wikis, hardcoded into source code...
These secrets can be buried deep in systems like the command-line history of your most widely accessed server, application logs, or the git history, making them very difficult to detect.
Why detecting secrets is challenging
At first glance, the challenge of detecting secrets may seem obvious: identify specific patterns of known keys within code. The reality is much more complex. At GitGuardian, we launched our first detection algorithms in 2017 with much the same optimism. Since then, our algorithms have been scanning every single public commit made to GitHub. To get an idea of the magnitude, that is nearly 1 billion commits per year, and today we detect over 1 million secrets each year. This scale of data, combined with feedback loops, has allowed for some truly fascinating discoveries about how secrets move through code and which indicators point to a true positive.
Secrets detection is probabilistic: it is not always possible to determine with certainty what is a true secret (a true positive). We need to weigh the probability that a candidate is a true secret based on different parameters and indicators. Some secrets have fixed patterns, but most do not: they come in different lengths, use a variety of character sets, and appear in a variety of contexts. This makes it extremely challenging to accurately capture all true secrets without also capturing false positives.
At some point, a line in the sand needs to be drawn that weighs the cost of a secret going undetected (a false negative) against the noise that too many false positives would create. At what point does the tool lose its effectiveness?
How to detect secrets
Secrets detection is a two-step process that starts with finding potential secret candidates. The second, and most delicate, step is effectively filtering those results to exclude false positives by factoring in different indicators. We will first run through the common methods used to discover and filter secret candidates, and then use GitGuardian as a case study to show what we do over and above this.
Step 1. Detect secret candidates
Looking at the current solutions available, there are two well-known approaches to detecting secret candidates in code: detecting high-entropy strings, and using regular expressions (regex) to detect secrets with known patterns (what we call prefixed or postfixed secrets).
Detecting high entropy strings
High-entropy strings are computer-generated strings whose characters appear random; the more random the string, the higher its entropy. Secrets use high entropy so that different services can independently issue secrets like API keys without worrying about creating a conflict (two identical keys).
You may be interested to learn that you can actually measure the entropy of a string.
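As a minimal sketch, the snippet below computes the Shannon entropy of a string in Python. The comparison values are only illustrative; a real scanner would tune its threshold to the charset and expected length of each key type.

```python
import math
from collections import Counter

def shannon_entropy(data: str) -> float:
    """Return the Shannon entropy of a string, in bits per character."""
    if not data:
        return 0.0
    length = len(data)
    counts = Counter(data)
    return -sum((c / length) * math.log2(c / length) for c in counts.values())

# A random-looking API key scores noticeably higher than an ordinary English word.
print(shannon_entropy("sk_live_3hmB4s6o0a62C7vrsK00sBJPb3z4CzY9GSEz1dfMtloMec9LpD949IbDPwbeW"))
print(shannon_entropy("development"))
```

A scanner built on this idea simply flags strings whose entropy exceeds a tuned threshold.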
Using regular expression to detect secrets
In addition to searching for high entropy strings, the other common method is to find keys that have some kind of definable, distinctive pattern.
For example, Stripe keys, which are highly sensitive, are prefixed keys: they all begin with the same characters, ‘sk_live_’. Using a regular expression (regex), we can create specific search criteria for these keys.
stripe key: sk_live_3hmB4s6o0a62C7vrsK00sBJPb3z4CzY9GSEz1dfMtloMec9LpD949IbDPwbeW
This key is an example key of course :)
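As a minimal sketch, a detector for this kind of prefixed key could look like the snippet below. The character class and length bounds are illustrative assumptions, not Stripe's official key specification.

```python
import re

# Illustrative pattern: 'sk_live_' followed by a run of alphanumeric characters.
# The exact length and charset of real Stripe keys may differ.
STRIPE_LIVE_KEY = re.compile(r"\bsk_live_[0-9a-zA-Z]{24,99}\b")

def find_stripe_candidates(text: str) -> list[str]:
    """Return every substring that matches the prefixed-key pattern."""
    return STRIPE_LIVE_KEY.findall(text)

print(find_stripe_candidates(
    'STRIPE_KEY = "sk_live_3hmB4s6o0a62C7vrsK00sBJPb3z4CzY9GSEz1dfMtloMec9LpD949IbDPwbeW"'
))
```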
How the methods compare
| Method | Pros | Cons |
| --- | --- | --- |
| Entropy: look for strings that appear random | Good for penetration testing, open-sourcing a project, or bug bounties, because it surfaces a lot of results. These results must be reviewed manually. | Lots of false alerts (URLs, file paths, database IDs, and other hashes frequently have high entropy), which makes it impossible to use this method alone in an automated pipeline. Some keys are inevitably missed, because the entropy threshold to apply depends on the charset used to generate the key and its length. |
| Regular expressions: match known, distinct patterns | Low number of false alerts. Known patterns make it easier to later check whether the secret is valid, or whether it is an example or test key (see Step 2). | Unknown key types will be missed. Credentials without a distinct pattern will also be missed, which means lots of missed credentials! Think of passwords, which can be virtually any string in many possible contexts, or APIs that don't have a distinct format. |
As you might expect, neither approach is generally better than the other: reliable secrets detection should use both methods, for different scenarios and different secrets. Regular expressions are only possible for a limited number of secret types; high entropy should be used to capture a much larger range of secret types. I will go into detail on how GitGuardian does this later.
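As a rough sketch of how the two approaches can be layered, the snippet below runs the specific regex first, then falls back to a generic entropy scan on long quoted strings. It assumes the shannon_entropy helper and STRIPE_LIVE_KEY pattern from the earlier sketches are in scope, and the generic pattern and the 4.0 bits-per-character threshold are illustrative.

```python
import re

# Generic pattern: a long, quoted string assigned somewhere in the code.
ASSIGNED_STRING = re.compile(r"""['"]([A-Za-z0-9_\-+/=]{20,})['"]""")

def detect_candidates(text: str, entropy_threshold: float = 4.0) -> list[tuple[str, str]]:
    """Return (method, candidate) pairs found either by a specific regex or by entropy."""
    regex_hits = set(STRIPE_LIVE_KEY.findall(text))
    candidates = [("regex", hit) for hit in regex_hits]
    for value in ASSIGNED_STRING.findall(text):
        if value in regex_hits:
            continue  # already reported by the more specific detector
        if shannon_entropy(value) >= entropy_threshold:
            candidates.append(("entropy", value))
    return candidates
```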
Step 2: Filter bad candidates
Finding potential secret candidates is only part of the solution. You now need to filter out the false positives and keep the true positives. This is very challenging, because we must aggregate various weak signals surrounding the candidate to determine whether it is in fact a true positive or a false positive. For example, the candidate could be a placeholder key, a high-entropy string used as a unique identifier, a public key, or even a URL; it is nearly impossible to tell just by looking at the string itself.
There are three methods to use when filtering secret candidates:
- Validate the candidate by doing an API call
- Use a dictionary of anti-patterns
- Look for known sensitive patterns inside the code (weak signals)
Again, just as with the detection methods, each of these has pros and cons, and they should be used in combination when filtering specific secrets (a small sketch of the anti-pattern filter follows the table below).
| Method | Pros | Cons |
| --- | --- | --- |
| Look for known sensitive patterns in the context of the candidate. The idea is to aggregate weak signals: for example, a sensitive filename, combined with an assignment variable containing the word “key”, and the import of a Python wrapper for the Datadog API. | Often allows a presumed credential to be associated with a given service, based on the code surrounding it. This is helpful for validating the candidate with an API call (see the next method!). | The notion of “context” is difficult to define (think of a large commit patch or file, or a variable declared in one location and used somewhere else in the repository). |
| Validate the candidate by making an API call against the associated service. | There can be no more doubt: your candidate is valid! Plus, you can use the opportunity to gather information about the permissions associated with the key and the account owner. This information is useful for prioritization and remediation purposes. | You need to know the associated service, or at least come up with a list of potential services. Not all credentials can easily be checked programmatically: think of OAuth strings, private keys, usernames and passwords, ... Some services are not accessible from everywhere (for example, from outside a given private network), so the credential might be considered invalid despite still posing a threat. |
| Use a dictionary of anti-patterns to get rid of example or test keys. The presumed credential should not contain linguistic sequences of characters. | Allows filtering out certain credentials, like those containing “EXAMPLE”, “TEST”, or “XXXX”, or those found in test files or directories. | There is no real con: this method is always good to implement, but it won't be able to filter all example or test keys. |
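To make the anti-pattern method from the table concrete, here is a minimal sketch. The pattern list, path hints, and helper function are illustrative placeholders, not GitGuardian's actual dictionary.

```python
import re

# Illustrative anti-patterns; a real dictionary would be far larger.
ANTI_PATTERNS = [
    re.compile(r"EXAMPLE", re.IGNORECASE),
    re.compile(r"TEST", re.IGNORECASE),
    re.compile(r"X{4,}"),        # placeholder runs like XXXXXXXX
    re.compile(r"(.)\1{5,}"),    # any character repeated six or more times
]

TEST_PATH_HINTS = ("test", "tests", "fixtures", "examples")

def looks_like_placeholder(candidate: str, file_path: str = "") -> bool:
    """Return True if the candidate is probably an example or test key."""
    if any(p.search(candidate) for p in ANTI_PATTERNS):
        return True
    return any(part in TEST_PATH_HINTS for part in file_path.lower().split("/"))

print(looks_like_placeholder("sk_live_XXXXXXXXXXXXXXXXXXXXXXXX"))                     # True
print(looks_like_placeholder("sk_live_3hmB4s6o0a62C7vrsK00sBJP", "src/billing.py"))   # False
```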
GitGuardian as a case study
Now that we have established the common methods for detecting and filtering secret candidates, we can go into exactly how GitGuardian implements them, and the secret sauce behind the algorithm.
It will likely be no surprise when I say that GitGuardian uses all of the methods outlined above. But what will be of interest is how each is implemented.
Monolith vs. specific detectors
A key difference in GitGuardian's detection capabilities is the concept of building specific detectors. This allows us to select the most effective method of detection and filtering for each specific secret.
Building a single monolithic algorithm to detect all secret candidates and batch-filter them makes it difficult to make any adjustments: improving precision and recall on one secret type could mean degrading the results for others.
I liken this to trying to fish with a cargo ship. You make adjustments to this huge monolith to target a specific type of fish, which takes a huge amount of resources. By the time the adjustments have been made, the ship is off course for all the other fish you are trying to catch. You end up making constant adjustments, at a large cost in resources, and never find the ideal position.
Compare this to the method GitGuardian adopts: it's like having hundreds of individual small boats, each for a specific type of fish. You can make as many alterations as you want to one boat without affecting any other.
What is important in this approach is that it is not just the detection method that is specific to each secret; the filtering method is too.
Having individual detectors means that no compromises need to be made when choosing the method of detection and filtering. It also means that a layered approach can be adopted, where secrets discovered through distinctive regular expressions are given more weight than secrets discovered through more generic entropy detection.
Filtering secret candidates, however, is where the individual-detector approach yields the strongest benefits. Of course, there are some universal indicators that a secret is not valid, for example if it is a URL. But there is a huge number of indicators influencing the likelihood of a true positive that are specific to individual secrets. For example, the sensitive patterns around the context of a secret may differ greatly between secret types, and different libraries of anti-patterns may have different effects on the results. Different dependencies, such as API wrappers, can change the true positive rate. Looking at each secret through its own, independent looking glass allows us to fine-tune our results to a level that is not possible with a universal approach.
The challenge in this strategy is, of course, aggregating all these weak signals for each individual secret, and to discover these often very subtle indicators we need to analyze enormous amounts of data.
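To make the idea of per-secret detectors more concrete, here is a minimal sketch in Python. It is not GitGuardian's internal code; the Detector class, the patterns, and the weights are invented for illustration, but it shows how each secret type can carry its own pattern, its own filters, and its own weight without touching any other detector.

```python
from dataclasses import dataclass, field
from typing import Callable
import re

@dataclass
class Detector:
    """One self-contained detector: its own pattern, its own filters, its own weight."""
    name: str
    pattern: re.Pattern
    filters: list[Callable[[str, str], bool]] = field(default_factory=list)  # each returns True to discard
    weight: float = 1.0  # distinctive prefixed patterns can be weighted above generic entropy hits

    def scan(self, text: str, file_path: str = "") -> list[tuple[str, str, float]]:
        findings = []
        for match in self.pattern.findall(text):
            if any(reject(match, file_path) for reject in self.filters):
                continue
            findings.append((self.name, match, self.weight))
        return findings

# Tuning one detector's pattern, filters, or weight leaves every other detector untouched.
DETECTORS = [
    Detector("stripe_live_key", re.compile(r"sk_live_[0-9a-zA-Z]{24,99}"), weight=0.9),
    Detector("generic_high_entropy", re.compile(r"['\"]([A-Za-z0-9+/=]{32,})['\"]"), weight=0.4),
]
```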
The unfair advantage
Secrets detection is, after all, probabilistic. Distinguishing between a true positive and a false positive is not a simple binary yes-or-no decision; it comes down to hundreds of influencing properties that are evaluated to determine the likelihood of a true positive.
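As a toy illustration of how such weak signals could be aggregated into a likelihood, the sketch below squashes a weighted sum of signals through a logistic function. The feature names, weights, and bias are invented for illustration and are not GitGuardian's actual model.

```python
import math

# Invented example features and weights, purely for illustration.
WEIGHTS = {
    "matches_known_prefix": 2.0,       # e.g. the sk_live_ pattern
    "assigned_to_sensitive_var": 1.2,  # variable name contains "key", "token", "secret"
    "relevant_import_nearby": 0.8,     # e.g. a Python wrapper for the issuing service
    "in_test_directory": -1.5,
    "contains_anti_pattern": -3.0,
}
BIAS = -1.0

def true_positive_likelihood(signals: dict[str, bool]) -> float:
    """Squash a weighted sum of weak signals into a 0..1 likelihood (logistic function)."""
    score = BIAS + sum(WEIGHTS[name] for name, present in signals.items() if present)
    return 1 / (1 + math.exp(-score))

print(true_positive_likelihood({"matches_known_prefix": True, "assigned_to_sensitive_var": True}))  # ~0.9
print(true_positive_likelihood({"contains_anti_pattern": True, "in_test_directory": True}))         # ~0.004
```

In practice, the weights themselves have to be learned and constantly re-tuned from feedback, which is where the volume of data matters.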
Anyone who has worked on probabilistic or classification algorithms knows that the key is having a huge amount of data to feed into the algorithm. In addition, secrets are constantly changing. Think of the ecosystem over the last 5 years: how many new services are you using now? Have external services changed the properties of their secrets? Are there new packages available for different services? You get the point: secrets detection algorithms need to change and be constantly improved and updated. This is where GitGuardian did something totally unfair. As mentioned at the start, GitGuardian started scanning all public GitHub commits: one billion commits a year, for over three years.
Each time GitGuardian discovered a potential secret, we were able to alert the developer and then gather feedback. Some feedback was explicit, such as marking the alert as a true or false positive, but we also factored in implicit feedback, like whether the repository was deleted following the alert. All of this information was fed back into the algorithm, which gave some expected and some totally unexpected results about the weak signals that influence the true positive rate. Without a similar scale of data, it is difficult to gather and weigh all the factors that influence secrets detection.
The good news is that you don't need to build this algorithm yourself to detect secrets. We have made our service available to anyone: you can test it on your git repositories with our GitHub app, or use our API and algorithm in your own scripts to find secrets anywhere.
Test our algorithm today

Key takeaways
The most common methods of detecting secrets are identifying high-entropy strings and using regular expressions to find definable patterns. Only a few key types can be identified with regex, as it requires each secret to follow a consistent pattern; other secrets can be identified by detecting strings with high entropy. Neither method produces completely accurate results on its own, so candidates need to be filtered. Common methods of filtering secrets include using dictionaries of anti-patterns, analyzing the context of a presumed secret, and validating credentials with an API call.
Filtering candidates is where much of the challenge behind secrets detection lies. Using GitGuardian as a case study, we can see that the best results are achieved by creating individual detectors and aggregating specific, often weak signals to effectively detect and filter potential secrets. These characteristics and weak signals are difficult to identify without analyzing a huge amount of data. This is the unfair advantage GitGuardian has: after scanning every public commit on GitHub since 2016, nearly 1 billion commits a year, it has been able to train its algorithms and detect weak signals based on that data.