Implementing a detector at GitGuardian: a use case with MongoDB credentials

Pierre Lalanne

ISAE-Supaéro aeronautics engineering graduate
specializing in data science

Data Scientist | GitGuardian | Team Secrets

In previous blog posts, we saw the subtleties of recall and precision when detecting secrets in source code, and we detailed GitGuardian’s engine philosophy. Today, we would like to share how our research team, the Secrets Team, develops and refines detectors at GitGuardian.

To illustrate this article, we will take the case of MongoDB credentials. MongoDB is a very popular NoSQL document-oriented database… and with fame come leaks… Our real-time detection engine continuously scans public events on GitHub and raises more than 3,000 alerts per week related to MongoDB credentials found in source code! Let’s see how we achieve this.

Context

MULTIMATCH DETECTION

To connect to a MongoDB database, a hostname is required to identify the server, and a username and password pair is used for authentication. Although a password leaked on its own is already a threat, an alert should be raised only when we find a complete secret made of these three matches.

Being able to detect such “multimatch” secrets is one of the many things that set GitGuardian’s detection engine apart. Working on multimatch secrets is a good opportunity to improve the overall precision of the algorithm, as only complete, hence severe, leaks will raise an alert. It is also a way of reducing alert fatigue, as matches are intelligently grouped under a single coherent alert instead of multiplying alerts for each part of the same secret.

Yet, this requires a very efficient and customizable detection engine as the combinations of possible matches can quickly become overwhelming.

PERFORMANCE FIRST

Since its foundation in 2017, GitGuardian has been a data-oriented company. We have been continuously scanning the live feed of GitHub events for more than 3 years, which means scanning over 10 million documents every day with a mean time to detect of a couple of seconds. That is why we are obsessed with the performance of our detection engine.

PRECISION AND RECALL

To achieve effective detection, we aim to:

  • Detect as many MongoDB credentials as possible: that is to say, we want a high recall.
  • Raise an alert only when we are sure we found MongoDB credentials: we want very few false positives, and we don’t want to see other database credentials, such as MySQL, MSSQL or Redis credentials, in our results. In other words, we want high precision.

AUTHENTICATION METHODS

To maximize our recall, we look at three distinct patterns of MongoDB authentication in source code:

  1. First of all, MongoDB credentials can be spread across the source code as variable assignments.
    DB_HOST="mongo.com"
    DB_PORT=5434
    DB_username="root"
    DB_password="m42ploz2wd"
    DB_NAME="paul"

2. Another common authentication method consists of a URI connection string that is fed to the driver used to connect to MongoDB.

CONNECTION_URI="mongodb://root:m42ploz2wd@mongo.com:5434/mydb"
client = new MongoClient(CONNECTION_URI)

3. Finally, connecting to a MongoDB instance can also be done via the mongo command-line interface.

mongo --username root --password m42ploz2wd --host mongo.com

Each of these authentication methods has its own dedicated detector. Let’s see how we built those.

Pre-validation

No matter which of these three detectors we are working on, some common building blocks help achieve great detection. When developing a new type of detector, our research team first tries to identify the typology of documents in which credentials are located. In other words, we want to be able to very quickly discard documents that never contain any MongoDB credentials.

In our specific case, we chose two criteria:

  • One on file extensions: we discard some file extensions, as these files are known not to contain credentials, for instance .storyboard, .css and .lock files.
  • One on file content: we discard files that don’t contain the word “mongo”, as we can safely infer that no MongoDB credentials can be found there.

This kind of heuristic can speed up our detection by a factor of up to 20 and remove many false positives! That’s a good trade-off when we are looking at a feed of more than 100 documents per second! Let’s stick to this configuration for the three types of authentication we are looking at.
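As a minimal sketch, the pre-validation step could look like the following (the helper name is hypothetical, and the extension list is just the examples from this section):

```python
# Pre-validation sketch: cheaply discard documents that cannot contain
# MongoDB credentials before running any expensive detector.

BANNED_EXTENSIONS = {".storyboard", ".css", ".lock"}  # known not to hold credentials

def should_scan(filename: str, content: str) -> bool:
    """Return True only if the document may contain MongoDB credentials."""
    if any(filename.endswith(ext) for ext in BANNED_EXTENSIONS):
        return False
    # Cheap substring check: no "mongo" means no MongoDB credentials.
    return "mongo" in content.lower()
```

Both checks are constant-cost string operations, which is what makes the 20x speed-up possible on a high-volume feed.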

MongoDB credentials in assignments

In this section, we want to develop a detector to catch a host, a username and a password that are all set in the same document.

About Assignments

An assignment is any code structure where a value is assigned to a variable. GitGuardian has developed a unique assignment matcher that can detect a wide variety of assignments, no matter which language they are written in. Here is a small list of possible assignments we would detect:

host = "s3cr3th0st" ⇒ variable is host, and value is "s3cr3th0st"
d = {"host": "s3cr3th0st"} ⇒ variable is host, and value is "s3cr3th0st"
setHost("s3cr3th0st") ⇒ variable is setHost, and value is "s3cr3th0st"
<input name="host" value="s3cr3th0st"> ⇒ variable is host, and value is "s3cr3th0st"

Broad Detector

As we can detect these patterns, let’s refine a bit and add some constraints to create a first version of our detector.

For each match (host, username and password), we want the assigned variable to have db, database or mongo as a prefix, and host, username or password as a suffix. Note that having db as a prefix is sufficient for credentials to be considered MongoDB credentials, as “mongo” has to be in the file content.
The matched value also has to follow a dedicated regex:

  • The host should either be a regular hostname, or an IPv4 address. We use standard regex practices for this.
  • The username and password have a custom length and charset that we have carefully identified as relevant by battle-testing our algorithms on GitHub history.
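To make these constraints concrete, here is an illustrative Python sketch. The exact charsets and length bounds used in production are GitGuardian’s own, tuned on real data, so the patterns below are assumptions:

```python
import re

# Illustrative broad-detector patterns: variable prefixed with db/database/mongo
# and suffixed with host/user(name)/pass(word); value constrained by a regex.
PREFIX = r"(?:db|database|mongo)"
HOSTNAME = r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"   # regular hostname
IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"                # IPv4 address

HOST_RE = re.compile(
    rf"{PREFIX}\w*_?host\s*=\s*\"?({HOSTNAME}|{IPV4})\"?", re.IGNORECASE
)
# Assumed length bounds (3-64) for username and password values.
USER_RE = re.compile(
    rf"{PREFIX}\w*_?user(?:name)?\s*=\s*\"?(\w{{3,64}})\"?", re.IGNORECASE
)
PASS_RE = re.compile(
    rf"{PREFIX}\w*_?pass(?:word)?\s*=\s*\"?([^\s\"]{{3,64}})\"?", re.IGNORECASE
)
```

Note how the prefix constraint alone rules out `redis_user`, and the minimum length rules out one-character placeholders, matching the examples below.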

WHAT WE WOULD MATCH:

HOST
mongo_host = pelixcluster.qzf9a.mongodb.net
mongo_host = mongocluster.example.com
mongo_host = 192.168.0.3
mongo_host = xxxxxxxxxxxx.mongodb.net
USERNAME
mongo_user = placeholder
mongo_user = robindoe96
mongo_username = rdoe@acmecompany.com
PASSWORD
mongo_pass = sup3rs3cr3t!P@ss
mongo_password = examplepassword
mongo_pass = password_here

WHAT WE WOULD NOT MATCH:

HOST
redis_host = 47.100.42.206 ⇒ (mongo, db or database are not in the variable name)
USERNAME
redis_user = robindoe96 ⇒ (mongo, db or database are not in the variable name)
mongo_user = x ⇒ Insufficient length
PASSWORD
mongo_password = xx ⇒ Insufficient length

Improvements with post-validation step

At this point, although the results are interesting, we would like to further filter the secrets by removing some values. Remember that in the end, we want to output multimatch secrets made of the combination of all caught hosts, usernames and passwords. To avoid a combinatorial explosion, we want to discard irrelevant matches as early as possible!

For this, GitGuardian’s engineering team has come up with a range of very efficient post validators.

For instance, we developed a post-validation step that removes any common irrelevant value for hosts. Thanks to our live scanning of the GitHub flow, we are able to dynamically adapt this list based on the results we get. It contains a variety of useless values such as example.com, dummy.com, .default, local IPs and many others! We apply the same kind of heuristic to the usernames and passwords previously matched.

We can also ban other custom values from the results. For instance, while running the current detector, we saw many false positive values of the following form:

creds = [config.mongo_host, config.mongo_username, config.mongo_password, config.ttl]

Therefore config is added to our banlist of values.

Eventually, we also ban values based on their Shannon entropy to get rid of some placeholders. For instance, a host like xxxxxxxxxxxxx.mongonet.com won’t have a high enough entropy to be considered valid. The threshold to use is defined empirically by testing the algorithm in real conditions.
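A minimal sketch of this entropy filter, assuming a hypothetical threshold (as noted above, the real one is tuned empirically):

```python
import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(value)
    total = len(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

ENTROPY_THRESHOLD = 3.0  # assumption: tuned empirically in practice

def is_placeholder(host: str) -> bool:
    # Long runs of repeated characters drag entropy down, flagging placeholders.
    return shannon_entropy(host) < ENTROPY_THRESHOLD
```

A run of “x” characters contributes almost nothing to the entropy, while a real cluster hostname with a random segment scores well above the threshold.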

Once correct matches for host, username and password are gathered, we compute all possible combinations to have secret candidates. Although we could stop here, GitGuardian has a final weapon to clean things up and yield only the most relevant results. We have developed various strategies to narrow down the possible secret candidates within a single document. These strategies can be based on the distance between matches in the document, but also on the similarity between assignment variables for instance.

In other words, this avoids raising an alert when we find a host at the beginning of the document, a username in the middle of it, and a password at the end, as these three values are most probably not related to the same resource! It also allows us to discard a secret where the host is assigned to a mongo_host variable, whereas the username and password are assigned to exotic_db_user and exotic_db_password respectively.
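The distance-based strategy could be sketched as follows; the tuple layout `(offset, variable, value)` and the 500-character window are assumptions for illustration:

```python
from itertools import product

MAX_SPAN = 500  # assumption: max distance in characters between related matches

def candidates(hosts, users, passwords, max_span=MAX_SPAN):
    """Yield (host, user, password) triples whose matches sit close together.

    Each input match is an (offset, variable_name, value) triple.
    """
    for h, u, p in product(hosts, users, passwords):
        offsets = [h[0], u[0], p[0]]
        # Discard combinations whose matches are scattered across the document.
        if max(offsets) - min(offsets) <= max_span:
            yield (h[2], u[2], p[2])
```

Applying this filter before any further validation keeps the number of secret candidates from exploding combinatorially.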

In the end, this detector catches 30 secrets each week with very high precision. Thanks to our live models and feedback loops, we are able to quickly iterate to refine results.

MongoDB credentials in URI connection strings

URI Pattern

The second type of authentication that appeared to be very common is the use of URI connection strings to provide a host, username and password (and even port, database or query) to a driver.

As the URI connection string format is standardized for MongoDB, the detector to develop is rather straightforward. We are looking for a string with the following format:

scheme://username:password@host[:port][/database[?query]]

In our case, the scheme we are looking for usually starts with “mongodb”, and the overall regular expression ensures a rather high level of confidence.
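An illustrative regular expression for this connection string format could look like the following; the character classes are assumptions, not the ones used in production:

```python
import re

# Sketch of a matcher for scheme://username:password@host[:port][/database[?query]]
MONGO_URI_RE = re.compile(
    r"mongodb(?:\+srv)?://"          # scheme: mongodb or mongodb+srv
    r"(?P<username>[^:@/\s\"']+)"    # username: stops at ':'
    r":(?P<password>[^@/\s\"']+)"    # password: stops at '@'
    r"@(?P<host>[^:/?\s\"']+)"       # host
    r"(?::(?P<port>\d+))?"           # optional port
    r"(?:/(?P<database>[^?\s\"']*))?"  # optional database
    r"(?:\?(?P<query>\S+))?"         # optional query
)
```

Because the whole structure (scheme, separators, ordering) is standardized, a single match already carries a lot of confidence.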

Post-validation

Without any validation step, the results still need a bit of work:

  • 20% of results have a password equal or close to “password”
    mongodb://root:<password>@......

  • Almost 10% of values are linked to “localhost” or a local IP address
    mongodb+srv://root:123456@localhost:27017

  • And 2-3% contain “xxx” or other forms of placeholders
    mongodb+srv://a_super_user:${process.env.PASS}@host.gcp.mongodb.net/project
    mongodb+srv://a_super_user:' + process.env.MONGO_ATLAS_PW + '@host.mongodb.net/main_db
    mongodb+srv://a_super_user:"+str(os.getenv("MONGO_PASSWORD"))+"@host.mongodb.net/prod_db

We thus introduce an ad-hoc post-validation step, in the same fashion as presented for MongoDB assignments. In the end, this detector matches more than 2,950 MongoDB credentials every week, again with a high level of precision!

MongoDB credentials in shell commands

Details matter

The last authentication method we wanted to tackle is the connection to a MongoDB instance using the command-line interface. Of course, we are aware that the number of credentials found with this method will be much lower than with the other methods, as committing shell commands to a git repository is somewhat rare.

However, watching all of GitHub for 3 years teaches some basic rules, one of which could be: “everything happens”! What’s more, even if this implementation improves our recall by only 0.1%, we want it! GitGuardian deals with rare events by definition, and we cannot afford to miss true positives (TP).

For this, we developed a specific type of detector that looks for a given command and captures all its relevant options.

mongo --username randomman --password w@ri0rors123 --host mongodb.mywebsite.com
mongo -u randomman -pw@ri0rors123 --host mongodb.mywebsite.com
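A simplified sketch of such an option-capturing parser (the flag aliases come from the examples above; the real detector handles a much richer command grammar):

```python
import shlex

# Map CLI flag spellings to the credential field they carry.
FLAGS = {
    "--username": "user", "--user": "user", "-u": "user",
    "--password": "password", "-p": "password",
    "--host": "host",
}

def parse_mongo_command(line: str):
    """Extract user/password/host from a mongo-like shell command, or None."""
    tokens = shlex.split(line)
    if not tokens or not tokens[0].startswith("mongo"):
        return None
    creds = {}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in FLAGS and i + 1 < len(tokens):
            creds[FLAGS[tok]] = tokens[i + 1]
            i += 2
        elif tok.startswith("-p") and not tok.startswith("--") and len(tok) > 2:
            creds["password"] = tok[2:]  # attached value, e.g. -pw@ri0rors123
            i += 1
        else:
            i += 1
    # Only a complete (user, password, host) triple counts as a secret.
    return creds if {"user", "password", "host"} <= creds.keys() else None
```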

Regarding post-validation, we follow almost the same steps as for the previous detectors.

Eventually, this detector matches a set of MongoDB credentials once every two days on average. This may seem like a drop in the ocean… but it may be the poisonous one that causes a lot of trouble… so we might as well catch it!

Conclusion

To wrap it all up, we’ve presented in great detail how our detectors work, and we showed you a part of our wide collection of tools used to achieve the best detection on the market. The development of new detectors is mainly based on two core principles: we always iterate from broad to precise detectors by multiplying benchmarks on real-world data, and we constantly improve our detection algorithms by monitoring their live performance. Our Secrets Team proudly reached the 260-detector mark a few days ago, check out the list here!