2022 has been the year of source code leaks; Microsoft, Nvidia, Samsung, Rockstar, and many more companies have had their source code involuntarily open-sourced. But some new research by CyberNews has revealed that there are millions of private git repositories that are, in fact, not all that private. In this article, we will take a look at the research on exposed git repositories, review why this can be such a problem, and suggest what you can do differently.

Nearly 2 million exposed git repositories

Git is a technology that nearly all software developers use to collaborate on and version-control their software. You will likely be familiar with git repository hosts like GitHub, Bitbucket, or GitLab, which all offer turnkey solutions to sign up and start pushing code to your own repositories and collaborating with others. Git can be a tricky technology, and it is prone to user errors that can result in sensitive information being exposed. For example, when you create a new git repository on your machine, a .git folder is created. This folder contains all the information and metadata about your project since it was created. If you made an edit to your application in 2012, that edit is still sitting in the .git folder 10 years later. If you committed an API key on a development branch 3 years ago, it's still inside this .git folder. Basically, unless you are certain that neither you nor anyone on your team has ever committed anything remotely sensitive, which would be remarkable if true, this is likely a very sensitive folder.
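To see why old content lingers, it helps to look at how git keeps file contents. The sketch below is a simplified imitation, not git itself: it writes "loose" blob objects the way git does (a `blob <size>` header plus the content, hashed with SHA-1 and zlib-compressed under `objects/`), then shows that an earlier version containing a key is still readable after the working file has been cleaned up. The file name `settings.py` and the key value are hypothetical, and real repositories also use packfiles, which this sketch ignores.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(git_dir: Path, content: bytes) -> str:
    """Store content as a loose git blob object and return its object id."""
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


def read_blob(git_dir: Path, object_id: str) -> bytes:
    """Read a loose blob object back and strip its header."""
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    store = zlib.decompress(path.read_bytes())
    _header, _, content = store.partition(b"\0")
    return content


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    work_file = Path(tmp) / "settings.py"  # hypothetical file name

    # First version: a (hypothetical) key is written and "committed".
    secret_version = b'API_KEY = "hypothetical-key-123"\n'
    work_file.write_bytes(secret_version)
    old_id = write_blob(git_dir, secret_version)

    # Later version: the key is removed from the working file.
    cleaned_version = b'API_KEY = os.environ["API_KEY"]\n'
    work_file.write_bytes(cleaned_version)
    write_blob(git_dir, cleaned_version)

    # The old object is untouched: anyone who can read .git can read it.
    recovered = read_blob(git_dir, old_id)
    print(recovered.decode(), end="")
```

The working file no longer contains the key, but the object written for the earlier version does, which is exactly why a downloadable .git folder is so sensitive.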

Research Results

New research from CyberNews has uncovered that a huge number of these .git folders are not only hosted remotely but also publicly accessible.

In a study to determine how many self-hosted git repositories were, in fact, unintentionally public, they discovered a shocking 1,930,000 repositories that were remotely accessible.

Exposed .git folders by country source
“Having public access to the .git folder could lead to the exposure of the source code. Tools required to get parts or full source code from the .git folder are free and well-known, which could lead to many more internal leaks or easier access to the system for a malicious actor,”
Martynas Vareikis, a researcher at Cybernews, said.

CyberNews isn’t alone in research like this. A smaller experiment was conducted by SDCat, a popular technology and security blogger, who scanned 2.6 million domains to find git repositories and discovered:

  • 1053 fully or partially exposed git repositories
  • 12 usernames with passwords in the git config data

These research projects, plus the countless related breaches we have had in the last two years, show what a huge issue this truly is.

"Even after I parallelized the scanning script it took some days to scan the 2.6 million domains. I did not expect many results, but was surprised how widespread the problem is."
SDCat

How do .git folders become exposed?

There are many ways a .git folder can end up somewhere it was never intended to be. It could be a misconfigured backup or an attempt to host your own git server, but usually it is a deployment issue. One example that occurred multiple times involved static websites: someone using an Amazon S3 bucket to host their site uploaded the entire project directory, including the .git folder, instead of only the current version of the site. To anyone who understands how sensitive these folders are, this will seem unlikely and shocking, but it happens: nearly 2 million times that we know of.
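One way to guard against this kind of deployment mistake is to build the upload list explicitly rather than copying a whole directory. The following is a minimal sketch under assumptions: the function name `files_to_deploy` and the `EXCLUDED_NAMES` set are hypothetical, and the actual upload step (to S3 or anywhere else) is deliberately left out.

```python
import os
import tempfile
from pathlib import Path

# Assumption: names that must never be published; extend as needed.
EXCLUDED_NAMES = {".git"}


def files_to_deploy(root: Path) -> list[Path]:
    """Return files under root (relative paths), skipping excluded names."""
    selected = []
    for current, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded folders.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_NAMES]
        for name in filenames:
            if name in EXCLUDED_NAMES:  # .git can also be a plain file
                continue
            selected.append((Path(current) / name).relative_to(root))
    return sorted(selected)


with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    # Build a tiny fake site that also contains a .git folder.
    for relative in ("index.html", "css/site.css", ".git/config", ".git/HEAD"):
        target = site / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")

    deployable = [p.as_posix() for p in files_to_deploy(site)]
    print(deployable)
```

Only `css/site.css` and `index.html` end up in the list; nothing under `.git` is ever handed to the upload step.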

Why is exposing git repositories so problematic?

Git repositories are not designed to contain any sensitive information. They are designed to enable collaboration and sharing of source code between developers and sometimes the community at large. While source code can be a valuable asset to companies, arguably the most valuable asset a company has, source code in itself is often not so valuable to other parties, and well-designed applications shouldn’t become vulnerable just because their source code is exposed. So why, you might ask, is it such a problem when source code leaks? The answer is that source code often contains sensitive information that should not be there. Secrets like API keys, security certificates, and other credentials are very often exposed in source code. You can read lots more about this specific problem in some other blogs (......) but here are the key stats:

  • An average-sized company with 400 developers will have 13,000 secrets (1,000 unique) inside their private repositories
  • GitGuardian scanned all public GitHub repositories and found over 6,000,000 secrets in 2021
  • 3 out of every 1000 commits GitGuardian scanned contained at least one secret
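To make concrete what a "secret in source code" looks like to a scanner, here is a deliberately small sketch. It is not how GitGuardian or any other product works; the two patterns, the function name `find_secrets`, and the sample text are illustrative assumptions. Real detectors use far more patterns plus validation to reduce false positives.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
SECRET_PATTERNS = {
    # Shape of an AWS access key ID: "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A suspicious name assigned a quoted value of 8+ characters.
    "generic_assignment": re.compile(
        r"""(?:api[_-]?key|secret|token|password)\w*\s*[:=]\s*["'][^"']{8,}["']""",
        re.IGNORECASE,
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# Hypothetical file content; the key values are documentation-style fakes.
sample = '''import requests
API_KEY = "hypothetical-0123456789abcdef"
aws_id = "AKIAIOSFODNN7EXAMPLE"
timeout = 30
'''

print(find_secrets(sample))
# [(2, 'generic_assignment'), (3, 'aws_access_key_id')]
```

To cover history rather than just the current files, a scanner would have to run over every version of every file in the repository, which is why the problem grows with the age of the project.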

There is a perfect storm resulting from the fact that git allows such easy collaboration between developers, that secrets are meant to be used programmatically, and that a git history never dies. While the research project by CyberNews didn't scan each repository for secrets in depth, they did find that 6% of the git repositories had their deployment credentials in the git configuration file... I’m going to say that again, slowly.

> 6% of the exposed git repositories had the credentials to deploy their applications, publicly accessible to the world, in the configuration file!

Screenshot of a configuration file with deployment credentials
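A common way credentials end up in a git configuration file is a remote URL of the form `https://user:password@host/...`. The sketch below checks a config file's text for that pattern. It is an assumption-laden illustration: the function name, the sample config, the host `git.example.com`, and the token are all hypothetical, and the study's own detection method is not described in this article.

```python
import re
from urllib.parse import urlsplit

# Matches lines such as "    url = https://user:pass@host/repo.git".
URL_LINE = re.compile(r"^[ \t]*url[ \t]*=[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def urls_with_credentials(config_text: str) -> list[str]:
    """Return remote URLs that embed a password, with the password masked."""
    flagged = []
    for match in URL_LINE.finditer(config_text):
        parts = urlsplit(match.group(1))
        if parts.password:
            flagged.append(
                f"{parts.scheme}://{parts.username}:***@{parts.hostname}{parts.path}"
            )
    return flagged


# Hypothetical .git/config content.
sample_config = """[core]
    repositoryformatversion = 0
[remote "origin"]
    url = https://deploy-bot:hypothetical-token@git.example.com/acme/site.git
    fetch = +refs/heads/*:refs/remotes/origin/*
[remote "mirror"]
    url = git@git.example.com:acme/site.git
"""

print(urls_with_credentials(sample_config))
# ['https://deploy-bot:***@git.example.com/acme/site.git']
```

The SSH-style remote is not flagged because it carries no password; the HTTPS remote is, and in an exposed .git folder that line is readable by anyone.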

Often companies ignore the huge problem of exposed credentials inside git repositories because they hide behind the argument that the code is private and therefore shouldn’t be exposed. Recent history tells us that this isn’t the case.

Last year, Twitch’s git repositories were found to have a configuration error that made them publicly accessible (similar to what the CyberNews study discovered), which led to the entire source code for all their projects (even the secret ones) being exposed, along with about 6,600 secrets.

Even governments aren’t immune to this problem. The Indian government had a massive breach in which hundreds of exposed git servers revealed huge amounts of sensitive files, including security certificates and even police reports. All because private git repositories weren’t actually private.

Source code nearly always contains more than just source code. Sensitive information hides in the history of a project and on often-forgotten development branches. This is why, even though source code might not be considered a security-critical asset, it needs to be protected, and this is why private code repositories that are actually public are such a big concern.

What can we do?

The answer is obviously to make sure our git repositories are private, right?

Well, not quite. This research adds to the compelling pile of evidence that git repositories are not appropriate places to keep sensitive information. If your git repositories are protected, it becomes harder, but not impossible, for a bad actor to gain access to them. In 2021, the supply chain attack on CodeCov meant bad actors got access to up to 20,000 CodeCov users' private git repositories, including those of HashiCorp, Twilio, and Rapid7, even though these were never exposed publicly. We have also seen companies like Uber have their repositories breached due to a compromised developer account. The point is that repositories have been exposed to bad actors as a weak point in our infrastructure, and we need to secure them in more than one way:

  1. Make sure they are private, segmented between critical and noncritical projects, and that developers have two-factor authentication (2FA) enabled.
  2. Ensure our repositories don’t contain sensitive information like secrets in their histories through automated secrets detection and scanning. Tools like GitGuardian help a lot here and can be used on both hosted and self-hosted repositories.
  3. Scan our domains and infrastructure to reveal whether we have exposed .git repositories and other critical infrastructure. You can scan your domains and subdomains with many tools, such as Amass or dirsearch, to name a couple.
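For the third point, the basic check is simple enough to sketch: request `/.git/HEAD` on a site you own and see whether the response looks like a real git HEAD file (either `ref: refs/...` or a 40-character commit hash). This is an illustrative sketch, not a replacement for tools like Amass or dirsearch; the function names and the example hosts are hypothetical, and the demo uses a fake fetch function so that no network request is made.

```python
import re
import urllib.request
from typing import Callable, Optional

# A HEAD file holds either a symbolic ref or a bare 40-hex commit id.
# (Assumption: SHA-1 repositories; SHA-256 repositories use 64 hex digits.)
HEAD_PATTERN = re.compile(r"^(ref: refs/\S+|[0-9a-f]{40})\s*$")


def looks_like_git_head(body: str) -> bool:
    """True if the text has the shape of a .git/HEAD file."""
    return bool(HEAD_PATTERN.match(body))


def fetch_text(url: str, timeout: float = 5.0) -> Optional[str]:
    """Fetch the start of a URL's body, or None on any request failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(200).decode("utf-8", errors="replace")
    except (OSError, ValueError):  # covers HTTP errors, timeouts, bad URLs
        return None


def git_folder_exposed(
    base_url: str, fetch: Callable[[str], Optional[str]] = fetch_text
) -> bool:
    """Check whether base_url serves a readable /.git/HEAD file."""
    body = fetch(base_url.rstrip("/") + "/.git/HEAD")
    return body is not None and looks_like_git_head(body)


# Demo with canned responses instead of real requests (hypothetical hosts).
fake_responses = {
    "https://leaky.example/.git/HEAD": "ref: refs/heads/main\n",
    "https://safe.example/.git/HEAD": "<!DOCTYPE html><title>Not found</title>",
}
fake_fetch = fake_responses.get

print(git_folder_exposed("https://leaky.example/", fetch=fake_fetch))   # True
print(git_folder_exposed("https://safe.example", fetch=fake_fetch))     # False
print(git_folder_exposed("https://missing.example", fetch=fake_fetch))  # False
```

Checking the body rather than just the status code matters because many servers answer unknown paths with a custom error page and a 200 status. Run a check like this only against domains you own or are authorized to test.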

Summary

While there is already plenty of evidence that git repositories are high-value targets for adversaries, this research adds that these repositories are easily accessible to attackers who scan domains and IP addresses for .git folders. Yes, we must better protect these repositories and scan our own infrastructure for exposed weaknesses, but we must also ensure, as a minimum effort for security, that sensitive data like secrets is not in our repositories.