Two years ago, GitGuardian conducted a study to see how many Docker images contained secrets such as private keys, API keys, and other credentials. In that study, we looked at 10,000 Docker images and found that nearly 5% (4.6% to be exact) contained plain text secrets. Now a new, more comprehensive academic study has found that the problem is in fact much worse! Researchers at RWTH Aachen University in Germany studied over 300,000 Docker images and found that 8.5% of them contained plain text credentials such as API keys and private keys.

Docker Hub and Docker images

There is plenty of information explaining container technology and how we use it, but as a recap, a container is a modern way of shipping a software application. With a container you don’t just ship the application, you ship all the components the application needs to run alongside it. What is important to understand is that a Docker image contains all the code your application needs to run and is built from a file called a Dockerfile, a set of instructions for creating the environment your application runs in. All caught up now? If not, check out this helpful guide.
Lastly, this study mostly looked at Docker images on Docker Hub, the largest service for storing and sharing Docker images, with more than 9,000,000 images hosted.

About the study and findings

Researchers at RWTH analyzed 337,171 Docker images to discover secrets hidden inside. Other similar studies have been done, including by GitGuardian, but this is the largest study to date to analyze full image contents. Some other studies have reviewed a higher volume of images but did not analyze the entire contents, instead opting to look at a single element, for example the Dockerfile. What the researchers found was quite astonishing: 8.5% of the images contained secrets. The secrets fell into these categories:

  • 52,107 private keys
  • 2,920 cloud keys
  • 213 social media keys
  • 25 keys for financial tools (Stripe, Square, PayPal Braintree, etc.)
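
As a rough illustration of what scanning an image's full contents (rather than only its Dockerfile) can involve, here is a minimal sketch. It assumes a single image layer is already available in memory as a tar archive. The function name `find_private_keys_in_layer` and the header pattern are illustrative assumptions, not the researchers' tooling, and a header match only marks a candidate that still needs validation.

```python
import io
import re
import tarfile

# PEM header shared by common private key formats (illustrative, not exhaustive).
PRIVATE_KEY_HEADER = re.compile(
    rb"-----BEGIN (?:RSA |EC |DSA |OPENSSH |ENCRYPTED )?PRIVATE KEY-----"
)


def find_private_keys_in_layer(layer_bytes: bytes) -> list[str]:
    """Return paths of regular files in a layer tarball that contain a private key header."""
    hits = []
    # "r:*" lets tarfile detect compression (or the lack of it) on its own.
    with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r:*") as layer:
        for member in layer:
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            extracted = layer.extractfile(member)
            if extracted is None:
                continue
            if PRIVATE_KEY_HEADER.search(extracted.read()):
                hits.append(member.name)
    return hits
```

Reading every file fully into memory keeps the sketch short; a scanner for very large layers would more likely read in chunks.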

How accurate is the study?

The researchers went to great lengths to explain their method of validating both private keys and API keys. This includes filtering private keys against the kompromat list to exclude known test keys and previously compromised secrets. The team also did in-depth filtering on API keys to remove obvious false positives. The best way to filter API keys is to validate them with their providers; in this study, however, ethics restrictions didn’t allow the researchers to make calls against providers. Instead, they relied on a series of post-discovery filters, including keyword checks, to decide whether a key was likely a false positive.
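
To make the idea of keyword-based post-discovery filtering concrete, here is a minimal sketch. It is not the study's actual rule set: the keyword list and the function name `filter_candidates` are assumptions for illustration, and the pattern only covers the well-known shape of AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits).

```python
import re

# Candidate pattern: AWS access key IDs ("AKIA" + 16 uppercase alphanumerics).
CANDIDATE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Hypothetical keywords that suggest a placeholder rather than a real secret.
PLACEHOLDER_KEYWORDS = ("example", "test", "dummy", "sample", "fake")


def filter_candidates(text: str) -> list[str]:
    """Return candidate keys whose surrounding line has no placeholder keyword."""
    kept = []
    for line in text.splitlines():
        context = line.lower()
        for match in CANDIDATE.finditer(line):
            if any(word in context for word in PLACEHOLDER_KEYWORDS):
                continue  # likely documentation or test data
            kept.append(match.group())
    return kept
```

A filter like this trades recall for precision: a real key that happens to sit on a line mentioning "test" would be dropped, which is one reason the results stay probabilistic.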

Detecting secrets is unfortunately probabilistic. Because not every key found can be tested, you can never say for certain that 100% of the keys are valid; likewise, you cannot say that 100% of the keys were actually found. Given the methods outlined in the study, it is reasonable to conclude that the majority of keys were real and valid at the time of leaking.

How do keys get leaked?

One of the most interesting aspects of this study was the section “Origin of leaked secrets”. Most studies into vulnerabilities stop at finding the vulnerability and do not review how it made it into the project in the first place. The study found distinct differences between how private keys were leaked and how API keys were leaked: “Most API secrets are typically inserted by file operations (File), e.g., copied from the image creator’s host system, private keys are predominantly included by executing a command within the Dockerfile”.

In plain English, this means API keys are more likely to end up in the Docker image through files copied into it, such as source code with hard-coded keys, whereas private keys are more likely to be created by commands that the Dockerfile runs during the build process. The researchers also gave concrete examples of exactly how private keys, in particular, were added to the Docker image: “30 % of private keys were generated in layers where image creators installed the OpenSSH server. Since the installation triggers ssh-keygen to generate a fresh host key pair, it is automatically included in the image”.
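
The two origins can be sketched as a simple Dockerfile check. This is a hedged illustration, not the study's method: the function name `flag_instructions` and the list of key-generating commands are assumptions, and the sketch ignores line continuations. It labels COPY and ADD instructions as "File" (content copied from the build host) and RUN instructions that install the OpenSSH server or call a key generator as "Exec".

```python
import re

# Commands that create key material as a side effect (illustrative, not exhaustive).
KEY_GENERATING = re.compile(
    r"\b(ssh-keygen|openssh-server|openssl\s+(genrsa|genpkey|req))\b"
)


def flag_instructions(dockerfile_text: str) -> list[tuple[int, str]]:
    """Return (line number, origin) pairs for instructions that may add secrets."""
    flagged = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # blank line or instruction without arguments
        keyword, arguments = parts[0].upper(), parts[1]
        if keyword in ("COPY", "ADD"):
            # Files from the build host may carry hard-coded API keys.
            flagged.append((number, "File"))
        elif keyword == "RUN" and KEY_GENERATING.search(arguments):
            # The command may generate a private key inside the layer.
            flagged.append((number, "Exec"))
    return flagged
```

A flagged line is only a place worth reviewing; whether a secret actually lands in the image depends on what is copied or generated.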

Could these keys be used in the wild?

Private keys are hard to validate because, unlike API keys, there isn’t a simple way to call a service and check whether the key is valid. One of the most fascinating results of the research was that the researchers showed many of the private keys found could in fact be used in real life.

Without delving too far into the private key rabbit hole, a private key is mathematically linked to a public key, and together they are used to create secure connections between services and systems. The researchers in this case used the exposed private keys to derive the corresponding public keys and matched them against public keys in use on the internet. A match means that the researchers, or an attacker, would potentially be able to launch a number of different attacks on that service, including eavesdropping, relaying data, or altering the sensitive data transmitted. The team found a whopping 275,269 hosts that were using compromised private keys for secure connections. Even worse, they found an insane number of sensitive services using these keys, including:

  • 8,674 MQTT and 19 AMQP hosts (potentially transferring privacy-sensitive IoT data),
  • 6,672 FTP,
  • 426 PostgreSQL,
  • 3 Elasticsearch, and
  • 3 MySQL instances serving potentially confidential data
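
The matching step described above can be sketched for SSH keys as follows. This is a simplified illustration under stated assumptions, not the researchers' pipeline: it assumes the public key has already been derived from the leaked private key (for SSH keys, `ssh-keygen -y -f <keyfile>` prints it), and the function names, host names, and scan data are hypothetical. It compares keys by their OpenSSH-style SHA256 fingerprint, which is the unpadded base64 of the SHA-256 hash of the public key blob.

```python
import base64
import hashlib


def openssh_fingerprint(public_key_line: str) -> str:
    """Compute the OpenSSH-style SHA256 fingerprint of a public key line.

    The line has the form "<type> <base64 blob> [comment]".
    """
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the base64 digest without trailing "=" padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def find_reused_keys(leaked_public_keys, observed_hosts):
    """Return hosts whose fingerprint matches a key derived from a leaked private key.

    leaked_public_keys: iterable of public key lines derived from leaked private keys
    observed_hosts: mapping of host name -> fingerprint seen in scan data (hypothetical)
    """
    leaked = {openssh_fingerprint(line) for line in leaked_public_keys}
    return sorted(host for host, fp in observed_hosts.items() if fp in leaked)
```

Because only public key material is compared, a check like this never needs to connect to a host using the leaked private key.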

Most surprising results from the study!

This study mostly looked at public Docker images, that is, images shared on Docker Hub with the intention that they would be available to anyone. But it also looked into 28,621 images that were hosted in private container stores. I, like most people I spoke to, would have assumed that a private Docker image, i.e. one not meant to be accessed without authorization, would contain more secrets than a public one. This is what studies comparing private and public git repositories have found, for example. The most surprising part of this study was that public Docker images were MORE likely to contain secrets than private ones. Of course the sample of private images was smaller, but the difference was quite significant: 9% of public Docker images contained plain text secrets, compared with only 6% of the private ones.

Wrap-up

The researchers at RWTH Aachen University did a great job not just conducting a comprehensive study but also explaining their methods and validation processes. While we already knew Docker images contained plain text secrets, the complexity of scanning these sometimes very large artifacts meant we never had such a conclusive and comprehensive study. The study not only demonstrates how widespread leaked secrets in Docker images are but also shows that these secrets, in particular private keys, can be used in the wild by attackers for various malicious activities.