Hunting for secrets in Docker Hub: what we’ve found
Henri Hubert
École Polytechnique | Engineering Graduate specializing in Mathematics and Computer Science
Lead Engineer | GitGuardian | Team Secrets
Source code is tightly linked to secret sprawl, but unfortunately, this is not the only origin of sensitive information leaks.
Security teams looking to secure an application's entire perimeter need to consider all possible sources where sensitive information like secrets could leak. Today we will dive deeply into an often-overlooked source: Docker images.
In this article, we will explain why Docker images can contain sensitive information and give some examples of the type of secrets we find in public Docker images. Finally, we will compare our results to the ones we have with source code scanning.
Reminder on Docker images
First of all, what is a Docker container? Simply put, a container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
Now, what is a Docker image? A container uses an isolated file system. This file system is described by a Docker image and contains everything required to run the application: dependencies, source code, binaries, environment variables and some metadata.
However, an image contains more information than just the current state: Docker images are built as a stack of modifications (much like commits in a VCS), and from an image it is possible to retrieve each of the previous steps and the modifications applied. SRE teams mainly use Docker for portability and easy software deployment.
A Docker image is built using a set of instructions and configuration grouped in a file called a Dockerfile.
Dockerfile example:
# first layer
FROM ubuntu:18.04
# second layer
COPY . /app
# third layer
RUN make /app
CMD python /app/app.py
Why do we find sensitive information in Docker images?
Where do the secrets come from?
First of all, Docker images contain source code, and this code is likely to contain secrets. Source code can be scanned in its Version Control System (VCS) using tools like GitGuardian, but this is not sufficient to protect the entire perimeter from secret sprawl: the code packaged in the image may be altered by the publisher or the publishing process, so an image can contain secrets that never appear in the VCS. Ultimately, this means that Docker images can bypass the security checks that are in place. An example would be a developer building an image from a local project with unpublished changes (such as files listed in the .gitignore) and then publishing the image.
The second source of secrets is the Dockerfile itself. Secrets can be added through the Dockerfile, either directly in the file or by adding a file containing secrets, because they are needed for the build or for operations: credentials to access a package manager, API keys used by the application wrapped in the image, etc. Most Docker images require credentials, and since Docker images are meant to run on any machine, it may seem reasonable at first to include the secrets as well.
Finally, a Docker image is built from stacked layers that are applied in order to form the current state of the image. This structure is very prone to leaks because a later layer can hide a secret added by a previous one: the secret is no longer visible in the final state, yet it is still stored in the image. Moreover, unlike source code, Docker image layers are rarely reviewed.
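To make the layer problem concrete, here is a minimal Python sketch that simulates two layers as in-memory tar archives. It assumes the usual layer convention in which a deletion is recorded as a whiteout entry named ".wh.<file>"; the file names and the secret value are hypothetical, and opaque whiteouts are not handled.

```python
import io
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive standing in for one image layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def layer_files(layer: bytes) -> dict[str, bytes]:
    """Return {path: content} for every regular file in a layer archive."""
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def final_state(layers: list[bytes]) -> dict[str, bytes]:
    """Apply layers in order, treating ".wh.<name>" entries as deletions."""
    fs: dict[str, bytes] = {}
    for layer in layers:
        for path, data in layer_files(layer).items():
            directory, _, base = path.rpartition("/")
            if base.startswith(".wh."):
                deleted = base[len(".wh."):]
                target = f"{directory}/{deleted}" if directory else deleted
                fs.pop(target, None)
            else:
                fs[path] = data
    return fs


# Layer 1 copies a credentials file; layer 2 "deletes" it with a whiteout.
layer1 = make_layer({"app/.env": b"API_KEY=hypothetical-secret-value"})
layer2 = make_layer({"app/.wh..env": b""})

visible = final_state([layer1, layer2])  # what a running container sees
still_stored = layer_files(layer1)       # what anyone pulling the image can read

print("app/.env" in visible)       # False
print("app/.env" in still_stored)  # True
```

The file is gone from the final file system, but the first layer still ships with the image, so scanning only the final state would miss it.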
What is the problem?
Just like source code, Docker images are likely to be published in shared repositories, either publicly on hub.docker.com or in a company registry, and are therefore accessible to users who should not have access to the secrets they contain.
A recent example of a supply chain attack carried out thanks to the discovery of a credential in a Docker image is Codecov. The Codecov Docker image contained git credentials that allowed an attacker to gain access to Codecov’s private git repositories and insert a backdoor in their product, which would later affect a huge number of Codecov's 22,000 users.
There are many ways an attacker could abuse credentials. You can read about another attack scenario in the series Thinking Like a Hacker.
For all those reasons, we decided to test and implement a dedicated secret scanner to find secrets in Docker images.
Methodology used to discover secrets
As explained above, secrets are likely to be embedded in images in several places and at different stages of the build.
When building a Docker image from scratch, most of the layers consist of the installation of tools such as Debian or language-specific packages. These are not the layers containing secrets.
Layers most likely to contain secrets are either the ones where files are manually added or copied, or the ones where environment variables are modified. Fortunately, Docker images contain a manifest file that describes all the different operations performed to build the image. This manifest can be used to select the layers that are related to custom commands from the user. Doing so has one major benefit: the scan is much faster. We don't need to scan all Debian files for each image; we only scan what needs to be scanned.
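The selection step could look roughly like the Python sketch below. It is not GitGuardian's implementation: it assumes the build operations are available as a list of history entries with "created_by" and "empty_layer" fields, as in the image configuration JSON, and that the layer archives are listed in the same order. The function name, the sample commands, and the layer paths are hypothetical.

```python
def select_custom_layers(
    config: dict,
    layer_paths: list[str],
    instructions: tuple[str, ...] = ("COPY", "ADD"),
) -> list[tuple[str, str]]:
    """Pair each filesystem layer with the command that created it and keep
    only the layers whose command mentions one of the given instructions."""
    # History entries flagged "empty_layer" (ENV, CMD, ...) have no archive,
    # so the remaining entries line up one-to-one with the layer list.
    # ENV values would be read from the configuration, not from a layer.
    commands = [
        entry.get("created_by", "")
        for entry in config.get("history", [])
        if not entry.get("empty_layer", False)
    ]
    selected = []
    for path, command in zip(layer_paths, commands):
        # Plain substring match: a deliberately naive heuristic.
        if any(instruction in command for instruction in instructions):
            selected.append((path, command))
    return selected


# Hypothetical history for an image similar to the Dockerfile example.
config = {
    "history": [
        {"created_by": "/bin/sh -c #(nop) ADD file:base-rootfs in / "},
        {"created_by": "/bin/sh -c apt-get install -y python3"},
        {"created_by": "/bin/sh -c #(nop) ENV API_TOKEN=hypothetical", "empty_layer": True},
        {"created_by": "/bin/sh -c #(nop) COPY dir:app in /app "},
    ]
}
layer_paths = ["layer0/layer.tar", "layer1/layer.tar", "layer2/layer.tar"]

selected = select_custom_layers(config, layer_paths)
for path, command in selected:
    print(path, "<-", command.strip())
```

With this naive match, the package installation layer is skipped but the base layer is still selected, because base images are commonly created with an ADD of the root file system. A real filter would need an extra rule to exclude layers inherited from the base image.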
After selecting the relevant layers, we extract the files and the environment variables they include and send them to our scanner. For details on our scanning process, you can read our blog article about GitGuardian’s secret detection engine.
In order to test this method, we gathered 2,000 public images recently pushed to Docker Hub. We scraped the Docker Hub API to retrieve the latest publicly published images, pulled them, parsed them, and sent their files to our secret scanner. As expected, we did find secrets directly in the images.
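As a rough illustration of this step, the Python sketch below scans extracted files and environment variables with a single toy rule. It is only a stand-in for a real detection engine: the regular expression matches nothing but the header line of a PEM-encoded private key, environment variables are assumed to be "NAME=value" strings, and all sample paths and values are hypothetical.

```python
import re

# Toy detector: matches the header line of a PEM-encoded private key.
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----")


def scan_image(files: dict[str, bytes], env: list[str]) -> list[str]:
    """Return the locations (file paths or variable names) that match."""
    findings = []
    for path, content in files.items():
        # Binary files are decoded leniently; undecodable bytes are dropped.
        text = content.decode("utf-8", errors="ignore")
        if PRIVATE_KEY_HEADER.search(text):
            findings.append(f"file:{path}")
    for entry in env:
        name, _, value = entry.partition("=")
        if PRIVATE_KEY_HEADER.search(value):
            findings.append(f"env:{name}")
    return findings


# Hypothetical content extracted from the selected layers and the image config.
files = {
    "app/app.py": b"print('hello')\n",
    "root/.ssh/id_rsa": b"-----BEGIN RSA PRIVATE KEY-----\n(placeholder)\n",
}
env = ["PATH=/usr/bin", "DEPLOY_KEY=-----BEGIN OPENSSH PRIVATE KEY-----(placeholder)"]

findings = scan_image(files, env)
print(findings)
```

Reporting the location rather than the matched value keeps the secret itself out of logs, which is a reasonable default for this kind of tooling.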
In fact, 7% of the images (roughly 140 of the 2,000) contained at least one secret. The distribution of secrets is displayed in the following table, with a comparison to the results obtained with source code:
Key category | Share in Docker images (%) | Share in source code (%)
Other | 66.92 | 57.83
Private key | 22.19 | 2.76
Development tool | 1.8 | 3.97
Data storage | 1.58 | 6.44
Cloud provider | 1.08 | 9.21
Version control platform | 1.08 | 7.51
Messaging system | 0.64 | 4.59
Social network | 0.43 | 1.55
Payment system | 0.43 | 0.24
CRM | 0 | 3.53
Monitoring | 0 | 1.23
Collaboration tool | 0 | 0.84
Identity provider | 0 | 0.16
Cryptos | 0 | 0.07
The overrepresentation of the Other category is due to generic detectors such as our Generic High Entropy Secret detector. Those detectors improve our recall, but they don’t provide information on the secret provider. In the past few years, we have focused strongly on detecting specific secrets used by developers in source code. However, it seems that secrets embedded in Docker images are different and may be more related to internal services than what we are used to. This is why such a high proportion of secrets falls into this category.
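To give an idea of why a generic detector cannot name a provider, here is a minimal Python sketch of an entropy-based check. It is a textbook Shannon entropy computation, not GitGuardian's actual detector, and the length and entropy thresholds are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(value: str, min_length: int = 20, threshold: float = 3.5) -> bool:
    """Flag long strings whose characters look random (assumed thresholds)."""
    return len(value) >= min_length and shannon_entropy(value) >= threshold


print(shannon_entropy("aaaa"))  # 0.0, a single repeated character
print(shannon_entropy("abcd"))  # 2.0, four equally likely characters
print(looks_like_secret("passwordpasswordpassword"))  # False, repetitive
print(looks_like_secret("k3J9xQ2mZ7pL0vR8tY4w"))      # True, random-looking
```

A check like this only measures how random a string looks. It catches tokens from any service, including internal ones, but says nothing about where the token is valid, which is why such findings end up in the Other category.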
Differences between code and Docker images
The first insight is that we have a much higher proportion of private keys than in source code. This is because private keys are often used for system communication and authentication in containers rather than in applications.
We can also see that there is far less diversity in secret types than in public source code. There may be several reasons for that:
- Docker images are not used as a draft space as much as source code may be. Because of that, Docker images contain only the “final” state of a service. Developers push far fewer images than source code repositories: Docker claims it hosts more than 8.5M images, which is a drop in the ocean compared to the hundreds of millions of public repositories hosted by GitHub. The more volume there is, the more diversity you get, and this is especially true of leaks.
- Most public Docker images are bases for other applications or well-known open source software. Because of that, the packaged application often has very few links to the outside and is not built to communicate with dozens of services as modern applications do. While this is true for public images (the object of our study), private images need to package a final application and are therefore much more likely to expose secrets. Unfortunately, this analysis was made using only public images.
More surprisingly, we did not find as many secrets as in public source code: the ratio is close to 1 secret for every 500 files scanned in Docker images, half of what we usually find in public source code. In our opinion, this may be because users pushing Docker images are more concerned about security than developers in general.
During our investigation, we did not explore the criticality of the secrets. However, we think that the secrets found in Docker images are more sensitive because they are more likely to be related to infrastructure services.
Conclusion
Docker images, because of their structure and usage, are likely to contain hidden secrets. As our research shows, there are fewer secrets hidden in public Docker images than in source code on GitHub, but 7% of the public images still contain secrets. This remains a major security issue for companies that want to ensure proper secret management. While good development practices can reduce the margin of error when building images, automatic scanning can prevent many leaks. This is why GitGuardian recently released a Docker image scanner that can be used in CI to ensure that Docker images are free from secrets. You can find more information on how to scan your Docker images using gg-shield in this documentation.
If you want to know more about Docker security, we also summarized some of the best practices in a cheat sheet.