A few weeks back, the New York Times experienced a significant breach, with all its git repositories leaked via a torrent on the 4chan forum. We took a deep dive into the leak and found a huge number of secrets, like API keys, lurking in the code. In fact, over 4,000 unique secrets and at least 200 critical secrets were uncovered. In this article, we explore what we found, how we found it, and what it means.

💡
Disclaimer: Prior to publication, we responsibly disclosed our findings to the New York Times security team and consciously avoided any action that could have disrupted their ongoing incident response efforts (see Methodology below). The NYT has since confirmed its awareness of the leaks discussed in this article.
"The Times has been aware of this event since January and has made efforts to harden its security posture as part of its remediation, and we appreciate any engaged analysis from the cybersecurity community on how organizations can learn from incidents like this," a spokesperson for The Times told GitGuardian.

What was leaked? 

In total, over 5,600 repositories were leaked. If you are wondering how the New York Times had so many repositories, you will likely not be surprised that many of these were forked repositories used as dependencies. That being said, there were still a huge number of private repositories from The New York Times, including the viral game Wordle. 

So what is the risk of source code like this being leaked? There are many reasons why you don’t want your source code out in the open: it can open the door for competitors to create duplicates, and malicious actors can use it to look for vulnerabilities. But the worst result of a leak like this is all the secrets that can be exposed. In this context, when I refer to secrets, I mean digital authentication credentials like API keys, security certificates, and other credentials.

Previous data breaches that involved source code have frequently led to many of these secrets being exposed. In a similar breach, all of Twitch's source code was leaked, and over 6,000 secrets were exposed. It is therefore not all that surprising how many secrets were discovered within the New York Times' source code.

In our research, we discovered 4,875 unique secrets that were leaked by developers from the New York Times!

Number of secrets within source code - The chart excludes secrets for unknown services

I do want to point out that while no secrets should be leaked, the risk of each secret varies greatly, so not all of these secrets would lead to further risk. Additionally, the number of true secrets is likely to be lower due to the limitations of our research, specifically in two areas: ‘generic secrets’ and ‘validation of secrets’.

Generic Secrets: 2,477 of the secrets discovered were classified into the ‘Other’ category. These are secrets for unknown services: the pattern of the string and the surrounding code match the requirements for a secret, but we were unable to determine what the secret does. This category will have the highest number of false positives, but it is impossible to determine how many without significant investigation of each case.

Validation of secrets: The best way to remove false positives when discovering secrets is to validate the known secrets. Usually, this is done by making a non-intrusive API call to the service the secret is connected to. For example, if we find an AWS key, we will make an API call to AWS to validate it. In this case, we did NOT validate the secrets, because doing so can be perceived as malicious activity and we did not want to disrupt any ongoing cyberdefense operations by the New York Times. We also hope that by this stage most of the secrets have already been invalidated, which would have skewed the results anyway.
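
To make the idea of a non-intrusive validation call concrete, here is a minimal sketch for a GitHub token. It is a simplified illustration, not how ggshield implements validation, and nothing like it was run against the secrets in this leak. The choice of GitHub's read-only `GET /user` endpoint, the function names `http_status` and `check_github_token`, and the three-way result are assumptions made for the example. A check like this should only be run against credentials you own or are authorized to test.

```python
import urllib.error
import urllib.request

# Read-only endpoint that returns the authenticated user; it changes no state.
GITHUB_USER_URL = "https://api.github.com/user"


def http_status(url, headers):
    """Send a GET request and return only the HTTP status code."""
    request = urllib.request.Request(url, headers=headers, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we need.
        return err.code


def check_github_token(token, get_status=http_status):
    """Classify a candidate token as 'valid', 'invalid', or 'unknown'.

    `get_status` is injectable (hypothetical design choice) so the logic
    can be exercised without touching the network.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "User-Agent": "secret-validation-sketch",
    }
    status = get_status(GITHUB_USER_URL, headers)
    if status == 200:
        return "valid"
    if status == 401:
        return "invalid"
    # Anything else (rate limiting, server errors) tells us nothing certain.
    return "unknown"
```

Keeping an explicit "unknown" outcome matters in practice: a rate-limited or failed request should not be counted as proof that a secret is dead.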

So how bad was the breach? 

While the number of leaked secrets is alarming, it is not abnormal. I would even go so far as to say it is likely better than what we would see from an organization of similar size. That being said, there is certainly a lot of opportunity for attackers. At least 228 of the secrets discovered are what we would consider Critical Keys. These include items like HashiCorp Vault tokens, Auth0 keys, and AWS keys. There are also many keys to interesting services like SendGrid and Mailchimp, which could potentially be used to send mass emails to NY Times subscribers. However, because this source code was publicly shared, it is very likely that these keys have been revoked (at least we hope so), so the risk is reduced.

Critical Secrets

Methodology 

Scanning the entire codebase: The first step was to scan every git repository included in the codebase. To do this, we used our own ggshield, which scans for more than 300 types of secrets using entropy statistics and unique pattern matches with fine-grained rules. In total, after an initial scan, we discovered over 100,000 secret candidates and 131 different types of secrets. 
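
The two ideas mentioned above, pattern matching and entropy, can be illustrated with a small sketch. This is a toy example and not ggshield's detection engine: the generic assignment rule, the threshold of 3.5 bits per character, and the function names are illustrative assumptions. The `AKIA` prefix rule is a commonly used regular expression for AWS access key IDs, and the key in the usage lines is the placeholder from AWS's own documentation.

```python
import math
import re
from collections import Counter

# Specific detector: a commonly used pattern for AWS access key IDs.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

# Hypothetical generic detector: a secret-sounding name assigned a long quoted string.
GENERIC_ASSIGNMENT = re.compile(
    r"(?:key|secret|token|password)\w*\s*[:=]\s*['\"]([^'\"]{16,})['\"]",
    re.IGNORECASE,
)

ENTROPY_THRESHOLD = 3.5  # bits per character; illustrative value only


def shannon_entropy(text):
    """Return the Shannon entropy of `text` in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(line):
    """Return (detector_name, value) pairs for secret candidates in one line."""
    candidates = []
    for match in AWS_KEY_ID.finditer(line):
        candidates.append(("aws_access_key_id", match.group(1)))
    for match in GENERIC_ASSIGNMENT.finditer(line):
        value = match.group(1)
        # Low-entropy strings such as "aaaaaaaaaaaaaaaa" are unlikely to be keys.
        if shannon_entropy(value) >= ENTROPY_THRESHOLD:
            candidates.append(("generic", value))
    return candidates


print(find_candidates('aws_id = "AKIAIOSFODNN7EXAMPLE"'))
print(find_candidates('api_token = "aaaaaaaaaaaaaaaaaaaa"'))
```

The first call reports one `aws_access_key_id` candidate, while the second reports nothing because the repeated character has zero entropy. Generic rules like the second one are where most false positives come from, which is why real scanners layer many more checks on top.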

Removal of irrelevant data: Most of the repositories were forks of open-source tools, so we only included secrets that were committed by someone using a *@nytimes.com email address to make sure we didn’t include any irrelevant results. This reduced the number of secrets to a total of over 48,000 secrets under 113 different categories. It is likely that some relevant secrets were missed through this process.
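
As a rough sketch of this filtering step, assuming each finding carries the email address of the commit author: the `Finding` record, its fields, and the sample data below are hypothetical and made up for illustration, not the format ggshield produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one detected secret occurrence."""
    detector: str
    author_email: str
    repo: str


def committed_by_domain(findings, domain):
    """Keep only findings whose commit author email is exactly at `domain`."""
    suffix = "@" + domain.lower()
    return [
        f for f in findings
        if f.author_email.strip().lower().endswith(suffix)
    ]


# Made-up sample data.
findings = [
    Finding("aws_access_key_id", "dev@nytimes.com", "internal-service"),
    Finding("generic", "maintainer@example.org", "forked-library"),
    Finding("sendgrid_key", "Dev2@NYTimes.com", "newsletter-tools"),
]

kept = committed_by_domain(findings, "nytimes.com")
print([f.repo for f in kept])  # ['internal-service', 'newsletter-tools']
```

Matching on the full `@domain` suffix avoids accidentally keeping look-alike domains. The trade-off is the one noted above: a developer who committed from a personal address is filtered out along with the open-source maintainers.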

Removal of duplicates: Once we had our initial list of 48,000 secrets, we removed any duplicate secrets to end up with a total of 4,875 unique secrets committed by developers with a @nytimes.com email address. It is quite common for a secret to be duplicated many times once it ends up in a VCS like git, which explains the number of duplicates.
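
Deduplication of this kind can be sketched as grouping every occurrence under a fingerprint of the secret value. The function names and the sample data are assumptions for illustration. Hashing with SHA-256 is a design choice of this sketch so that the grouped report never has to store the raw secret.

```python
import hashlib
from collections import defaultdict


def fingerprint(secret):
    """Return a stable SHA-256 fingerprint so raw values need not be stored."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def group_occurrences(occurrences):
    """Map each unique secret fingerprint to every location where it appears."""
    grouped = defaultdict(list)
    for secret, location in occurrences:
        grouped[fingerprint(secret)].append(location)
    return dict(grouped)


# Made-up sample data: the same token copied across files and repositories.
occurrences = [
    ("example-token-1", "repo-a/config.yml"),
    ("example-token-1", "repo-a/config.yml.bak"),
    ("example-token-1", "repo-b/.env"),
    ("example-token-2", "repo-b/.env"),
]

unique = group_occurrences(occurrences)
print(len(occurrences), "occurrences ->", len(unique), "unique secrets")
# prints: 4 occurrences -> 2 unique secrets
```

Keeping the list of locations per secret, rather than discarding duplicates outright, is useful during remediation: revoking a key once is enough, but every copy still has to be cleaned up.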

How did the leak happen? 

The New York Times commented that “The underlying event related to last week’s posting occurred in January 2024 when a credential to a cloud-based third-party code platform was inadvertently made available”.

NYT later confirmed to GitGuardian that a publicly leaked GitHub token was the root cause of the source code breach. They also confirmed being alerted to this issue back in January.

This is not the first time a leaked GitHub token has been used to gain access to more sensitive data:

How Hackers Used Stolen GitHub Tokens to Access Private Source Code
Attackers have used stolen OAuth tokens issued to Travis CI and Heroku to gain access to private git repositories on GitHub. Here we take a look at exactly what happened, why it’s significant, and how to mitigate the issue.

Are there any other leaks? 

Running its security audit on the NYT domain, GitGuardian also found 634 secrets (268 confirmed valid) that were leaked in public GitHub repositories by New York Times developers, mostly on personal accounts. However, at the time of writing, these secrets had not been validated by the NYT.

Secrets leaked by NY Times developers in public GitHub repositories

For example, here is an Artifactory API Key belonging to the NYT leaked in a public repo in January:

Leaked Artifactory secret in public GitHub repository 

GitGuardian Security audit is a free tool that allows you to discover how many secrets your developers have leaked on public GitHub, both company-related and personal: check how many secrets have been leaked by your organization.

What are the risks? 

There are many potential risks associated with leaked private source code. The most immediate threat is hard-coded secrets: if these fall into the wrong hands, an attacker could impersonate legitimate users or systems, compromising data and services or even launching more sophisticated lateral attacks.

But the risks don't stop there. Closer examination of leaked code might reveal logic flaws or the use of insecure dependencies. Even the exposed application architecture itself could be valuable to attackers, potentially guiding them to hidden assets.

In this particular case, the New York Times seems to have acted swiftly to mitigate these risks. They stated:

“There is no indication of unauthorized access to Times-owned systems nor impact to our operations related to this event. Our security measures include continuous monitoring for anomalous activity.”

The NYT also directly confirmed to GitGuardian that they employed various monitoring tools, enabling their security teams to take defensive action and revoke any valid leaked credentials.

A more dire scenario could have unfolded if a malicious actor had quietly discovered access keys to the private repositories without publicizing them. This could have afforded them ample time to uncover leaked credentials and orchestrate more damaging attacks.

Lesson Learnt

Source code is very easy to leak. It is accessed by many developers and stored in multiple locations, such as the VCS, backups, and developers' machines, so it has a large blast radius. The biggest takeaway isn’t that the source code here was exposed; it is that we need to make sure we don’t have secrets exposed within our source code!