Tiexin Guo

OS Developer @Ubuntu
CNCF ambassador | LinkedIn

In some ways, software development has become both simpler and more complex.

It's simpler because we can, for example, create a minimal modern web application with just 5 lines of code after one command to install a single dependency. Gone are the days when we had to write all our code as a monolith that could run alone without anything else.

However, it's more complex for precisely the same reason: we only write 10% of the app or even less, and the rest comes from tons of direct and indirect dependencies.

And this gave rise to both supply chain management and supply chain attacks. In many cases, the easiest way for malicious actors to hack into our app, surprisingly, is not to hack the app itself, but rather to manipulate the app's software supply chain to inject malicious code.

Today, let's look at dependencies: What are dependency confusion attacks, how do they happen, why is dependency management hard, and how can we prevent them from happening?


Dependency Confusion Attacks

As software becomes more structurally complex, supply chain attacks have become increasingly common.

In recent years, a new type of supply chain attack emerged: the dependency confusion attack. In 2021, security researcher Alex Birsan managed to breach the internal systems of over 35 major companies, including Microsoft, Apple, PayPal, Shopify, Netflix, Yelp, Tesla, and Uber. The researcher himself called this type of vulnerability "dependency confusion".

What are dependency confusion attacks, then?

Simply put, a dependency confusion attack is a type of supply chain attack where attackers publish malicious packages to public registries with the same name as internally developed private packages, which causes package managers to download the malicious package from the public registry instead of the private one. It confuses the package managers, hence "dependency confusion" attack.


A More Detailed Explanation

Basically, dependency confusion attacks exploit the way modern package managers work, such as npm (Node Package Manager) and pip, which installs from PyPI (the Python Package Index).

Many companies use packages and dependencies from both public registries and private ones. But when a package manager is configured with both, it may search the public and the private registries alike and then choose the "better" match: for example, the one with the higher version, or a better tag match such as the Python version or platform tag.

If a package with the exact same name exists on a public registry, that package could be downloaded instead of the legitimate package on the private registry.

So, if hackers know the name of a dependency used internally and it's not registered in the public registry, they can register it there under the same name and publish a higher version with malicious code injected. From there on, the malicious package is distributed downstream automatically.
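To make the mechanism concrete, here is a toy model of a resolver that treats every registry equally and simply takes the highest version it can find. It is a sketch for illustration, not how any particular package manager is implemented; the registry contents and the package name acme-billing are invented.

```python
# Toy model of a resolver that treats all registries equally.
# Registry contents and the package name are hypothetical.

def pick_candidate(name, registries):
    """Return (version, registry_name) for the highest version found anywhere."""
    candidates = [
        (version, registry_name)
        for registry_name, packages in registries.items()
        for version in packages.get(name, [])
    ]
    if not candidates:
        raise LookupError(f"{name} not found in any registry")
    # Tuples compare element by element, so the version decides first.
    return max(candidates)


registries = {
    "private": {"acme-billing": [(1, 2, 0), (1, 3, 0)]},
    "public": {"acme-billing": [(99, 0, 0)]},  # the attacker's upload
}

version, source = pick_candidate("acme-billing", registries)
print(version, source)  # (99, 0, 0) public
```

Nothing in this selection logic is broken as such: the attacker's package wins only because it claims a higher version number.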


More Common and Serious Than You Think

You might think dependency confusion attacks aren't so big a deal, because nobody knows your internal private package name, and maybe many companies don't have internal packages, but the facts say otherwise.

Based on research from the Orca Cloud Security Platform, nearly half of organizations are vulnerable to dependency confusion attacks, making it an extremely common problem:

"By analyzing NPM and PyPI packages stored in cloud environments scanned by the Orca Platform, we found that as many as 49% of organizations have at least one vulnerable asset, and over 28% have 50 assets or more that are potentially vulnerable to a Dependency Confusion attack."

Private dependency and package names can also leak in many places. For example, NPM dependency names are listed in package.json and package-lock.json files, which can be exposed through unsecured CI systems, code repositories, and so on. If not handled properly, error messages and error responses can also reveal the names of packages in use. Application JavaScript files might contain them, too.
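To see how little effort it takes to harvest names from a leaked manifest, here is a minimal sketch that lists every dependency name found in a package.json. The manifest content, including the name acme-internal-auth, is made up for the example.

```python
import json

# Standard package.json sections that list dependencies by name.
SECTIONS = (
    "dependencies",
    "devDependencies",
    "optionalDependencies",
    "peerDependencies",
)


def dependency_names(package_json_text):
    """Return the sorted set of dependency names declared in a manifest."""
    manifest = json.loads(package_json_text)
    names = set()
    for section in SECTIONS:
        names.update(manifest.get(section, {}))
    return sorted(names)


# Hypothetical leaked manifest.
leaked = """
{
  "name": "storefront",
  "dependencies": {"express": "^4.18.0", "acme-internal-auth": "^1.2.0"},
  "devDependencies": {"jest": "^29.0.0"}
}
"""

print(dependency_names(leaked))  # ['acme-internal-auth', 'express', 'jest']
```

Any name in that list that does not exist on the public registry is a candidate for an attacker to register.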


Dependency Management: No Mean Feat

So, it seems dependency confusion attacks are quite common. But wait a minute: If a package is found both in the public and the private registries, shouldn't the one in the private registry be used?

Or at least, shouldn't there be a switch to say "If the same name is found in both the public and the private registries, use the one from my private registry"?

Isn't it a bug?

Unfortunately, no. I don't enjoy quoting this, but "It's a feature, not a bug": because dependency management is hard, sometimes so difficult as to border on philosophy.

At first glance, a package manager seems simple: It's just a program that installs stuff that you defined (via CLI or a text file). But it actually does (a lot) more than that: It makes sure all of the packages we installed work together.
For example, I want to install a package named coffee v1.0, which declares that it requires a few dependencies: beans, grinder, filter, and many other things. Since coffee depends on beans and grinder, it installs the latest versions of them, for example, both at v2.0. Then the package manager finds out that beans v2.0 requires filter > v2.0, but grinder v2.0 requires filter < v2.0. That is a dependency conflict, and it's the package manager's job to solve it. It might try backtracking: don't install the latest beans v2.0, maybe try beans v1.9 first, because it might not require such a high version of filter incompatible with grinder v2.0's requirement. If that doesn't work, maybe try an earlier version?
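The backtracking idea can be sketched in a few lines. This is a deliberately naive illustration, not the algorithm of any real package manager. The index below follows the coffee example, but the available filter versions (1.5 and 2.1) and the requirement of beans v1.9 are assumptions added so the example has a solution.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge}

# Hypothetical index: package -> version -> {dependency: (operator, version)}
INDEX = {
    "coffee": {(1, 0): {"beans": (">=", (1, 0)),
                        "grinder": (">=", (1, 0)),
                        "filter": (">=", (1, 0))}},
    "beans": {(2, 0): {"filter": (">", (2, 0))},
              (1, 9): {"filter": (">=", (1, 0))}},  # assumed requirement
    "grinder": {(2, 0): {"filter": ("<", (2, 0))}},
    "filter": {(2, 1): {}, (1, 5): {}},  # assumed versions
}


def satisfies(version, constraint):
    op, bound = constraint
    return OPS[op](version, bound)


def resolve(pending, chosen, constraints):
    """Pick a version for each pending package, newest first, backtracking on conflict."""
    if not pending:
        return chosen
    name, rest = pending[0], pending[1:]
    if name in chosen:
        return resolve(rest, chosen, constraints)
    for version in sorted(INDEX[name], reverse=True):
        if not all(satisfies(version, c) for c in constraints.get(name, [])):
            continue  # violates a requirement collected so far
        deps = INDEX[name][version]
        if any(d in chosen and not satisfies(chosen[d], c) for d, c in deps.items()):
            continue  # conflicts with a package that is already picked
        extended = {k: list(v) for k, v in constraints.items()}
        for dep, constraint in deps.items():
            extended.setdefault(dep, []).append(constraint)
        todo = rest + [d for d in deps if d not in rest and d not in chosen]
        result = resolve(todo, {**chosen, name: version}, extended)
        if result is not None:
            return result  # this choice led to a full solution
        # Otherwise fall through and try the next older version.
    return None


print(resolve(["coffee"], {}, {}))
# {'coffee': (1, 0), 'beans': (1, 9), 'grinder': (2, 0), 'filter': (1, 5)}
```

The resolver first tries beans v2.0, finds that no filter version satisfies both beans and grinder, and only then backs off to beans v1.9.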

As you can see, although there are only a few dependencies, it's already messy enough. What's more, the example above hardly begins to describe the scale of the dependency management problem, because there are many more things to consider: What if I'm using a specific version of Python, like Python 3.8? What if I'm running on the Windows platform instead of Linux? What if there is a built package, and what happens when there isn't? ...

You get the idea: dependency management and dependency resolution algorithms are hard (if you are interested in this topic, give this funny blog a read).

Not only are the dependency resolution algorithms hard, but making design choices is philosophically difficult, too. If the user provides multiple registries, should all of them be treated equally? If not, to what extent should a registry be prioritized? What if the prioritized registry can't resolve dependency conflicts or there is a much better fit in an unprioritized registry? What design choice to make? I didn't make up these questions just to illustrate the point; there actually is an ongoing discussion regarding PyPI on exactly this matter, see here and here.
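To illustrate one possible answer to these questions, here is a sketch of a strict priority policy: the first registry that knows a package name wins, even when a later registry offers a higher version. This is a hypothetical policy written for this article, not a description of how pip or npm behave, and the registry contents are invented.

```python
def pick_with_priority(name, ordered_registries):
    """Return (version, registry_name) from the first registry that has the name."""
    for registry_name, packages in ordered_registries:
        versions = packages.get(name)
        if versions:
            return max(versions), registry_name
    raise LookupError(f"{name} not found in any registry")


# Hypothetical registries, listed in priority order.
ordered_registries = [
    ("private", {"acme-billing": [(1, 3, 0)]}),
    ("public", {"acme-billing": [(99, 0, 0)], "requests": [(2, 31, 0)]}),
]

print(pick_with_priority("acme-billing", ordered_registries))  # ((1, 3, 0), 'private')
print(pick_with_priority("requests", ordered_registries))      # ((2, 31, 0), 'public')
```

The trade-off is visible right away: this policy shuts out the attacker's v99.0.0, but a stale or broken copy in the prioritized registry would also shadow a perfectly good newer release elsewhere.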


How to Prevent Dependency Confusion Attacks

At this point, you might feel that we are all doomed because dependency management is hard and many are vulnerable to dependency confusion attacks. Not to mention Murphy's law: Anything that can go wrong will go wrong.

Are there mitigations? Let's brainstorm.

Looking at the whole cycle of dependency confusion attacks, we can see:

  • Dependency confusion attacks originate from the public registries.
  • They can happen because of configuration and operation, whether through package managers' features such as treating all indices equally, or through human errors such as not pinning versions.
  • Although we most likely have monitoring/verification/alerting set up for our applications, we usually don't have any monitoring or checks for packages and dependencies, which further allows malicious actors to cause a bigger impact.

So, to prevent dependency confusion attacks, it's logical to take measures on all three of these factors: sources, configuration/operation, and monitoring.

Perhaps the most powerful and effective way to prevent dependency confusion attacks is to stop them at the source. Since they rely on malicious packages with the same names in the public registries, we can register our private package names in the public registries ourselves, one step ahead of the hackers. By squatting all the names first, no one else can hijack the same packages using dependency confusion attacks because we hijacked them ourselves first. What makes this method particularly good is that it works regardless of whether there is a server misconfiguration or human error. So, as a rule of thumb and a best practice, every organization should register its internally developed private package names with public registries, whenever possible.
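Before registering anything, it helps to know which private names are still unclaimed. Below is a minimal sketch, assuming the packages are Python ones: it asks PyPI's JSON endpoint (https://pypi.org/pypi/NAME/json), which answers 404 for projects that do not exist. The package names are placeholders, and the demonstration uses a stand-in lookup so that it runs without network access.

```python
import urllib.error
import urllib.request


def exists_on_pypi(name):
    """Return True if pypi.org serves metadata for this project name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # other failures should not be read as "unclaimed"


def unclaimed_names(private_names, exists=exists_on_pypi):
    """Return the private names nobody has registered publicly yet."""
    return [name for name in private_names if not exists(name)]


# Offline demonstration with a stand-in lookup (hypothetical data);
# drop the `exists` argument to query pypi.org for real.
fake_public_index = {"requests"}
print(unclaimed_names(["acme-billing", "requests"],
                      exists=lambda name: name in fake_public_index))
# ['acme-billing']
```

Every name this returns is one that an attacker could still register first.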

And of course, more actions can be taken regarding configuration and operation, such as using version pinning to explicitly declare package versions instead of a broad range or no version at all.
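As a rough illustration of what checking for version pinning could look like, the sketch below flags lines in a pip-style requirements file that are not pinned with an exact ==. The regular expression is a simplification made up for this example and does not cover the full requirements syntax.

```python
import re

# Simplified pattern: name, optional extras, "==", an exact version, optional marker.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s*;,]+\s*(;.*)?$"
)


def unpinned_requirements(lines):
    """Return requirement lines that do not pin one exact version."""
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            flagged.append(line)
    return flagged


# Hypothetical requirements file content.
requirements = [
    "requests==2.31.0",
    "flask>=2.0",        # open-ended range
    "acme-billing",      # no version at all
    "# a comment",
    "numpy==1.*",        # wildcard, not an exact pin
]

print(unpinned_requirements(requirements))
# ['flask>=2.0', 'acme-billing', 'numpy==1.*']
```

A check like this could run in CI so that an unpinned dependency never reaches the main branch unnoticed.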

As for verifying and monitoring, we can use packages signed with digital signatures. For example, PyPI has supported package signing: where a package's .asc file is available, you can retrieve it and verify the signature.
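A simpler form of verification is comparing a downloaded artifact against a checksum recorded when the dependency was reviewed. Note that the sketch below swaps in a SHA-256 hash check rather than signature verification; pip offers a comparable mechanism through its --require-hashes mode. The artifact bytes are stand-ins for a real package file.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the artifact matches the recorded digest."""
    return hmac.compare_digest(sha256_of(data), expected_sha256.lower())


# Stand-in bytes; in practice these would be the downloaded package file.
reviewed_artifact = b"def charge(amount): return amount"
recorded_digest = sha256_of(reviewed_artifact)  # stored next to the pinned version

tampered_artifact = b"def charge(amount): send_to_attacker(amount)"

print(verify_artifact(reviewed_artifact, recorded_digest))  # True
print(verify_artifact(tampered_artifact, recorded_digest))  # False
```

Combined with version pinning, a recorded digest means a same-named package from another registry fails the check even if it reuses the pinned version number.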

Automate Prevention of Dependency Confusion Attacks

Detecting risks as early as possible is the best way to protect yourself. That's why the GitGuardian SCA module scans your project's dependencies and compares them to popular public repositories. If it finds any matches between your internal packages and public ones, it will flag them as potential dependency confusion risks.


Summary

To sum up, as a new type of supply chain attack, dependency confusion attacks exploit how package managers work by publishing malicious packages to public registries with the same name as internally developed private packages. Many systems are vulnerable to them due to misconfiguration and missing version pinning, but it's not the end of the world, because the risk can be mitigated in many ways, the most effective of which is registering private package names in public registries whenever possible.

If you like this article, please comment and subscribe. See you in the next piece!