Imagine being a security scanner, living your best life searching through millions of lines of code. Your job? Catching sneaky developers who left their AWS keys hanging around in plain sight. You're basically a digital detective, but instead of looking for fingerprints, you're hunting for strings that look like AKIA...
🕵️
Let me tell you why being a secret scanner is harder than finding a semicolon in JavaScript. When we're hunting for secrets in code, we're playing a high-stakes game of "Is This a Secret or Is This a String That Just Looks Like One?"
That question is one of the cornerstones of secrets detection: the classic classification problem!
Here's what our confusion matrix looks like in the wild:
| | Predicted: Secret | Predicted: Not Secret |
|---|---|---|
| Actual: Secret | Found a real AWS key! 🎉 | Missed a real AWS key 😱 |
| Actual: Not Secret | False alarm on TEST_KEY=123 🤦 | Correctly ignored hello_world 👍 |
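If you like your confusion matrices executable, here is a minimal Python sketch that tallies the four cells from a list of scanner verdicts. The candidate strings and the `confusion_matrix` helper are made up for illustration; they are not part of any real scanner.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Tally the four cells; both inputs are lists of booleans (True = secret)."""
    cells = Counter()
    for is_secret, flagged in zip(actual, predicted):
        if is_secret and flagged:
            cells["true_positive"] += 1    # found a real key
        elif is_secret:
            cells["false_negative"] += 1   # missed a real key
        elif flagged:
            cells["false_positive"] += 1   # false alarm
        else:
            cells["true_negative"] += 1    # correctly ignored
    return cells


# Hypothetical candidates: (description, really a secret?, flagged by the scanner?)
candidates = [
    ("real AWS key, caught", True, True),
    ("real AWS key, missed", True, False),
    ("TEST_KEY=123", False, True),
    ("hello_world", False, False),
]

cells = confusion_matrix(
    [is_secret for _, is_secret, _ in candidates],
    [flagged for _, _, flagged in candidates],
)
print(dict(cells))
```

Each candidate lands in exactly one cell, which is why the four counts always add up to the number of strings the scanner looked at.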
Want to have a closer look? Introducing the Precision vs Recall Security Zine!
Zine summary
- Sure, accuracy sounds great on paper. "Our scanner is 95% accurate!" Cool story, but here's the catch: in a typical codebase, real secrets are about as rare as a bug-free JavaScript framework. If our scanner just said, "Nope, no secrets here!" for every file, it would be 99.9% accurate... and 100% useless.
- Precision measures how often we're right when we yell "SECRET!" Low precision is like that one security tool that flags every string containing "key", including your perfectly innocent `keyboard_layout` variable. Soon, developers start treating your alerts like Windows Update notifications: "Yeah, yeah, I'll look at it later" (narrator: they never did).
- Recall answers the question "Did we find ALL the secrets?" Low recall is like having a security guard who's really good at catching people wearing red shirts but completely misses the guy walking out with the server under his arm because he wore blue.
- The GitGuardian way: cast a wide net, filter with style. Using a combination of specific and generic detectors, the GitGuardian secrets detection engine identifies a wide range of assignments. The engine then removes nearly all false positives using handcrafted, highly efficient data filtering pipelines and machine learning techniques.
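To see the three metrics and the "wide net, then filter" idea in one place, here is a toy Python sketch. It is emphatically not GitGuardian's engine: the regex, the length and character-variety thresholds, and the sample lines are all assumptions picked to keep the example small. The padding of 995 harmless lines is there so that one real secret sits among 1,000 lines, which reproduces the 99.9% "accuracy" of a scanner that never flags anything.

```python
import re

# Stage 1, the wide net: a deliberately generic (hypothetical) pattern that flags
# any assignment whose variable name contains "key", "secret" or "token".
GENERIC_ASSIGNMENT = re.compile(
    r"(?P<name>\w*(?:key|secret|token)\w*)\s*=\s*[\"']?(?P<value>[^\s\"']+)",
    re.IGNORECASE,
)


def wide_net(lines):
    """Return a regex match for every line that looks like a secret assignment."""
    matches = (GENERIC_ASSIGNMENT.search(line) for line in lines)
    return [m for m in matches if m is not None]


def looks_like_false_positive(match):
    """Stage 2, the filter: crude heuristics with made-up thresholds."""
    value = match.group("value")
    too_short = len(value) < 16        # "123" or "qwerty" is not a credential
    low_variety = len(set(value)) < 8  # "xxxxxxxx..." is a placeholder
    return too_short or low_variety


def metrics(flagged, samples):
    """Return (accuracy, precision, recall) for a set of flagged lines."""
    tp = fp = fn = tn = 0
    for line, is_secret in samples:
        hit = line in flagged
        if is_secret and hit:
            tp += 1
        elif is_secret:
            fn += 1
        elif hit:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(samples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labeled lines: (source line, is it really a secret?)
samples = [
    ('aws_secret_key = "q9Zt4LmX2vB7pRk8WsY1"', True),  # invented value
    ("TEST_KEY=123", False),
    ('keyboard_layout = "qwerty"', False),
    ('api_token = "xxxxxxxxxxxxxxxxxxxx"', False),
    ('greeting = "hello_world"', False),
]
# Pad with harmless lines so that secrets are rare: 1 secret in 1,000 lines.
samples += [(f"x{i} = {i}", False) for i in range(995)]
lines = [line for line, _ in samples]

always_no = metrics(set(), samples)  # the scanner that never flags anything

candidates = wide_net(lines)
before_filter = metrics({m.string for m in candidates}, samples)

kept = [m for m in candidates if not looks_like_false_positive(m)]
after_filter = metrics({m.string for m in kept}, samples)

for label, result in [
    ("always 'no secrets'", always_no),
    ("wide net only", before_filter),
    ("wide net + filter", after_filter),
]:
    accuracy, precision, recall = result
    print(f"{label:>20}: accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data set the lazy scanner scores high on accuracy with zero recall, the wide net catches the real key but buries it under three false alarms, and the filter clears those alarms without losing the key. Real code is messier: any filter aggressive enough to drop placeholders can also drop a genuine secret, so the precision gain has to be weighed against the recall it may cost.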