Imagine being a security scanner, living your best life searching through millions of lines of code. Your job? Catching sneaky developers who left their AWS keys hanging around in plain sight. You're basically a digital detective, but instead of looking for fingerprints, you're hunting for strings that look like AKIA... 🕵️‍♂️

Let me tell you why being a secret scanner is harder than finding a semicolon in JavaScript. When we're hunting for secrets in code, we're playing a high-stakes game of "Is This a Secret or Is This a String That Just Looks Like One?"

Well done, you just touched on one of secrets detection's cornerstones - the classic classification problem!

Here's what our confusion matrix looks like in the wild:

| | Predicted: Secret | Predicted: Not Secret |
| --- | --- | --- |
| Actual: Secret | Found a real AWS key! 🎉 | Missed a real AWS key 😱 |
| Actual: Not Secret | False alarm on TEST_KEY=123 🤦 | Correctly ignored hello_world 👍 |
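
If you want to see those four cells fall out of actual scan results, here is a minimal sketch. The `confusion_counts` helper and the labelled `samples` are made up for illustration; they are not part of any real scanner's API.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Tally the four confusion-matrix cells from two parallel lists of booleans.

    actual[i]    -> True if candidate i really is a secret
    predicted[i] -> True if the scanner flagged candidate i
    """
    cells = Counter()
    for is_secret, flagged in zip(actual, predicted):
        if is_secret and flagged:
            cells["tp"] += 1  # found a real secret
        elif is_secret:
            cells["fn"] += 1  # missed a real secret
        elif flagged:
            cells["fp"] += 1  # false alarm
        else:
            cells["tn"] += 1  # correctly ignored
    return cells


# Hypothetical hand-labelled candidates:
# (string, is it really a secret?, did the scanner flag it?)
samples = [
    ("AKIA... (a real AWS key)", True, True),
    ("AKIA... (another real AWS key)", True, False),
    ("TEST_KEY=123", False, True),
    ("hello_world", False, False),
]

counts = confusion_counts(
    [is_secret for _, is_secret, _ in samples],
    [flagged for _, _, flagged in samples],
)
print(dict(counts))
```

Each sample lands in exactly one cell, so the four counts always add up to the number of candidates you labelled.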

Want to have a closer look? Introducing the Precision vs Recall Security Zine!

Precision and Recall in secrets detection

Zine summary

  • Sure, accuracy sounds great on paper. "Our scanner is 95% accurate!" Cool story, but here's the catch: in a typical codebase, real secrets are about as rare as a bug-free JavaScript framework. If our scanner just said, "Nope, no secrets here!" for every file, it would be 99.9% accurate... and 100% useless.
  • Precision measures how often we're right when we yell "SECRET!" Low precision is like that one security tool that flags every string containing "key" – including your perfectly innocent keyboard_layout variable. Soon, developers start treating your alerts like Windows Update notifications: "Yeah, yeah, I'll look at it later" (narrator: they never did).
  • Recall answers the question "Did we find ALL the secrets?" Low recall is like having a security guard who's really good at catching people wearing red shirts but completely misses the guy walking out with the server under his arm because he wore blue.
  • The GitGuardian way: cast a wide net, filter with style. Using a combination of specific and generic detectors, GitGuardian's secrets detection engine is able to identify a wide range of assignments. Then the engine removes nearly all false positives using handcrafted, highly efficient data filtering pipelines and machine learning techniques.
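
To put numbers on the accuracy trap, here is a rough sketch of the three metrics. Everything in it is an assumption made for illustration: the corpus size, the 0.1% secret rate, and both imaginary scanners. None of these figures come from GitGuardian or from a real benchmark.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all candidates that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """When we yell "SECRET!", how often are we right?"""
    # Undefined if nothing was flagged; this sketch returns 0.0 by convention.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of all the real secrets, how many did we find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical corpus: 100,000 candidate strings, 100 of them real secrets.
TOTAL, REAL = 100_000, 100

# Scanner A says "Nope, no secrets here!" for everything.
lazy = dict(tp=0, fp=0, fn=REAL, tn=TOTAL - REAL)

# Scanner B flags anything containing "key": it catches 95 real secrets
# but also raises 5,000 false alarms (made-up numbers).
noisy = dict(tp=95, fp=5_000, fn=5, tn=TOTAL - REAL - 5_000)

for name, c in (("lazy", lazy), ("noisy", noisy)):
    print(
        f"{name:>5}: accuracy={accuracy(**c):.3f} "
        f"precision={precision(c['tp'], c['fp']):.3f} "
        f"recall={recall(c['tp'], c['fn']):.3f}"
    )
```

The lazy scanner scores 99.9% accuracy with zero recall, while the noisy one finds most secrets but is right on fewer than 2% of its alerts. Neither number alone tells you whether a scanner is any good, which is why precision and recall get reported together.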
👍 Security Zines is a project led by Rohit Sehgal. Check out his work at securityzines.com/#comics and give him a follow on Twitter @sec_r0 to see what he comes up with next!