Precision & Recall [Security Zines]

Imagine being a security scanner, living your best life searching through millions of lines of code. Your job? Catching sneaky developers who left their AWS keys hanging around in plain sight. You're basically a digital detective, but instead of looking for fingerprints, you're hunting for strings that look like AKIA... 🕵️‍♂️

Let me tell you why being a secret scanner is harder than finding a semicolon in JavaScript. When we're hunting for secrets in code, we're playing a high-stakes game of "Is This a Secret or Is This a String That Just Looks Like One?"

Well done, you just touched on one of the cornerstones of secrets detection: the classic binary classification problem!

Here's what our confusion matrix looks like in the wild:

	Predicted: Secret	Predicted: Not Secret
Actual: Secret	Found a real AWS key! 🎉	Missed a real AWS key 😱
Actual: Not Secret	False alarm on `TEST_KEY=123` 🤦	Correctly ignored `hello_world` 👍
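To make the four cells concrete, here is a minimal sketch that sorts a handful of labeled lines into the matrix. Everything in it is an assumption made for illustration: the deliberately naive `NAIVE_DETECTOR` regex, the `SAMPLES` list, and the fake key are hypothetical and are not how any real scanner (GitGuardian's included) works.

```python
import re

# Hypothetical, deliberately naive detector: flags anything shaped like an
# AWS access key ID ("AKIA" + 16 uppercase alphanumerics) or any "KEY=" assignment.
NAIVE_DETECTOR = re.compile(r"AKIA[0-9A-Z]{16}|KEY\s*=")

# Hand-labeled toy samples: (line of code, is it actually a secret?).
# The key below is made up for this example.
SAMPLES = [
    ("aws_access_key_id = AKIAABCDEFGHIJKLMNOP", True),   # real-looking key
    ('db_password = "s3cr3t-pa55"', True),                # secret the regex cannot see
    ("TEST_KEY=123", False),                              # harmless test value
    ('greeting = "hello_world"', False),                  # plain string
]


def predict_secret(line: str) -> bool:
    """Return True when the naive detector flags the line."""
    return NAIVE_DETECTOR.search(line) is not None


def confusion_matrix(samples):
    """Count true/false positives and negatives for the labeled samples."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for line, is_secret in samples:
        flagged = predict_secret(line)
        if flagged and is_secret:
            counts["TP"] += 1   # found a real secret
        elif flagged and not is_secret:
            counts["FP"] += 1   # false alarm
        elif not flagged and is_secret:
            counts["FN"] += 1   # missed a real secret
        else:
            counts["TN"] += 1   # correctly ignored
    return counts


if __name__ == "__main__":
    print(confusion_matrix(SAMPLES))
```

With these four toy lines, each cell of the matrix gets exactly one entry: the fake AWS key is caught, the password slips through, `TEST_KEY=123` raises a false alarm, and `hello_world` is left alone.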

Want to have a closer look? Introducing the Precision vs Recall Security Zine!

Precision and Recall in secrets detection

Zine summary

Sure, accuracy sounds great on paper. "Our scanner is 95% accurate!" Cool story, but here's the catch: in a typical codebase, real secrets are about as rare as a bug-free JavaScript framework. If our scanner just said, "Nope, no secrets here!" for every file, it would be 99.9% accurate... and 100% useless.
Precision measures how often we're right when we yell "SECRET!" Low precision is like that one security tool that flags every string containing "key" – including your perfectly innocent keyboard_layout variable. Soon, developers start treating your alerts like Windows Update notifications: "Yeah, yeah, I'll look at it later" (narrator: they never did).
Recall answers the question "Did we find ALL the secrets?" Low recall is like having a security guard who's really good at catching people wearing red shirts but completely misses the guy walking out with the server under his arm because he wore blue.
The GitGuardian way: cast a wide net, filter with style. Using a combination of specific and generic detectors, the GitGuardian secrets detection engine identifies a wide range of candidate secret assignments. It then removes nearly all false positives using handcrafted, highly efficient data filtering pipelines and machine learning techniques.
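The points above can be put into numbers with the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), and recall = TP / (TP + FN). The sketch below compares two imaginary scanners on an imaginary codebase of 100,000 lines with 100 real secrets. All counts are invented for illustration, and the `metrics` helper is a hypothetical name, not part of any real tool.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts.

    Precision and recall are undefined when their denominator is zero;
    this sketch returns 0.0 in that case, which is a convention, not a rule.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented scenario: 100,000 lines, 100 of which hold a real secret.

# Scanner A never raises an alert: "Nope, no secrets here!"
lazy = metrics(tp=0, fp=0, fn=100, tn=99_900)

# Scanner B flags anything that looks remotely like a key:
# it catches 95 secrets but also raises 900 false alarms.
noisy = metrics(tp=95, fp=900, fn=5, tn=99_000)

if __name__ == "__main__":
    for name, result in (("lazy", lazy), ("noisy", noisy)):
        print(name, {k: f"{v:.1%}" for k, v in result.items()})
```

Under these made-up numbers, the lazy scanner reaches 99.9% accuracy while finding nothing at all, and the noisy scanner finds 95% of the secrets but is right fewer than one time in ten when it yells "SECRET!". That gap is why precision and recall, not accuracy, are the numbers worth watching.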

👍 Security Zines is a project led by Rohit Sehgal. Check out his work at securityzines.com/#comics and give him a follow on Twitter @sec_r0 to see what he comes up with next!