How can developers use AI securely in their tooling and processes, in their software, and in their work in general? In mid-October of 2023, our own Mackenzie Jackson spoke to Snyk's Simon Maple on our Security Repo Podcast. The episode, "Artificial Intelligence - A Friend Or Foe in Cybersecurity," delivered some great nuggets that are worth resharing.
But before we get into the three tips, there's a tip that might be considered "Tip Zero": the integration of AI into your processes and products isn't a matter of if, but when. Simon makes a very good point that AI is becoming similar to open source software in a way. To remain nimble and leverage the work of great minds from around the world, companies will need to adopt it or spend a lot of time and money trying to achieve on their own what AI can achieve for them.
As much as you might want to prohibit AI in the workplace, your employees will turn it into a cat-and-mouse game. For decades, Trident Sugarless Gum ran commercials claiming "four out of five dentists recommend sugarless gum for their patients who chew gum." The fifth just said, "don't chew gum. Period." Guess how many of that dentist's patients chewed gum anyway.
Tip 1: All rules about third-party services still apply
In May of 2023, Samsung banned ChatGPT because an employee uploaded some sensitive internal source code to the service. While it may have been useful to the employee, OpenAI could retain that code and even train upcoming models on it. OpenAI's Terms of Use specify "we may use Content to provide, maintain, develop, and improve our Services, comply with applicable law, enforce our terms and policies, and keep our Services safe." OpenAI defines "Content" as both what you input and what they output.
Big companies like Amazon and Microsoft have policies about how to classify information and what information can be stored, transmitted, or processed outside the corporate network. Training on those policies is both part of new hire orientation and periodic security refreshers.
Ensure your employees understand that externally hosted AI tools are still "foreign agents" and must be treated as insecure. Remember the fifth dentist, however: simply saying "no" will create a cat-and-mouse situation. Consider alternatives to insecure options, such as open source solutions you can inspect for phone-home mechanisms and run on-premises, or enterprise offerings from AI providers that may give you the privacy guarantees you need.
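If you do sanction an external service, consider putting a gate in front of it. Below is a minimal sketch in Python of a redact-before-send wrapper a team might place between developers and a hosted AI API. Everything here is illustrative: the two regex patterns and the sample prompt are stand-ins, and a real deployment would lean on a dedicated secrets and PII scanner, with an approved client library sitting on the other side of the wrapper.

```python
import re

# Hypothetical patterns for data your policy forbids sending off-network.
# A real deployment would use a dedicated secrets/PII scanner, not two regexes.
REDACTION_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def redact(text: str) -> str:
    """Replace anything matching a forbidden pattern before it leaves the network."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

if __name__ == "__main__":
    prompt = "Why does boto3 reject AKIAIOSFODNN7EXAMPLE for jane.doe@example.com?"
    # Only the redacted text would ever be handed to the external API client.
    print(redact(prompt))
```

The specific patterns matter less than the principle: nothing classified should reach the third party in the first place.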
Tip 2: An AI can't reveal what it doesn't know
LLMs keep secrets about as well as toddlers do.
In recent months there have been tales of prompt injection and training data extraction exploits. Probably the most famous was the "repeat a word forever" attack: researchers asked ChatGPT to repeat a single word, like "book" or "poem," forever. It would comply for a while, then start spitting out pieces of data it had memorized during training, including Bitcoin addresses, personally identifiable information, and more.
The researchers responsibly reported this to OpenAI before publishing their findings so OpenAI could plug that leak, but it was still disturbing for a lot of people.
During the podcast, Simon mentioned a great hands-on trainer for prompt injection called Gandalf (harking back to the famous line "You shall not pass!"). It has eight levels of increasing difficulty, each challenging you to use prompt injection to get an AI to reveal a password. The first two are pretty easy, but it gets progressively harder.
Simply put, do not throw mountains of unsanitized training data at your LLM. GitGuardian literally came to be because developers were leaking secrets in public GitHub repositories. If a company trained an LLM on its private repositories, it's possible that an attacker could get the LLM to spit out anything from proprietary code to hard-coded secrets. Researchers have been able to do so already with GitHub Copilot.
If a public or company-wide LLM isn't trained on information you don't want shared, it can't share it.
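To make "sanitized" concrete, here's a hypothetical pre-training check in Python: it walks a candidate corpus and flags files that appear to contain hard-coded secrets so they can be excluded or scrubbed before fine-tuning. The directory name and the three patterns are stand-ins; in practice you would run a dedicated scanner such as ggshield or detect-secrets across the corpus instead of rolling your own regexes.

```python
import pathlib
import re

# Illustrative patterns only; a real pipeline would run a dedicated secrets
# scanner over the corpus rather than rely on a short list of regexes.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # private key header
    re.compile(r"(?i)(api[_-]?key|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]

def flag_files(corpus_dir: str) -> list[pathlib.Path]:
    """Return files to exclude (or scrub) before they become training data."""
    flagged = []
    for path in pathlib.Path(corpus_dir).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        if any(pattern.search(text) for pattern in SECRET_PATTERNS):
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    for path in flag_files("./training_corpus"):
        print(f"Exclude or scrub before training: {path}")
```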
Tip 3: Trust but verify
Some LLMs have been trained on a ton of GitHub repositories. While there's a lot of good code on GitHub, there's also a lot of bad code, and most LLMs aren't smart enough to tell the difference.
According to Simon, this comes down to how LLMs process things. An LLM doesn't truly understand your question, and it doesn't truly understand its answer. Instead, it's just spitting out the result of a lot of statistical calculations.
It's sort of like the difference between correlation and causation. Imagine 51% of third graders are throwing up every day at school. If you give that data point to a correlative model and ask it what is making the third graders throw up, its answer will be "school." LLMs are basically correlative models.
The AI can't step through the code and tell you what the output of a specific variable would be under specific conditions. It doesn't actually understand what the code will do. It just knows that, based on its calculations and correlations, this is the most likely answer to your question.
If you're getting an AI to write code, you still need to inspect and test it. It still needs a code review, static analysis, and the rest of your normal checks. Treat it like a junior developer who just onboarded. Realize that there is exploit code, backdoored code, and all sorts of other poisoned data in the average LLM's training data; therefore, while it may be very helpful, it cannot be trusted implicitly.
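Here's a small, contrived illustration of why that verification matters (the scenario and function names are invented, not from the podcast). The first redirect check below is the kind of plausible-looking code an assistant might suggest; a single adversarial test case in review is enough to expose the flaw and justify the hostname-based rewrite.

```python
from urllib.parse import urlparse

# A redirect check the way an AI assistant might plausibly suggest it: it looks
# reasonable, but startswith() also accepts "https://trusted.example.evil.io".
def is_trusted_redirect_suggested(url: str) -> bool:
    return url.startswith("https://trusted.example")

# The version a reviewer would push for: compare the parsed hostname exactly.
def is_trusted_redirect_reviewed(url: str) -> bool:
    return urlparse(url).hostname == "trusted.example"

if __name__ == "__main__":
    lookalike = "https://trusted.example.evil.io/login"
    # The same adversarial test case, applied to both implementations.
    print("suggested:", is_trusted_redirect_suggested(lookalike))  # True  -> caught in review
    print("reviewed: ", is_trusted_redirect_reviewed(lookalike))   # False -> rejects the lookalike
```

Tests, static analysis, and a human reviewer each get a chance to catch this kind of slip, which is exactly why AI-generated code should go through the same pipeline as everything else.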
Wrapping up
These three tips are just the "tip of the iceberg" when it comes to using AI securely in day-to-day business practices, especially when using it to help write code. We highly recommend you watch the full podcast with Simon and Mack, and if you want to read more, Simon's got a great blog post on securely developing with AI. And if one sentence can sum up everything in this article, it's "proceed with AI, but proceed with caution."