June 2021 presented an opportunity for me to live out my dream of actually being a “good” developer. June was the month GitHub Copilot was introduced. Yes! I thought, no more embarrassing code reviews, no more endless scrolling through Stack Overflow. I Will Be A Genius!
While that didn’t end up being completely true, I still held out hope that it would one day become true with some improvements to the AI engine behind Copilot. Day 1 of BlackHat 2022 has officially shut the door on that expectation.
My highlight session of the day (and not just because it was presented by two fellow New Zealanders, Hammond Pearce and Benjamin Tan) was a session on whether GitHub Copilot introduces vulnerable code suggestions.
We are at a security conference, so obviously you expect the answer to be “yes, it does”. But the results of the experiment were much more fascinating than expected.
Pearce and Tan also teamed up with fellow security researchers Brendan Dolan-Gavitt and Baleegh Ahmad to write a paper on how Copilot introduces insecure code. I will not paraphrase the entire research but will instead focus on one element I found particularly interesting, or perhaps disappointing.
Write crappy code, get a crappy Copilot
Hammond Pearce (left) and Benjamin Tan (right) on stage at BlackHat 25 - source: BlackHat 25
The research project
The presentation put Copilot forward as the potential end of the Stack Overflow age, where we can get the answers we seek directly from our IDE using an AI code suggestion tool, without scrolling through hours of Stack Overflow posts. This introduces a big problem though, namely automation bias: the idea that we trust automated code far more than we should.
When we look at Stack Overflow we can see a stream of comments or suggestions, and we understand that it was written by humans, flawed fellow humans. But an AI tool is supposedly far more intelligent, right? Actually, it is also based on code written by fellow humans, so probably not. And it isn’t that intelligent. Copilot has a huge database of code already written, and using this huge database it gives its best guess as to what it thinks you want. But that isn’t a database of good code, or secure code. It is a database of just huge amounts of code, and it will serve up whatever code it thinks you want. In this case artificial intelligence is closer to artificial guessing (even if that guess is still based on some logic).
“Humans generally have this bias towards accepting without thinking anything that comes from algorithm or automation” - Benjamin Tan
The research was simple in its design. Take a GitHub product, Copilot, give it some seed code and let it run wild by taking multiple code suggestions. Then take another GitHub product, CodeQL, and test how vulnerable the code is. The research didn’t just take the top suggested result; it took many of the options and created many different versions of the same application to see how many were vulnerable. In total, from 89 scenarios they created 1689 programs written mostly by Copilot. They used different languages to see if there was a difference (there was), and even used different methods to try and trick Copilot.
“Out of 89 scenarios we created 1689 programs – 39.33% of the top suggestions were vulnerable and 40.73% of total suggestions were vulnerable” - Hammond Pearce
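The two percentages in that quote measure different things, which is easy to miss. Here is a minimal sketch of how I understand them to be calculated, assuming each scenario keeps its generated programs in suggestion order along with a yes/no flag from the scanner. The `Scenario` class, the function names and the toy data are all my own invention for illustration; they are not the researchers' code or their results.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical record: one seed prompt and its generated programs."""
    name: str
    # One flag per generated program, ordered by suggestion rank.
    # True means the scanner flagged that program as vulnerable.
    flagged: list[bool]


def top_suggestion_rate(scenarios: list[Scenario]) -> float:
    """Percentage of scenarios whose first-ranked program was flagged."""
    usable = [s for s in scenarios if s.flagged]
    return 100 * sum(s.flagged[0] for s in usable) / len(usable)


def total_rate(scenarios: list[Scenario]) -> float:
    """Percentage of all generated programs, across scenarios, that were flagged."""
    flags = [f for s in scenarios for f in s.flagged]
    return 100 * sum(flags) / len(flags)


# Toy data only, NOT the study's results.
toy = [
    Scenario("scenario-a", [True, False, False]),
    Scenario("scenario-b", [False, True]),
    Scenario("scenario-c", [False, False, False]),
]

print(f"top suggestions flagged: {top_suggestion_rate(toy):.2f}%")
print(f"all suggestions flagged: {total_rate(toy):.2f}%")
```

The first number divides by the count of scenarios and the second by the count of programs, so in the talk's figures the denominators would be 89 and 1689 respectively.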
You might think that Copilot is taking your seed code, trying to figure out the intent of that code, and giving you the best next step towards the ultimate solution. But as we said, artificial intelligence is more like artificial guessing. Copilot isn’t concerned with the intent of your code; it is concerned with the next step based on the data it has. When you feed Copilot seed code written the way a less experienced developer would write it, it gives you back a result a less experienced developer might come to themselves. But the results actually got pretty weird too. By using the author fields, they tricked Copilot into thinking the code was being written by a well-known and experienced developer, in this case Andrey Petrov, and guess what? It produced less vulnerable code than when using a lesser-known author, Hammond Pearce…. Whatttttt……
Choosing between spaces and tabs (not that I want to unpack this can of worms here) also changed whether the output code was likely to be vulnerable.
The essence of this presentation was that if you are a shitty (novice) developer (like me), then you are going to get a shitty (novice) Copilot. It isn’t going to make you the experienced super dev you dreamed of becoming when it was first announced; it will mostly help you write shitty code faster. But this doesn’t mean Copilot is inherently bad or that we shouldn’t use it.
“Large language models like GitHub Copilot will probably transform the way we write software” - Benjamin Tan
What does GitHub say?
GitHub does acknowledge that Copilot can introduce security vulnerabilities and that it should be complemented with other security tools, specifically other GitHub tools of course.
Should you use Copilot - Conclusion and recommendations
I do want to stress that there were plenty of findings in this research and the full paper can be found here. But this part I found fascinating and also surprisingly logical. Copilot and other tools are not trying to unpack the logic of what you are aiming to achieve in your code; they are looking for similar code to give you similar suggestions based on size. The danger comes in two forms:
1) You automatically trust automation, and therefore Copilot, more than you should.
2) Unlike Stack Overflow, for example, Copilot gives you isolated suggestions. Stack Overflow has upvotes, comments, and downvotes, and deciding to take code written by someone else is a process. Copilot is just a reflex.
Fortunately, I was able to ask Tan a couple of questions after the presentation. One of them was what his three recommendations would be when using Copilot.
Be smart and be careful when using this tool
1. Make sure you have good processes in place for general security validation - Don’t treat Copilot differently from human developers
2. Think a little bit about who in your team should be able to use this, and be mindful that novice developers might blindly trust it
3. Really treat it as a Copilot, something that will help you along and don’t let it direct your code
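To make the first recommendation a bit more concrete, here is a small sketch of the kind of thing a review or a scanner should catch, whether a human or Copilot wrote it. SQL injection is my own pick of example, not a case quoted from the talk, and the table, function names and payload below are made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two made-up users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is pasted straight into the SQL text,
    # so input containing quotes can rewrite the query.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the value is passed as a bound parameter and is
    # treated as data, never as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "' OR '1'='1"  # classic injection string

print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no user has that name
```

Both functions look plausible at a glance and both work fine for a normal name like "alice", which is exactly why a reflexively accepted suggestion of the first kind can slip through without a proper validation step.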
I will leave you with one line from Pearce that sums up whether Copilot should be used or not.
“Copilot should remain a CO-Pilot” - Hammond Pearce