GitHub Copilot Security and Privacy Concerns: Understanding the Risks and Best Practices
AI-powered code completion tools like GitHub Copilot, co-developed by GitHub and OpenAI, likely need no introduction. Developers are rapidly embracing this evolving technology to aid them in their work. Copilot and other Large Language Model (LLM) based coding assistants suggest lines of code and entire functions, saving time otherwise spent looking up specific syntax and helping to scaffold common structures in the code. They are also great for helping a team write documentation.
However, is Copilot secure? As beneficial as it may be, it also comes with significant security and privacy concerns that individual developers and organizations must be aware of. As Frank Herbert put it in "God Emperor of Dune" (the 4th book in the Dune saga):
"What do such machines really do? They increase the number of things we can do without thinking. Things we do without thinking–there's the real danger."
The first step in protecting yourself and your team is to understand the pitfalls to avoid as we leverage these handy tools to help us all work more efficiently.
How GitHub Copilot is trained
To better understand what to guard against, it is important to remember how data ends up in these LLMs. GitHub Copilot is trained on a large amount of data from a wide variety of sources, and it draws on what it learned from that data as it answers user prompts. These training sources include code from public GitHub repositories and a broad swath of text from the public internet.
Importantly, depending on the product and your settings, an AI assistant may also collect the prompts users input and use them to improve the model. If you copy/paste code or data into any public LLM, you risk that work being retained and resurfacing in suggestions to others. For open-source projects or public information, there is not a lot of danger here on the surface, as the model has likely been trained on it already. For internal and private code and data, however, this is where the danger really starts.
Security concerns with GitHub Copilot
Here are just a few of the issues to be aware of and to watch out for as you leverage any code assist tool in your development workflow.
Potential Leakage of secrets and private code
GitHub Copilot may suggest code snippets that contain sensitive information, including keys to your data and machine resources. This is at the top of our list as it means an attacker can potentially leverage Copilot to gain an initial foothold.
While some safeguards are in place, clever prompt rewording can yield suggestions that contain valid credentials. This is a very attractive path for attackers looking for ways to gain access for malicious purposes.
Attackers are also looking for clues about your applications and environments. If they learn you are using an outdated version of some software, especially a component in your application with a known, easily exploited flaw, then that is likely an attack path they will attempt to exploit. While more time-consuming to execute than using a discovered API key, this is still a serious concern for any enterprise.
Insecure Code Suggestions
While we would love to say ChatGPT and Copilot only ever suggest completely secure code and configurations, the reality is the suggestions will only ever be as good as the data they are trained on. In effect, Copilot reflects an average of the work developers have shared publicly. Unfortunately, the security failings present in those public codebases are part of the corpus on which it bases its suggestions.
The data it is trained on is also aging rapidly and can't keep up with the latest advances in threats and vulnerabilities. With new CVEs and new attack techniques appearing all the time, code that would have been fine even a couple of years ago is sometimes just not up to modern challenges.
Poisoned data can mean malicious code
Recently, a research team uncovered a method of injecting hard-to-detect malicious code samples into training data, poisoning code-completion AI assistants so that they suggest vulnerable code. Attackers luring developers into using purposefully insecure code is not a new phenomenon, but attackers are now counting on developers simply trusting the code suggestions from their friendly Copilot and not scrutinizing them for security holes. By contrast, a random code sample found on StackOverflow would likely give every developer pause, especially if heavily downvoted.
Package Hallucination Squatting
One of the more disturbing problems across all AIs is that they simply make things up. When asking trivial questions, these hallucinations can be rather entertaining at times. When writing code, this issue can be rather annoying and, increasingly, rather dangerous.
In the best of scenarios, the package that Copilot suggests simply does not exist, and you will need to find an alternative. This pulls you out of your flow and wastes your time. One researcher reported that as many as 30% of all packages suggested by ChatGPT were hallucinated.
Attackers are well aware of this issue and have begun leveraging it to find commonly suggested hallucinations and register those packages themselves. The most clever of them will clone similar packages that perform the functionality Copilot describes and then hide malicious code within, counting on the developer not to look too closely. This practice is similar to typosquatting; hence, the security community has dubbed this issue "hallucination squatting."
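One way to reduce exposure to hallucinated or squatted packages is to compare every dependency an assistant suggests against a list your team has already vetted. The following is a minimal sketch of that idea, not a complete defense. The `APPROVED_PACKAGES` set and the function names are hypothetical and exist only for illustration; the name normalization follows the convention described in PEP 503.

```python
import re

# Hypothetical allowlist: packages your team has already reviewed and approved.
APPROVED_PACKAGES = {"requests", "numpy", "flask"}


def normalize(name: str) -> str:
    """Normalize a package name so comparisons ignore case and separator style."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unvetted_packages(requirement_lines, approved=APPROVED_PACKAGES):
    """Return package names from requirements-style lines that are not on the allowlist."""
    approved_normalized = {normalize(p) for p in approved}
    flagged = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        # Take the leading name, stopping at version specifiers, extras, or markers.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match and normalize(match.group(0)) not in approved_normalized:
            flagged.append(match.group(0))
    return flagged


if __name__ == "__main__":
    suggested = ["requests==2.31.0", "fast-json-utils>=1.0  # suggested by an assistant"]
    print(unvetted_packages(suggested))
```

Anything the function flags is not necessarily malicious. It is simply a package nobody on the team has looked at yet, which is exactly the moment to pause and investigate before installing it.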
Lack of Attribution and Licensing
One of the more commonly overlooked issues with code suggested by any LLM is its licensing. When Copilot generates code, it does not always provide clear attribution to the original source. This is less of an issue for code under permissive licenses like Apache or MIT, although even those expect copyright and license notices to be preserved. But what if you include a bit of code under a copyleft license, such as the GPL, which can require that a codebase incorporating it be released under the same terms? What do your legal and compliance teams say about this? If you doubt whether you can include the code, it should likely be left out of your project.
Privacy Concerns with GitHub Copilot
In addition to the security concerns we have already covered, privacy is another class of concern to be addressed. Privacy laws differ between jurisdictions, but these issues affect our users, the very folks we want to work to keep safe.
Sharing private code
As mentioned before, GitHub Copilot collects data on user interactions, including the code that users write and how users respond to the suggestions it generates. While the goal is to help refine the model and give everyone a better experience, for developers working on sensitive or proprietary projects, it raises some very serious privacy concerns. Your organization may not want its code or development practices to be analyzed or stored by GitHub, even if it is for improving AI performance.
Retention of User Data
The community has many questions about how long LLMs retain user data, how it is stored, and what specifically is in there. Companies go to great lengths to secure user data and keep it safe. Using real data to build a query is a temptation for developers, especially if you can just upload a `.zip` folder and ask AI to generate the needed code to run analytics or transform it for another use. Sharing this data might also directly violate regulations like GDPR or CCPA.
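If sample data really must go into a prompt, one option is to mask obvious personal identifiers first. The sketch below is a deliberately naive illustration that assumes only email addresses and long digit runs (such as phone or account numbers) need masking. The patterns and the `redact` function are hypothetical, will miss many kinds of personal data, and are not a substitute for a proper compliance review.

```python
import re

# Simplistic email matcher, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Runs of 7 or more digits, optionally separated by single spaces or dashes.
LONG_NUMBER_PATTERN = re.compile(r"\b\d(?:[ -]?\d){6,}\b")


def redact(text: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL_PATTERN.sub("<EMAIL>", text)
    return LONG_NUMBER_PATTERN.sub("<NUMBER>", text)


if __name__ == "__main__":
    print(redact("Contact jane.doe@example.com or 555-123-4567"))
```

Generating fully synthetic records with the same shape as the real data is usually the safer choice, since nothing real ever leaves your environment.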
Using GitHub Copilot safely
Despite all these concerns, GitHub Copilot can still be a very valuable tool if used cautiously. Here are our suggestions for avoiding these common security and privacy risks.
Review Code Suggestions Carefully
Just as you would probably not run random, untested code, even locally, you should scrutinize any suggested code from Copilot or any other AI assist tool. Remember to treat Copilot's suggestions as just that: suggestions. Read each one carefully to see if it makes sense, and use it as a learning tool. We encourage you to always check if the suggested code lives up to your organization's coding standards and security guidelines. Always remember it is your responsibility once the code is pushed.
Just like with any code you are reviewing, remember to use the correct tools to check for issues. You should also scan for any malicious packages throughout your applications' dependencies.
Developers can leverage ggshield, the GitGuardian CLI, and pre-commit hooks to check for malicious dependencies and any known vulnerabilities before they commit the code–before it can ever reach production.
Avoid using any secrets in your code
According to its documentation, GitHub Copilot for Business does not train on your private code. Even so, it is still crucial not to share your secrets anywhere if possible. You might think that leaking credentials into an AI-assist tool would require deliberately copying and pasting them. However, if you have integrated Copilot into your IDE or editor, it is always reading your code and trying to anticipate what you need next. The only true way to prevent any secrets from leaking into a code assist tool, or anywhere else, is to eliminate any plaintext credentials from the code.
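As a small illustration of keeping credentials out of source files, the sketch below reads a token from the environment at runtime rather than hard-coding it. The variable name `SERVICE_API_TOKEN` and the helper `get_api_token` are hypothetical. A dedicated secrets manager or vault is often a better fit, but the principle is the same: the editor, and any assistant reading it, never sees the value.

```python
import os


def get_api_token(env=os.environ, name="SERVICE_API_TOKEN"):
    """Read a credential from the environment instead of hard-coding it.

    SERVICE_API_TOKEN is a hypothetical variable name used for illustration.
    """
    token = env.get(name)
    if not token:
        # Fail loudly rather than falling back to a value stored in the code.
        raise RuntimeError(f"{name} is not set")
    return token
```

Passing the mapping in as a parameter keeps the function easy to test without touching the real environment, for example `get_api_token({"SERVICE_API_TOKEN": "value-for-tests"})`.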
Finding and helping teams eliminate secrets is exactly what the GitGuardian Secrets Detection Platform has been helping teams accomplish for years. While there are multiple approaches to how to best store and access secrets securely, the first step is to identify what secrets exist and plan your course of action to eliminate them.
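To give a feel for what identifying secrets involves, here is a deliberately simplified sketch that flags lines matching two illustrative patterns: the well-known `AKIA` prefix format of AWS access key IDs and a quoted value assigned to a suspicious variable name. It is an assumption-laden toy, not how any particular detection product works, and real detection covers far more credential types with far fewer false positives and misses.

```python
import re

# Illustrative patterns only; real secret detection is much broader.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_assignment": re.compile(
        r"\b(?:password|passwd|secret|token|api_key)\b\s*[:=]\s*[\"'][^\"']{8,}[\"']",
        re.IGNORECASE,
    ),
}


def find_candidate_secrets(source: str):
    """Return (line_number, pattern_label) pairs for lines that look like they hold a secret."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

A list of candidate locations like this is only the starting point: each finding still has to be confirmed, revoked or rotated if it is real, and moved to secure storage.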
Tune your Copilot privacy settings
GitHub provides settings that allow users to control some aspects of data sharing with Copilot. Review and configure these settings to minimize data sharing where possible, especially in environments where privacy is a significant concern.
Train developers on Security Best Practices
Developers are on the front lines, delivering features and applications at an ever-increasing rate. We must work to ensure that any developer using Copilot is aware of the threats and trained on your organization's security best practices. Developers, especially less experienced developers, need to understand the potential risks of relying too heavily on AI-generated code.
We need to find a balance, though, and not simply discourage all use of Copilot, as AI assist tools are not going away and are only likely to gain wider adoption in the near future. Security needs to stop being the 'department of no' and become known as the team that empowers developers to work more safely and efficiently overall.
Good Copilot practices are good code practices
GitHub Copilot is an increasingly valuable tool that can significantly speed up the coding experience and reduce some of the toil developers face daily. We need to remember it is not without its security and privacy challenges. Developers and organizations need to be deliberate about how they adopt and use Copilot. As with any new technology, the key lies in balancing the benefits with the potential drawbacks, making informed decisions, and prioritizing security and privacy at every step.
As with any code, you should always be careful not to include sensitive information, like customer data or plaintext credentials, in your prompts. This is especially true when leveraging public LLMs.
No matter where it originated, teams should take steps to review and scan AI-generated suggestions to find issues before the code is committed. GitGuardian is here to help with our Secrets Detection.