ChatGPT came into 2023 with ferocious force, but AI and large language models are not all that new; before I was writing articles about security and ChatGPT, I was writing them about GitHub Copilot. There is no doubt that large language model tools like ChatGPT are here to stay. But as more and more software engineers use ChatGPT, what can we expect? And more specifically, what are the security risks? In this article, we will discuss why ChatGPT needs to be considered part of your organization's attack surface even if you don't use it, and break down some of the biggest security concerns right now.

Why ChatGPT needs to be part of your threat landscape

Before delving deeper into ChatGPT, let's take a moment to discuss the security concerns surrounding GitHub, as there are similarities between the two platforms. GitHub is used by nearly 100 million developers: it hosts their side projects, and they use it to contribute to open-source projects, to educate themselves, and of course, to showcase their portfolios. However, GitHub is also a place where sensitive information can easily be leaked. GitGuardian's 2023 State of Secrets Sprawl report revealed that over 10 million secrets, such as API keys and credentials, were exposed in public repositories in 2022 alone. A lot of these secrets actually belonged to organizations but were leaked through personal or unconnected accounts. Toyota, for instance, doesn't use GitHub itself, yet a contractor accidentally leaked database credentials associated with a Toyota mobile application in a public GitHub repository.

This raises concerns about ChatGPT and other LLM tools because, like GitHub, even if your organization isn't using ChatGPT, your employees almost certainly still are. Within the developer community, there's a palpable fear of falling behind if you don't use such tools to boost productivity. However, again like GitHub, we have limited control over what employees share with ChatGPT, and there's a high chance that sensitive information may be stored on the platform, potentially resulting in a leak.

In a recent report, data security service Cyberhaven detected and blocked requests to input data into ChatGPT from 4.2% of the 1.6 million workers at its client companies because of the risk of leaking confidential information, client data, source code, or regulated information to the LLM.

Source: DarkReading

One of the best metrics we have on which tools developers are using is, ironically, the number of secrets leaked on public GitHub. According to the GitGuardian 2023 State of Secrets Sprawl report, leaked OpenAI API keys saw a massive increase towards the end of 2022, along with mentions of ChatGPT on GitHub, showing a clear trend of developers actually using these tools.

Graph from State of Secrets Sprawl report for 2023

The risk of leaking data

Wherever there is source code, there are secrets, and ChatGPT is often used as a sort of code assistant or co-writer. While there have already been instances where ChatGPT experienced data breaches, such as accidentally exposing query history to unrelated users, the more concerning issue is the storage of sensitive information in ways that are completely inappropriate and insecure for its level of sensitivity.

Storing and sharing sensitive data like secrets should always be done with a heightened level of security, including high-quality encryption, tight access control, and granular logs showing who accessed the data, when, and from where. However, ChatGPT is not designed to handle sensitive information: it lacks that kind of encryption, strict access control, and access logging. This is similar to git repositories, where sensitive files often end up despite the lack of sufficient security controls. The result is sensitive information sitting in an unencrypted database that is likely to be a hot target for attackers.

In particular, personal ChatGPT accounts that employees may use to avoid detection from work combine weaker security with a complete history of every query and piece of code entered into the tool. This could be a treasure trove of sensitive information for attackers, posing a significant risk to organizations regardless of whether they allow or use ChatGPT for their day-to-day operations.
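By way of contrast, here is a minimal sketch of what handling a secret properly can look like: the application pulls the credential from a dedicated secrets manager at runtime, so the value never sits in source code, a chat prompt, or a conversation history. AWS Secrets Manager is used purely as an illustration, and the secret name is hypothetical.

```python
import json

import boto3  # AWS SDK for Python; any dedicated secrets manager follows the same idea


def get_database_credentials(secret_name: str = "prod/app/database") -> dict:
    """Fetch a credential from AWS Secrets Manager at runtime.

    The secret is encrypted at rest, access is gated by IAM policies,
    and every read is logged -- exactly the controls a chat history
    does not give you.
    """
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


if __name__ == "__main__":
    creds = get_database_credentials()
    # Use creds["username"] / creds["password"] to open the connection;
    # nothing sensitive is hard-coded or pasted into a prompt.
    print("Fetched credential for user:", creds.get("username", "<unknown>"))
```

The design point is simple: the secret only ever exists in memory at the moment it is needed, and its storage comes with the encryption, access control, and audit trail that ChatGPT's history does not provide.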

Tricking ChatGPT

So we have talked about leaking sensitive data to ChatGPT; data leaking from ChatGPT is also a big concern. ChatGPT will refuse to hand over sensitive information if you ask for it directly, responding with a generic, cookie-cutter answer. But ChatGPT is remarkably easy to trick. In the example below, we asked it to give us some AWS credentials and it refused, but once we fluffed out the query to make it seem less malicious, we got the answer we wanted.

Asking ChatGPT for a credential 
Asking ChatGPT for example credentials

About half of the tokens it provided as examples contained the word EXAMPLEKEY, but about half didn't. This raises the question of where these keys are coming from. All of them fit the AWS format, including character set, length, and entropy (excluding the example text). Does ChatGPT understand how these keys are constructed, or is it altering keys it found in its training data? More likely the latter. ChatGPT's training data includes Common Crawl, a publicly available web corpus that contains over a trillion words of text from a diverse range of sources on the internet. Included in this dataset is source code from public repositories on GitHub, and as we know, GitHub contains a huge amount of sensitive information. We also know that when GitHub first released Copilot, it was possible to get it to suggest real API keys and credentials. And since ChatGPT changes its answers significantly depending on how you ask something, it could be that asking the right questions gets it to spit out sensitive information from the Common Crawl dataset.
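To make the "these look like real keys" observation concrete, here is a rough sketch of the two checks described above: whether a string matches the usual AWS access key shape, and how much Shannon entropy the secret part carries. This is my own illustration, not part of the original test, and the patterns and threshold are assumptions based on commonly observed key formats, not an official specification.

```python
import math
import re

# Typical shapes of AWS credentials (observed conventions, not an official spec):
# access key IDs are 20 upper-case alphanumerics starting with a prefix like AKIA,
# secret keys are 40 characters drawn from a base64-style alphabet.
ACCESS_KEY_RE = re.compile(r"^(AKIA|ASIA)[A-Z0-9]{16}$")
SECRET_KEY_RE = re.compile(r"^[A-Za-z0-9/+]{40}$")


def shannon_entropy(value: str) -> float:
    """Bits of entropy per character -- random keys score noticeably higher
    than English-looking placeholder text."""
    if not value:
        return 0.0
    counts = {ch: value.count(ch) for ch in set(value)}
    return -sum((c / len(value)) * math.log2(c / len(value)) for c in counts.values())


def looks_like_real_aws_pair(access_key: str, secret_key: str) -> bool:
    """Heuristic only: the formats match and the secret is high-entropy."""
    return (
        bool(ACCESS_KEY_RE.match(access_key))
        and bool(SECRET_KEY_RE.match(secret_key))
        and shannon_entropy(secret_key) > 4.0  # assumed threshold
    )


if __name__ == "__main__":
    # AWS's own documented example credentials, used here as harmless test input.
    print(looks_like_real_aws_pair("AKIAIOSFODNN7EXAMPLE",
                                   "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"))
```

Checks like these are exactly why the keys ChatGPT produced are unsettling: they pass the format and entropy tests that secret scanners rely on, whether or not they are live.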

ChatGPT isn’t a very good software engineer

The other security concern with ChatGPT is one I discussed back when Copilot was supposedly stealing all the jobs, before ChatGPT arrived. Diving into some research, I discovered the concept of AI bias: we trust AI much more than we should. It's like a friend who is very confident in their answers, so you assume they are right, until you finally discover they are an idiot who likes to talk a lot (in walks ChatGPT). ChatGPT often gives you example code that is completely insecure, and unlike forums such as Stack Overflow, there is no community to warn you against it. For example, when you ask it to write code to connect to AWS, it hard-codes credentials instead of handling them in a secure way, such as using environment variables.

ChatGPT hardcoding secrets
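For illustration, the difference looks something like this. This is my own reconstruction of the pattern, not the exact code ChatGPT produced: the commented-out snippet bakes the credentials into the file, while the version below it pulls them from the environment so they never enter source control or a chat prompt.

```python
import os

import boto3  # AWS SDK for Python

# --- What a generated answer often looks like (do NOT do this) ---
# s3 = boto3.client(
#     "s3",
#     aws_access_key_id="AKIAIOSFODNN7EXAMPLE",          # hard-coded credential
#     aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
# )

# --- The safer pattern: read credentials from the environment ---
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Better still, omit the keys entirely and let boto3 resolve them from its
# default credential chain (environment variables, shared config, or an IAM role):
# s3 = boto3.client("s3")
```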

The problem is that a lot of developers will trust the solution their confident AI friend gives them, not understanding that it is insecure or why it is insecure.

What can we do?

AI will get better, but it is only as good as the data it is trained on. Large language models are trained on huge datasets that are not always of good quality, which means they cannot reliably distinguish good source code from bad, and it is why they offer poor coding practices as examples.

Educate developers - it's important to educate developers about the limitations of AI tools like ChatGPT. Instead of banning them, we should show developers why the suggestions are insecure and confront them with their AI biases. By doing so, we can help them understand the limitations of the technology.

Identify and secure secrets - to prevent sensitive information from being leaked via ChatGPT, it's also important to identify secrets and reduce secrets sprawl. This involves scanning repositories and networks for secrets, centralizing them in a secrets manager, and enforcing strict access control and rotation policies. By doing so, we reduce the likelihood that a secret will end up in ChatGPT's history. The sketch below shows what the first step can look like.
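As a rough illustration of the "identify secrets" step, this sketch walks a repository and flags strings that look like credentials. The patterns and paths are assumptions for the example; in practice you would use a dedicated secret-detection tool rather than a hand-rolled script, since real scanners cover hundreds of provider-specific formats and filter out false positives.

```python
import pathlib
import re

# Patterns for a couple of common credential shapes -- only a sketch,
# a real scanner recognizes far more formats than this.
PATTERNS = {
    "AWS access key ID": re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),
    "Generic API key assignment": re.compile(
        r"(?i)\b(?:api|secret)[_-]?key\s*[:=]\s*['\"][A-Za-z0-9/+_-]{16,}['\"]"
    ),
}


def scan_repository(root: str = ".") -> list[tuple[str, int, str]]:
    """Return (file, line number, pattern name) for every suspected secret."""
    findings = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, start=1):
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    findings.append((str(path), lineno, name))
    return findings


if __name__ == "__main__":
    for file, lineno, name in scan_repository():
        print(f"{file}:{lineno}: possible {name}")
```

Run something like this (or, better, a purpose-built scanner) in CI or as a pre-commit hook, then move anything it finds into a secrets manager and rotate it.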

Embrace it because it is here to stay - Finally, we should not resist the AI revolution, but instead, embrace it with caution. While leaked credentials are a legitimate concern, AI can also be a valuable tool if we understand its purpose and limitations. By taking steps to educate ourselves and secure our data, we can leverage the benefits of AI without compromising security.