Secrets Sprawl and AI: Why Your Non-Human Identities Need Attention Before You Deploy That LLM
It seems every company today is excited about AI. Whether they are rolling out GitHub Copilot to help teams write boilerplate code in seconds or building internal chatbots to answer support tickets faster than ever, Large Language Models (LLMs) have rapidly pushed us into a new frontier of productivity. Advancements like retrieval-augmented generation (RAG) let teams plug LLMs into internal knowledge bases, making them context-aware and therefore far more helpful to the end user.
However, if you haven’t gotten your secrets under control, especially those tied to your growing fleet of non-human identities (NHIs), AI might speed up your security incident rate, not just your team's output. Before you deploy a new LLM or connect Jira, Confluence, or your internal API docs to your internal chat-based agent, let’s talk about the real risk hiding in plain sight: secrets sprawl and the world of ungoverned non-human identities.
Non-Human Identities And The Secrets They Hold
NHIs are everywhere in modern DevOps and cloud-native environments. Also known as machine identities, these are digital references used for machine-to-machine access. They can take a lot of different forms, such as service accounts, API keys for CI/CD pipelines, containers running microservices, or even AI agents accessing vector databases or calling APIs. They exist to move data, run tasks, and interact with other systems.
Each NHI requires credentials, or secrets, of some form to authenticate and gain the access needed to perform its work. Unlike people, who can use multifactor authentication or FIDO-based passwordless approaches to prove they really are the correct user of a system, NHIs mostly rely on a secret alone to connect.
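To make that contrast concrete, here is a minimal sketch of a typical NHI call path, assuming a hypothetical internal API endpoint and environment variable name: a single static token in a request header is the only thing proving who the caller is.

```python
import os

import requests  # third-party HTTP client

# A typical non-human identity: a pipeline job calling an internal API.
# The token below is the *only* proof of identity -- there is no MFA or
# passwordless ceremony backing it up. (Endpoint and variable name are
# hypothetical, for illustration only.)
API_TOKEN = os.environ["INTERNAL_API_TOKEN"]

response = requests.get(
    "https://internal-api.example.com/v1/deployments",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```

If that token leaks, whoever holds it simply is that pipeline as far as the API is concerned.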
Those secrets tend to sprawl across repos, cloud environments, collaboration tools, and knowledge bases. GitGuardian’s 2025 State of Secrets Sprawl report revealed that over 23.7 million secrets were leaked in public GitHub repos. That is not cumulative; that was the number added in just the year 2024.
The report also showed that more than 58% of secrets were generic, meaning they did not map to a specific known service or platform. These 'generic' secrets are most commonly used by internal services and homegrown NHIs.
Compounding the issue of secrets sprawl, NHIs are rarely tied to a single human user. Unlike employees or end users, there is often no offboarding plan for these NHIs. For many systems, that also means their secrets keep working, essentially, forever. Since NHI access levels must be set up front for most systems, there is also a tendency to scope these identities broadly so they can do a range of things, rather than following the principle of least privilege and granting only what is strictly needed.
No organization wants any secrets to leak, especially those tied to NHIs, but this is exactly what can happen in a hasty LLM deployment.
When RAG Retrieves A Secret
Early AI models were very limited in what they could actually do, bound only to the topics or specific data sets they were trained on. Retrieval-augmented generation (RAG) removes this limitation by allowing the LLM to fetch additional data as needed when prompted. Many companies are rushing to make their internal data sources available to agentic AI tools. Ideally, this would expose only the needed knowledge and nothing else. However, this is where things can go wrong. Let's walk through a simple RAG scenario, with a minimal code sketch after the steps:
- An internal user asks the AI assistant chatbot, “How do I connect to our internal dev environment?”
- The LLM checks Confluence or Jira for relevant documents.
- It finds an old page with a still-valid hardcoded password: "root:myp@ssword123!"
- The LLM includes that page in its context and says: “You can connect using the following credentials…”
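A stripped-down sketch of that flow shows how little has to go wrong. The retrieval and LLM helpers below are hypothetical stand-ins rather than any vendor's SDK; the point is that nothing between retrieval and generation inspects the context for secrets.

```python
# Illustrative RAG request path with hypothetical helpers -- not a real SDK.

def search_knowledge_base(query: str, top_k: int = 3) -> list[str]:
    """Pretend retrieval step: returns raw page text from an internal wiki."""
    # In a real system this would be an embedding similarity search against
    # Confluence, Jira, or another indexed knowledge base.
    return [
        "Dev environment runbook (2021): connect with root:myp@ssword123! "
        "over SSH to the jump host, then run ./bootstrap.sh",
    ]


def llm_complete(prompt: str) -> str:
    """Pretend LLM call: repeats whatever credentials the context handed it."""
    return "You can connect using the following credentials: root:myp@ssword123!"


def answer(question: str) -> str:
    context = "\n\n".join(search_knowledge_base(question))
    # Nothing here inspects the retrieved context for secrets, so the stale
    # password flows straight into the prompt...
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    # ...and straight back out in the response (and into your logs).
    return llm_complete(prompt)


print(answer("How do I connect to our internal dev environment?"))
```

Swap in a real vector store and a real model and the behavior is the same: whatever the retriever returns, the model will happily repeat.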
That is less than ideal, even if the user is a developer hurriedly trying to get their project deployed. It is even worse if the user is an attacker trying to steal whatever they can find after breaching your perimeter. The core issue is that these data source documents were never built with AI or secrets in mind. Unlike code and developer workflows, there are no safeguards in place to prevent someone from pasting in API keys, login instructions with passwords, or even full-blown database connection strings. This effectively turns your chatbot into a very friendly and helpful internal secrets-leaking engine.
Given that NHIs outnumber humans at least 45 to 1, it is highly likely that any secret leaked this way belongs to a non-human identity. Maybe no one ever rotated it. Maybe no one even knows it is there. Now it’s surfaced by your AI, logged, and exposed.
Logging and Feedback Loops Exposing Secrets
Beyond the risk of RAG surfacing secrets from source documents, AI engineers and machine learning teams can just as easily leak NHI credentials while building observability into these systems. Since we cannot see what is going on inside the models at runtime, we need to log the initial prompt, the retrieved context, and the generated response in order to tune the system.
If a secret appears in any one of those logged steps, you now have multiple copies of the same leaked secret. That would be worrying enough if your logs stayed inside your organization, but most dev teams rely on third-party logging tools, meaning your secrets are no longer just on your servers.
Unfortunately, in many organizations, engineers store logs in cloud buckets or local machines that are not governed by the usual security controls. Anywhere along the logging pipeline where they might be intercepted or read by an attacker is now a potential spot where a secret could be compromised. And if you’re using a third-party LLM (like OpenAI), you may have zero visibility into where those logs go.
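One mitigation that stays in your control is redacting secret-shaped strings before anything is written locally or shipped to a third party. Here is a minimal sketch; the patterns are illustrative and nowhere near exhaustive, so treat it as a seatbelt alongside a dedicated secrets scanner rather than a replacement for one.

```python
import logging
import re

# Illustrative patterns only -- real detection needs much broader coverage
# (provider-specific key formats, entropy checks, and so on).
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S+"),
    re.compile(r"(?i)bearer\s+[a-z0-9\-._~+/]+=*"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
]


def redact(text: str) -> str:
    """Replace anything that looks like a credential before it is logged."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_pipeline")


def log_interaction(prompt: str, context: str, response: str) -> None:
    # Redact every stage before it leaves the process; otherwise one leaked
    # secret becomes three copies across prompt, context, and response logs.
    logger.info("prompt=%s", redact(prompt))
    logger.info("context=%s", redact(context))
    logger.info("response=%s", redact(response))
```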
NHI Governance: The Foundation for Safe AI
The reality is you can’t secure AI unless you get a handle on the NHI secrets in the data sources that power it.
At GitGuardian, we’ve seen firsthand how secrets sprawl explodes when organizations connect AI to real-world systems. That’s why we’re building solutions that address both sides of the problem:
- Secret Security: Scanning code, config, docs, and logs to detect and clean up exposed secrets—wherever they are.
- NHI Governance: Mapping, tracking, and controlling non-human identities so you know:
- What secrets they’re using
- Where the secrets are stored
- What systems have access
- If the permissions are tightly scoped
- Whether those secrets have been rotated or revoked
This isn’t just about preventing AI from leaking secrets—it’s about understanding the web of machine identities inside your organization and making sure each one is secure, justified, and monitored.
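To make those questions concrete, the sketch below shows the kind of record an NHI inventory needs in order to answer them. The fields and values are illustrative, not a description of any particular product's schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class NHIRecord:
    """Illustrative NHI inventory entry -- not any vendor's actual schema."""
    name: str                        # e.g. "ci-deploy-bot"
    owner: str                       # team accountable for this identity
    secret_location: str             # vault path, never the secret value itself
    systems_accessed: list[str] = field(default_factory=list)
    scopes: list[str] = field(default_factory=list)  # should reflect least privilege
    last_rotated: datetime | None = None  # None here is itself a finding
    revoked: bool = False


record = NHIRecord(
    name="ci-deploy-bot",
    owner="platform-team",
    secret_location="vault://ci/deploy-bot/api-token",
    systems_accessed=["staging-k8s"],
    scopes=["deploy:staging"],
)
print(record)
```

Even a spreadsheet with these columns beats the common alternative, which is no inventory at all.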
Before You Deploy That Next LLM, Get Ahead of the Sprawl
If you're deploying AI today, or planning to soon, there are a few key things you can do right now to get ahead of the risk:
- Scrub Sources Before You Connect: Scan and clean every knowledge base you plan to use with RAG. Confluence, Jira, Slack, internal wikis. Treat them like code; secrets don’t belong there.
- Inventory Your NHIs: Build a list of your non-human identities: service accounts, bots, agents, pipelines. Track what secrets they use, and who owns them.
- Vault Everything: Move secrets out of code and into secrets managers. Use tools like HashiCorp Vault, CyberArk, or AWS Secrets Manager. Make sure rotation is enforced.
- Monitor And Sanitize AI Logs: Treat AI system logs as sensitive infrastructure. Monitor them. Sanitize them. Audit them regularly.
- Use Role-Based Access to RAG: Restrict what documents can be retrieved based on user roles and document sensitivity. Just because it’s in your knowledge base doesn’t mean the chatbot should share it with anyone who asks; a minimal sketch of such a retrieval guard follows this list.
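As an illustration of that last point (and the first one about scrubbing sources), here is a hedged sketch of a retrieval guard that filters documents by caller role and redacts secret-shaped strings before anything reaches the model. The role labels, document shape, and single regular expression are hypothetical; a production setup would lean on your identity provider and a dedicated secrets scanner.

```python
import re

# Hypothetical document shape: raw text plus an access label set at indexing time.
Document = dict  # {"text": str, "min_role": str}

ROLE_LEVELS = {"employee": 0, "developer": 1, "admin": 2}

SECRET_PATTERN = re.compile(r"(?i)(password|api[_-]?key|token|secret)\s*[:=]\s*\S+")


def guard_retrieved_docs(docs: list[Document], user_role: str) -> list[str]:
    """Drop documents above the caller's role, then redact secret-shaped text."""
    allowed = [
        doc for doc in docs
        if ROLE_LEVELS.get(user_role, 0) >= ROLE_LEVELS.get(doc["min_role"], 0)
    ]
    return [SECRET_PATTERN.sub("[REDACTED]", doc["text"]) for doc in allowed]


# Example: a developer never sees the admin-only runbook, and even permitted
# pages have credential-looking strings scrubbed before they hit the prompt.
docs = [
    {"text": "Onboarding guide: request access via the portal.", "min_role": "employee"},
    {"text": "Staging notes: api_key = abc123-fake-key", "min_role": "developer"},
    {"text": "Prod runbook: password = myp@ssword123!", "min_role": "admin"},
]
print(guard_retrieved_docs(docs, user_role="developer"))
# -> ['Onboarding guide: request access via the portal.', 'Staging notes: [REDACTED]']
```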
The Future Of AI Is The Future Of Machine-to-Machine Communication
The adoption of AI holds amazing promise, and RAG is making it even more powerful. But in this new landscape, machines are talking to machines more than ever. And those machines, your NHIs, are now accessing and potentially exposing your data while introducing new operational risks.
Don’t just secure your secrets, though that is undoubtedly part of the solution. The time has come to govern your non-human identities. Track them. Map them. Understand how they interact with your AI stack.
Because the real secret to secure AI isn’t just smarter models—it’s smarter identity management.
Want to learn more about how to build secure AI pipelines with strong NHI governance? Explore GitGuardian’s NHI Governance platform.