As a site reliability engineer in a global company, I'm running a modern (well, relatively modern, to be honest and modest) cloud-native stack: HashiCorp Vault as the secret manager, workloads on Kubernetes clusters in AWS (EKS), and development workflows automated through Jenkins (legacy) and GitLab CI. This setup is quite likely familiar to you — it's the standard playbook in the cloud-native era.

In theory, we have the right tools for both security and efficiency: After all, we have a state-of-the-art secret manager integrated with everything. But in reality, that's far from the truth. See if the following scenarios resonate with you:

Scenario A: A new colleague just joined the team.

Manager: "Your initial password to log in to your corporate account came to me via email, but since you can't log in to your mail account just yet, here, take a picture of my screen." (In some companies, taking a picture of a computer monitor would get you fired, I'm not kidding.)

Scenario B: A developer needs a temp password to access a database.

  • Dev: "Where is the newly created temporary password? Need it for debugging."
  • Ops: "In the Vault."
  • Dev: "I can't access Vault."
  • Ops: "No, you can't. It's not safe to open UI access to Vault. Corporate policy."
  • Dev: "Then how can I get the password?"
  • Ops: "Well... Technically, the password isn't in the Vault. There is a Jenkins pipeline that calls the Vault API to generate a temp password, then stores it in Jenkins secrets. You need to request access to the corresponding Jenkins pipeline, trigger it, then get the secrets from Jenkins."
  • Dev: "Why on earth do we store secrets in Jenkins when we have Vault, which we aren't allowed to use?"
  • Ops: "Corporate policy, just told you."

Scenario C: A new ops team member needs to update a certificate for a service running in production for the first time.

  • Ops: "Where is the old cert?"
  • Mentor: "In K8s as a secret."
  • Ops: "Where is the cluster?"
  • Mentor: "In AWS."
  • Ops: "How do I access that?"
  • Mentor: "You need to assume a specific role for that."
  • Ops: "Which role?"
  • Mentor: "Let me check."
  • Ops: "Where is the new cert?"
  • Mentor: "Depends. Either in Vault or generated by Let's Encrypt."
  • Ops: "How is the process not automated?"
  • Mentor: "Been on the roadmap for 3 years."

You've probably already figured out where I'm going with these stories (which, by the way, are not made up), and you're right: this fragmentation introduces new risks and operational challenges, even for global teams that have invested in best-in-class tools.

Today, let's take a closer look at this problem from an SRE's standpoint.


1. Secret Managers Alone Aren't Enough

As our environment scales, so does the operational overhead and complexity.

Imagine this: secrets, users, and roles are being created for hundreds of developers. All of these are managed across multiple, disconnected places: Vault, Cloud IAM, Kubernetes clusters, Jenkins secrets, and GitLab secrets.

Just when you think that's already more than enough to handle, there's more: besides usernames and passwords for humans, there are credentials for apps, services, scripts, and bots (API keys, roles, service accounts, certificates, and what have you), used by thousands of machines.

Each system has its own way of handling credentials and permissions, and not everything is centrally visible (not to mention controlled).

Secret managers are a critical tool, but they are not a complete solution for all secret and identity challenges, for a number of reasons:

  • Not all credentials, roles, or secrets originate from or are managed by secret managers (for example, IAM roles are managed within cloud provider IAM systems), as we can see above in those real-world examples.
  • Some secrets, like database initial credentials or temporary access keys, are generated dynamically by cloud services and may not be automatically synced to secret managers.
  • CI systems often have their own secret storage mechanisms, and not all secret managers natively integrate or sync with these systems. This means additional silos.
  • Kubernetes Secrets: Although in an ideal world, K8s secrets should be synchronized from secret managers automatically, in the real world, they could be created from various sources. This means sensitive data may exist outside the view of our main secret management solution.
  • Non-Human Identities (NHIs): Service accounts, application identities, and other NHIs may be provisioned and managed in different systems, making it difficult to have a single source of truth.

Relying solely on a secret manager can (and will) create blind spots. A more comprehensive approach is required for managing not only users and passwords, but also roles, certificates, and everything else across all systems.
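To make that "comprehensive approach" concrete, here is a minimal sketch (with hypothetical field names and sample data, not any specific tool's API) of normalizing identities collected from different silos into one inventory, so the same questions can be asked of every credential regardless of where it lives:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NHIRecord:
    """One normalized entry in the unified identity inventory."""
    name: str
    source: str  # e.g. "aws-iam", "vault", "k8s", "gitlab-ci"
    kind: str    # e.g. "role", "secret", "service-account"


def build_inventory(*collections):
    """Merge per-system identity lists into one deduplicated inventory,
    keyed by (source, name) so the same name in two different systems
    stays visible as two distinct identities."""
    inventory = {}
    for records in collections:
        for record in records:
            inventory[(record.source, record.name)] = record
    return inventory


# Hypothetical sample data standing in for the real collectors.
iam = [NHIRecord("ci-deploy-role", "aws-iam", "role")]
k8s = [NHIRecord("tls-cert", "k8s", "secret"),
       NHIRecord("tls-cert", "k8s", "secret")]  # duplicate collapses

inventory = build_inventory(iam, k8s)
print(len(inventory))  # 2 distinct identities
```

The point of the sketch is the keying decision: deduplicating within a source while keeping cross-source entries separate is what turns five disconnected lists into one queryable view.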

2. An Introduction to Non-Human Identities (NHIs)

As shown in previous examples, there are scripts, bots, service accounts, and automation tools running in our stack. Each one needs a way to prove who it is — just like a human user — but instead of passwords, they use tokens, API keys, roles, certificates, or other types of secrets. These are Non-Human Identities (NHIs): identities for machines, apps, and services that need to talk to each other without human intervention, and chances are, they do not all live in the secrets manager.

NHIs are everywhere in modern DevOps and SRE workflows. They power CI/CD pipelines, connect microservices, sync data between clouds and datacenters, and keep our infra running. But unlike human users, NHIs don't have a face, a Slack handle, or a manager. They're spun up and torn down by code, often with little oversight, in multiple places. That's why it's important to understand where they live, what they can access, and how they're managed.

The reality is, for every engineer on the team, there are probably hundreds of NHIs quietly doing their jobs in the background. They're everywhere — across clouds, clusters, CI systems. Because they're so easy to create (and forget), they're often over-privileged, under-monitored, and without clear ownership.

This makes NHIs a goldmine for malicious actors. If a token or key leaks, it can open the door to sensitive systems — sometimes with more power than any single human user. And with so many NHIs scattered across the whole environment, it's easy to lose track of who (or what) has access to what. Oh, by the way, if you'd like to develop an incident response playbook for leaked secrets, read my previous blog.

If we're not actively managing NHI security — tracking where they live, enforcing least privilege, and cleaning up what we don't need — we're leaving our infra exposed. Getting proactive about NHI security isn't just a best practice; it's mandatory for keeping systems safe in a world where automation is king.
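What does "getting proactive" look like in code? Below is a hedged sketch of an audit pass over one NHI record; the record shape, thresholds, and wildcard list are all illustrative assumptions, not a standard:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)
# Wildcard-style permissions that suggest over-privilege (illustrative).
BROAD_ACTIONS = {"*", "iam:*", "s3:*"}


def audit_nhi(nhi, now=None):
    """Return a list of findings for one NHI record (a plain dict with
    hypothetical keys: last_used, allowed_actions, owner)."""
    now = now or datetime.now(timezone.utc)
    findings = []
    if nhi.get("last_used") is None or now - nhi["last_used"] > STALE_AFTER:
        findings.append("stale: candidate for cleanup")
    if BROAD_ACTIONS & set(nhi.get("allowed_actions", [])):
        findings.append("over-privileged: wildcard permissions")
    if not nhi.get("owner"):
        findings.append("no owner: unclear accountability")
    return findings


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
bot = {
    "name": "legacy-sync-bot",
    "last_used": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "allowed_actions": ["s3:*"],
    "owner": None,
}
print(audit_nhi(bot, now=now))
```

A bot like this (unused for over a year, wildcard permissions, no owner) trips all three checks, which is exactly the kind of quiet, forgotten identity the previous paragraphs describe.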


3. The Vicious Cycle of Toil and Risk

From an SRE perspective, the chaos of scattered NHIs, secrets, and roles is a significant source of toil and a direct threat to reliability. Here's why this is a critical problem that needs a playbook:

  • Lack of observability: Impossible to get a complete inventory of all identities and their permissions. This creates blind spots, increasing the risk of orphaned credentials and violating the principle of least privilege. Unused credentials accumulate, creating "zombie" identities that expand the attack surface.
  • Increased security toil: Manually tracking, rotating, and auditing secrets across multiple systems is repetitive and error-prone. This operational overhead detracts from engineering efforts that could improve site reliability. Without a central view, enforcing security practices like rotation or least privilege becomes a manual, best-effort, easy-to-forget process, making it impossible to define, measure, and meet security SLOs.
  • Degraded incident response: In the event of a breach, the time to resolution (TTR) is significantly higher. The lack of a clear inventory means we can't answer the most critical questions: "What is compromised? What can it access? How do we revoke it?" SREs must manually find and revoke credentials across disconnected systems, making a quick response nearly impossible.
  • Compliance: Proving compliance with corporate policies and security standards becomes high-hanging fruit if we don't even have an overview.

When we can't oversee the lifecycle of NHIs and secrets, we enter a vicious cycle where toil increases, and reliability degrades. One of the core SRE principles is to automate away toil. Centralizing the management and observability of these NHIs is the first step to reducing risk.
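The "security SLOs" mentioned above can only exist once there is something to measure. As a sketch (with a made-up data shape and an arbitrary 90-day window), here is an SLI one could compute from a centralized inventory, such as the fraction of secrets inside their rotation window:

```python
from datetime import datetime, timedelta, timezone


def rotation_sli(secrets, max_age=timedelta(days=90), now=None):
    """Fraction of secrets within the rotation window: an SLI backing a
    rotation SLO such as '99% of secrets are younger than 90 days'."""
    now = now or datetime.now(timezone.utc)
    if not secrets:
        return 1.0  # vacuously compliant
    fresh = sum(1 for s in secrets if now - s["created_at"] <= max_age)
    return fresh / len(secrets)


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
secrets = [
    {"name": "db-password", "created_at": now - timedelta(days=30)},
    {"name": "old-api-key", "created_at": now - timedelta(days=400)},
]
print(rotation_sli(secrets, now=now))  # 0.5: only half within the window
```

Without central visibility, `secrets` can never be a complete list, and any SLO built on it is fiction; that is the vicious cycle in one line of code.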


4. The NHI Playbook: Establish a Central Plane for Observability

To address the fragmentation of NHIs and secrets, it's essential to establish a central platform that provides observability across all systems. An effective approach is to create a dashboard or an internal developer portal that serves as the source of truth. It should:

  • Aggregate data for full observability: Integrate with all sources — cloud IAMs, secret managers, CI/CD platforms, and Kubernetes — to collect and display every identity, secret, and role in one place.
  • Provide visibility: Offer a single plane for SREs and security teams to audit all credentials and permissions.

One option is to custom-build a portal tailored to our tools and environments. Using the APIs provided by the different tools (and perhaps AI coding assistants and MCP servers), we can get started quickly. For example, we can use or extend an open-source developer portal framework (e.g., Backstage), or write scripts to collect the data and build a web page to display it.

In this repository, I created some scripts to collect NHIs from different sources (AWS IAM, Vault, K8s, and GitLab CI).

For example, I can list all roles in IAM and check whether they can be assumed by a service:

    import boto3

    iam_client = boto3.client("iam")
    nhi_roles = []

    paginator = iam_client.get_paginator("list_roles")
    for page in paginator.paginate():
        for role in page["Roles"]:
            assume_role_policy = role.get("AssumeRolePolicyDocument", {})
            statements = assume_role_policy.get("Statement", [])
            for statement in statements:
                principal = statement.get("Principal", {})
                if "Service" in principal:
                    nhi_roles.append(
                        {
                            "RoleName": role["RoleName"],
                            "Arn": role["Arn"],
                            "ServicePrincipal": principal["Service"],
                        }
                    )
                    # A role is considered an NHI if it can be assumed by
                    # any service, so we can stop after the first service
                    # principal.
                    break

As another example, I can list CI/CD variables from GitLab projects and check whether machine/CI secrets are stored there:

    # `projects` is an iterable of python-gitlab Project objects,
    # e.g. from gl.projects.list(iterator=True).
    found_secrets = False
    for project in projects:
        variables = project.variables.list()
        if variables:
            found_secrets = True
            print(f"\nProject: {project.name_with_namespace}")
            for var in variables:
                # Do not print var.value, as it is a secret
                print(
                    f"- Variable Key: {var.key}, Scope: {var.environment_scope}"
                )

While this solution isn't trivial, centralizing visibility and control is a critical step toward risk reduction.


5. Why the DIY Solution Might Not Be For You

I wouldn't go so far as to say that building a custom tool for managing NHIs is an "anti-pattern," but sometimes it is, because it often creates more toil than it solves. Here's why the DIY approach can be a losing game from both a technical and operational standpoint:

  • More than just API calls: In the previous example, I provided four separate scripts for collecting NHI info from different sources. How about merging them into a microservice? Shall I build a Docker image and a Helm chart so that it can be deployed in a K8s cluster? Then it becomes a real service that needs to be version-controlled and lifecycle-managed. This is essentially committing to a permanent maintenance cycle.
  • More than just backend: Now the team is responsible for a full-stack product with a frontend page, and that's not trivial even if we use some open-source frameworks. And how about integrating it with the corporate SSO? Then a side project becomes a full-on engineering investment.
  • Engineering capacity: Every hour spent building and maintaining this internal tool is an hour not spent on other, more important stuff.
  • Reliability of the platform itself: The portal becomes one more production service that needs its own monitoring, upgrades, and on-call ownership.

The bottom line is that only you can decide if an in-house tool is the right choice for you. A DIY solution offers total control, but on the other hand, it could be a high-toil play.


6. A Managed Solution: NHI Governance

Instead of building and maintaining our own NHI aggregation platform, we can leverage a dedicated solution, like GitGuardian NHI Governance. It's designed to centralize the discovery, inventory, and management of Non-Human Identities (NHIs) and their secrets across the entire environment. Key features include:

  • Wide range of integrations: NHI Governance connects to a wide range of sources, including secret managers (Vault, AWS Secrets Manager, Azure Key Vault, etc.), cloud IAMs (AWS IAM, Azure Entra), CI/CD systems (GitLab CI), and K8s clusters.
  • Unified exploration map: A centralized, searchable inventory of all discovered NHIs, enriched with metadata such as source, environment, usage, and policy breaches.
  • Policy enforcement: GitGuardian applies security policies (informed by the OWASP Top 10 for NHIs) to detect risks like public/internal leaks, cross-environment or reused secrets, long-lived credentials, and more. Breaches are highlighted in the inventory and mapped visually for fast remediation.
  • Permission and blast radius analysis: Integrations with AWS IAM and Azure Entra provide deep context on permissions, roles, and the potential impact of compromised credentials, helping you prioritize remediation efforts.
  • Continuous tracking: Dashboards track breached policies over time, vault coverage, integration health, secret age distribution, and incident trends, providing actionable insights.
  • Secure and scalable: Integrations use secure auth methods (e.g., OIDC), require only read permissions, and never expose secret values. It's designed to scale with your environment and support multiple tenants or environments.

Choosing a platform like GitGuardian NHI Governance means avoiding the complexity and overhead of a DIY solution. What's more, we benefit from ongoing improvements, new integrations, and up-to-date security policies without additional engineering effort. This allows us to spend less time on data collection and more time on improving our security posture. But again, only you know which is better in your case: a custom-built solution or a managed one.


Summary

In complex cloud-native environments, SREs face a growing challenge: Non-Human Identities (NHIs) scattered across dozens of systems like Vault, AWS IAM, Kubernetes, CI systems, and more. The fragmentation creates blind spots, making it nearly impossible to maintain a complete inventory, not to mention enforce security policies or respond effectively to incidents.

While secret managers are essential, they are not the full picture. Building a custom internal developer portal to centralize this information is tempting, but could be costly; on the other hand, we can also choose a dedicated platform like GitGuardian NHI Governance.

Ready to move from fragmented visibility to centralized control? Stop chasing down scattered secrets and start proactively managing your NHI security today!