San Francisco’s cable cars are the only moving National Historic Landmark in the United States, a century-old system that can deliver modern reliability when skilled people guide the machinery. Watching a gripman work the brake down Powell Street is a lesson in human-centered control. You can mechanize the track and instrument the descent, but you rely on a person to make the emergency calls that keep riders safe. This made the city a perfect backdrop for SRE Day San Francisco 2025.
Throughout the day, around one hundred Site Reliability Engineers, DevOps professionals, and other IT folks gathered for two tracks of talks. Across the 20 sessions, we heard a recurring sentiment: tools matter and telemetry matters, but human judgment is the boundary between graceful resilience and quiet catastrophe. The best automation behaves like that cable car system, ever present and predictable, but designed to work with people rather than erase them. That is the posture security and Site Reliability must take as AI systems, non-human identities, and agentic automation push deeper into production. Here are just a few highlights from the latest event, organized by the SREday team, which runs these gatherings around the world.
Automation that listens to people
In the session “The Human Factor in Site Reliability: Designing Automation That Amplifies Engineering,” Jimmy Katiyar, Senior Product Manager at SiriusXM, argued that automation without context can be spectacular at the wrong thing. Jimmy’s stories from SiriusXM cut through the hype around full autonomy. He did not deride automation, but he insisted that human judgment stay in the loop for decisions that can damage customer trust or compound the blast radius.
He tied this to concrete operational outcomes that need humans at the center. Teams can see faster mean time to recovery (MTTR) by pairing runbook automation and enriched alerts with explicit human decision points. There are fewer repeated incidents when engineers look past retries and ask whether a deeper change in a model or schema has broken a hidden contract. He has seen his team achieve higher reliability by allowing human judgment to pause automation when confidence is low and ambiguity is high.
AI and rules engines can do the heavy lifting and toil, but people must make the calls that depend on the broader context the model cannot see.
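To make this concrete, here is a minimal sketch of such an explicit decision point. The pipeline, names, and thresholds are illustrative assumptions, not SiriusXM's implementation.

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    action: str        # proposed remediation, e.g., "rollback"
    confidence: float  # 0.0-1.0, from the rules engine or model
    blast_radius: int  # number of services the action would touch

# Illustrative thresholds; real values come from operational experience.
CONFIDENCE_FLOOR = 0.9
BLAST_RADIUS_CAP = 3

def page_oncall(diagnosis: Diagnosis) -> str:
    return f"paged a human to review '{diagnosis.action}'"

def execute_runbook(action: str) -> str:
    return f"executed runbook step '{action}'"

def remediate(diagnosis: Diagnosis) -> str:
    """Automate the toil, but pause for a human when confidence is low
    or the blast radius is large."""
    if diagnosis.confidence < CONFIDENCE_FLOOR or diagnosis.blast_radius > BLAST_RADIUS_CAP:
        return page_oncall(diagnosis)
    return execute_runbook(diagnosis.action)

print(remediate(Diagnosis("restart pod", confidence=0.97, blast_radius=1)))
print(remediate(Diagnosis("drop table lock", confidence=0.55, blast_radius=7)))
```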

Observability that drives decisions
In the session “From Dashboard to Defense: Automating Resilience at Large Scale,” Sureshkumar Karuppuchamy, Engineering Lead at eBay, explained that dashboards do not act. People carrying a pager at two in the morning cannot parse a million metrics, and heroic incident responses burn out the best of us. He described a shift that starts with SLI-driven thinking. Measure latency and checkout success rather than chasing vanity counters. Make OpenTelemetry the first principle so metrics, traces, and logs tell the same story. Follow a user’s pain through the system without hoping a heatmap reveals meaning.
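As a rough illustration of what treating OpenTelemetry as a first principle can look like in code, here is a minimal sketch using the OpenTelemetry Python API. It assumes the SDK and an exporter are configured at startup, and the service and metric names are invented for the example.

```python
import time
from opentelemetry import metrics, trace

# Assumes the OpenTelemetry SDK and an exporter were configured at startup.
tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")

# SLIs that map to user pain, not vanity counters.
checkouts = meter.create_counter("checkout.attempts")
latency_ms = meter.create_histogram("checkout.latency", unit="ms")

def process_payment(cart) -> bool:
    return True  # stand-in for real business logic

def handle_checkout(cart) -> bool:
    # One span ties this request's metrics and logs to a single trace,
    # so all three signals tell the same story.
    with tracer.start_as_current_span("checkout") as span:
        start = time.monotonic()
        ok = process_payment(cart)
        latency_ms.record((time.monotonic() - start) * 1000)
        checkouts.add(1, {"outcome": "success" if ok else "failure"})
        span.set_attribute("checkout.outcome", "success" if ok else "failure")
        return ok
```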
The pivot from static thresholds to change-aware observability is especially relevant for AI systems. Static alarms treat the world like yesterday’s traffic and miss the shape of a new release or a sudden client-side error burst. Sureshkumar urged us to watch for meaningful deviation and to price it in operational risk terms, using capacity anomaly detectors, browser error spike detection, and AI forecasting models. We also need control loop security: policy enforcement for the agents themselves and guards against adversarial inputs that can skew telemetry.
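As a toy sketch of the difference, consider an alert that keys off deviation from a rolling baseline rather than a fixed line. This is an assumed illustration of the general idea, not any specific detector he showed; the window size and sensitivity are placeholders.

```python
from collections import deque
from statistics import mean, stdev

class ChangeAwareAlert:
    """Flag meaningful deviation from recent behavior, not a static threshold."""

    def __init__(self, window: int = 60, sensitivity: float = 3.0):
        self.baseline = deque(maxlen=window)  # recent error-rate samples
        self.sensitivity = sensitivity        # stdevs that count as anomalous

    def observe(self, error_rate: float) -> bool:
        anomalous = False
        if len(self.baseline) >= 10:  # wait for enough history to judge
            mu, sigma = mean(self.baseline), stdev(self.baseline)
            anomalous = sigma > 0 and abs(error_rate - mu) > self.sensitivity * sigma
        self.baseline.append(error_rate)
        return anomalous  # True means page, or feed a control loop

alert = ChangeAwareAlert()
for rate in [0.01, 0.012, 0.009, 0.011, 0.01, 0.013, 0.008, 0.012, 0.01, 0.011]:
    alert.observe(rate)
print(alert.observe(0.20))  # a release-shaped error burst stands out
```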
Sureshkumar walked us through a "staged autonomy pattern." First is 'shadow' mode to learn and build trust. Then comes 'suggest' mode, which keeps humans approving the plan. Finally, 'autonomous' mode runs with guardrails for transparency and reversibility. It mirrors the policy posture we want for non-human identities and privileged access, and it sets a standard for how agentic automation should behave in production: as a teammate that shows its work.
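A minimal sketch of the staged autonomy pattern, assuming hypothetical callables for planning, approval, execution, and rollback:

```python
from enum import Enum

class Mode(Enum):
    SHADOW = "shadow"          # compute and log the action, never execute
    SUGGEST = "suggest"        # propose the action, wait for human approval
    AUTONOMOUS = "autonomous"  # execute, with audit trail and rollback ready

def run_remediation(plan_fn, approve_fn, execute_fn, rollback_fn, mode: Mode):
    plan = plan_fn()                  # the agent always shows its work
    print(f"[audit] intent: {plan}")  # transparency in every mode
    if mode is Mode.SHADOW:
        return None                   # learn and build trust first
    if mode is Mode.SUGGEST and not approve_fn(plan):
        return None                   # a human kept veto power
    ok = execute_fn(plan)
    if not ok:
        rollback_fn(plan)             # reversibility is the guardrail
    return ok

# Example: an agent suggests scaling out, and a human signs off.
run_remediation(
    plan_fn=lambda: "scale checkout service from 4 to 6 replicas",
    approve_fn=lambda plan: True,   # stand-in for a real approval flow
    execute_fn=lambda plan: True,   # stand-in for the actual change
    rollback_fn=lambda plan: None,
    mode=Mode.SUGGEST,
)
```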

Chaos that proves or disproves what you believe
In “Transform chaos experiments into actionable insights using generative AI,” Saurabh Kumar, Principal Solution Architect, and Ruskin Dantra, Solutions Architect, both from AWS, showed that the problem is not the tooling. Anyone can spin up failure. The hard part is creating a good hypothesis and a disciplined verification step. They described a loop that starts with a steady state, builds a hypothesis informed by architecture and dashboards, then runs experiments through AWS Fault Injection Service, and finally feeds the rich telemetry back into an LLM for synthesis.
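For reference, kicking off one iteration of that loop against an existing AWS Fault Injection Service template can be as small as the sketch below. The template ID and tag are placeholders, and the hypothesis string is illustrative.

```python
import boto3

# Assumes an FIS experiment template (e.g., inject latency into one service)
# already exists; "EXT_PLACEHOLDER" stands in for its real ID.
fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXT_PLACEHOLDER",
    tags={"hypothesis": "p99-checkout-latency-stays-under-800ms"},
)

# Poll the experiment state; the resulting telemetry is what gets fed
# back to an LLM for synthesis.
experiment_id = response["experiment"]["id"]
status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(f"experiment {experiment_id} is {status}")
```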
They exposed a frequent failure pattern: vague hypotheses that lean on words like 'significantly' and 'consistent' collapse under verification. When they tied the hypothesis to business-relevant metrics, such as API gateway errors and user experience measures, the loop improved.
If you want AI to help you verify, you must write down what good looks like in terms that reflect the user, the application, and the system. Better yet, translate steady state into business observability so a deviation has a dollar sign. That is how you keep chaos from becoming a stunt and turn it into security observability and a threat modeling tool that exercises real trust boundaries.
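One way to write down what good looks like is to make the steady state a machine-checkable object. The sketch below is an assumed shape, with invented metric names and dollar figures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SteadyState:
    """A hypothesis precise enough for a machine, or an LLM, to verify."""
    metric: str                       # business-relevant, not a vanity counter
    threshold: float
    window_minutes: int
    dollars_per_breach_minute: float  # deviation priced in business terms

    def holds(self, observed: float) -> bool:
        return observed <= self.threshold

# Not "errors stay low," but a falsifiable claim:
checkout_errors = SteadyState(
    metric="api_gateway_5xx_rate",
    threshold=0.01,                    # at most 1% of requests may fail
    window_minutes=5,
    dollars_per_breach_minute=1200.0,  # illustrative revenue impact
)

print(checkout_errors.holds(observed=0.004))  # True: steady state held
```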

Data that tells the truth about usage
In “10 Billion Downloads: Insights and Trends in Open Source,” Avi Press, Founder & CEO at Scarf, explained that downloads are not users. His company keeps metrics on open source package downloads and usage. While impressive download numbers might generate buzz, automated systems pull down images and artifacts constantly, and a single project can rack up astronomical counts while active usage remains flat or even declines. The ratio of downloads to unique users can span orders of magnitude and often hides behind proxies, mirrors, and CI noise. If you are making security and product decisions based on download charts, you are probably flying blind.
Avi’s larger point was about observability and identity in the supply chain. Registries and gateways can see what gets downloaded, when, and from where. Rich user agents and CI flags help distinguish humans from automation. That distinction matters when you are prioritizing vulnerability response, dependency policy, and how you invest in controls that protect your distribution. It also matters when you evaluate the risk of secrets sprawl in build systems and the behavior of agents that authenticate to registries on your behalf.
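A crude sketch of that human-versus-automation distinction might look like the heuristics below. The markers and header names are illustrative assumptions; real registries combine far richer signals.

```python
# Illustrative heuristics only; production telemetry uses many more signals.
AUTOMATION_MARKERS = ("docker/", "curl/", "python-requests", "go-http-client", "jenkins")
CI_HEADERS = ("x-ci", "x-build-id")  # hypothetical header names

def looks_automated(user_agent: str, headers: dict) -> bool:
    ua = user_agent.lower()
    if any(marker in ua for marker in AUTOMATION_MARKERS):
        return True
    return any(h in headers for h in CI_HEADERS)

print(looks_automated("python-requests/2.31.0", {}))        # True: a script
print(looks_automated("Mozilla/5.0 (Macintosh; ...)", {}))  # False: likely human
```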
Avi also said people do not like to upgrade or even think about upgrading. From their research, Scarf estimates that 80% of traffic pulls the `:latest` version of a package. This can introduce new vulnerabilities, especially in fast-moving attacks like the one we recently saw with Shai-Hulud. At the other extreme, when a project downloads a specific "pinned" version of a package, it often never downloads another version of that package again. That likely means many packages with known vulnerabilities are being reused over and over simply because they have proved stable so far.
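Both failure modes are easy to check for. Here is an illustrative sketch of a policy audit over version references; the vulnerable-version set is a stand-in for a real advisory feed.

```python
# Hypothetical advisory data; a real check would query a vulnerability feed.
KNOWN_VULNERABLE = {"legacy-lib:1.0.3"}

def audit_reference(ref: str) -> str:
    name, _, tag = ref.partition(":")
    if tag in ("", "latest"):
        return f"{ref}: floating tag, exposed to fast-moving supply chain attacks"
    if ref in KNOWN_VULNERABLE:
        return f"{ref}: pinned to a version with known vulnerabilities"
    return f"{ref}: ok"

for ref in ("api:latest", "legacy-lib:1.0.3", "postgres:15.4"):
    print(audit_reference(ref))
```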

Human-centered automation
Complex, automated systems are always socio-technical. Every outage that matters crosses a human boundary: a pager decision, a rollback judgment, a missing intent record. The universal truth beneath the sessions is that human judgment is the only control that adapts fast enough to ambiguity. We keep people in the loop because ambiguity beats rules, and production is mostly ambiguity.
Ambiguity defeats static policy
Dashboards, thresholds, and runbooks fall short at the worst moments because production signals are underspecified representations of reality. Metrics compress experience. Traces compress causality. Logs compress narrative. Compression discards context. When AI or rules act on compressed inputs without recovering context, they optimize the wrong objective. Security feels this first. Adversaries live in the discarded context. They weaponize edge cases that static policy cannot name.
Incentives warp signals
Incentives across teams and AI systems push speed and simplicity, not fidelity. Teams collect telemetry that is cheap to compute and easy to visualize. Vendors sell abstractions that look clean while burying the messy parts. Open source download counts become a proxy for adoption because they are easy to chart. The same dynamic exists in access control where broad roles and shared secrets move faster than granular IAM for machine identities. Speed without fidelity produces brittle control and operational risk.
Verification lags change
We discover governance gaps after damage, as verification still moves slower than change. Hypotheses remain vague in many experiments and POCs. Evidence remains scattered. Postmortems reconstruct truth from fragments. LLMs can summarize, but they cannot replace a precise definition of steady state or explicit success criteria. Until verification is codified as a fast loop tied to business outcomes, risk signals arrive too late to shape decisions and we will continue to see timid automation or reckless autonomy.
Trust is the real SLO
Trust is the currency customers spend to stay when something goes wrong. Trust is also the currency engineers spend to grant autonomy to a system. If a rollout can explain itself, if an agent can justify access, if a failure can be reversed cleanly, people keep trusting. If not, they route around automation. Security is trust formalized as policy and evidence, just as reliability is trust expressed as steady state behavior. The universal truth is that both disciplines are negotiating the same contract.
The craft of shared control
San Francisco’s steep terrain runs on small rituals of control, and SRE follows the same rhythm. We need to purposefully refine our automation to achieve the metrics that matter to the business. The message was to build systems where humans and machines share responsibility through transparent, auditable, and reversible decisions.
Your author also got to talk about another kind of observability, laying out the findings of the State of Secrets Sprawl report. Just as with observability, teams need a plan to identify, store, and automate the rotation of secrets at scale. Those plans share a common path: start small by choosing one decision boundary, make it observable, practice the rollback, and repeat. Resilience grows not from total autonomy but from the craft of earning trust one reliable action at a time.
