Reliability Lessons From the Edges at SREday NYC

60% of the tracks of the New York City subway system are hidden underground. But there are many more invisible handoffs that keep everything running. Each train arrives because of signals, switches, power, dispatch, maintenance windows, and human judgment, which all line up so well that riders only notice the system when something goes awry. This is also true for our applications and distributed systems that we build in the enterprise, which was top of mind at SREday NYC Q2 2026.

Around 100 site reliability engineers, students, and other technically minded folks focused on high-throughput, scalable systems gathered at the Datadog office inside the New York Times building for a full day of sessions and conversations. Twenty subject matter experts gave talks and a mini-workshop, covering a wide array of topics around how to keep our applications running and displaying all green lights on our dashboards, as well as how to prepare to confidently answer questions and react when chaos does emerge. It was clear from the day's conversations that modern reliability depends on understanding the systems under the surface as much as the final production applications. Here are just a few highlights from this edition of SREday.

Building a World Model for Production

Shreyas Iyer, Co-Founder and CTO at Antimetal, presented "Building a World Model for Production." He said AI has made building and shipping dramatically easier, but our production systems remain mostly poorly documented, all the while continually evolving how everything is interconnected. That creates a representation gap, where the system's real behavior outpaces any human’s ability to hold it all in their head.

Taking action is the easy part for AI agents, as we build better tools to retrieve data and execute tasks. The hard part is having all the needed relevant context to take the "right" action. A useful agent needs to know what exists, what changed, how behavior propagates, and what those systems mean to the humans running them. Shreyas is pushing for wider adoption of a world model which is systematically built across four pillars of knowledge:

Structural - What are the components and how do they interact?
Temporal - What changed, and when?
Causal - How does behavior propagate?
Semantic - How humans refer to the system?

For example, when an exporter’s error rate spikes, an agent should be able to resolve the alert to a specific service and owner, map dependencies, find recent changes, and trace likely causes, just as an experienced and knowledgeable human operator would.

There are some real challenges that need to be solved for world models, such as how to handle the fact that all data eventually becomes stale. There is also the ever-present concern of how we define the systems that will act as the source of truth. World models also currently struggle with generalizing meaning across all teams, meaning the 'world' they represent must be smaller than ideal. Shreyas argued that capability will only continue to get cheaper, but context will remain the expensive part.

When Fast Enough, On Average, Still Fails

In her session "When Milliseconds Cost Millions: Designing Low-Latency AI Systems in Fintech," Anisha Manoharan, speaker from Bank of Julius Baer, brought us into the latency conversation fintech is having. For banks and payment processors, a “mostly fast” system can still result in costly failures. A single transaction may look like one clean JSON request returning a HTTP 200 response, but reality is that there are multiple hidden steps: authentication, customer history lookups, merchant risk lookups, feature computation, external AI calls, model inference, rules engine execution, compliance logging, and response auditing, ensuring the transaction went through.

All of that has to happen in roughly 100 milliseconds, millions of times a day, all around the clock.

The AI model is actually not the bottleneck for most calls, according to Anisha. Often, the real pain lives in the slow database call, the queued feature lookup, or the service nobody notices until traffic spikes. That is why site reliability engineering teams care about P99, the 99th percentile response time. She explained that averages lie. For example, a 35 millisecond average can make a dashboard look healthy while the slowest 1% of requests are taking long enough to create false declines, missed fraud, or broken customer experiences. At the fintech scale of thousands of transactions per second, even 1% pain can mean real money moving through slow, degraded, or unsafe paths.

Anisha highlighted the cascade failures an upstream service can cause if it slows, queues build, and timeouts fire. This can mean the system may "fail open" to keep payments flowing. By the next morning’s reconciliation, the dashboard may still look green while the loss is already booked. She said that we need to "watch the tail, not the average." Scale on latency, not just CPU, and design for the cascade.

Shifting Left Without Shoving Left

Ian Miller, Engineering Manager, SRE at Seismic, presented "Shifting Left - Evolution of the Software Development Life Cycle," where he framed shifting left as more than pushing work earlier in the software development life cycle. As he defined it, "Shifting left is about moving ownership, capability, and accountability closer to the people best positioned to act on them." When done poorly, it shifts cognitive load onto application teams without giving them the tools, context, or guardrails to succeed. This is what he was warning us against.

He walked us through how this pattern of shifting left emerged from the world of microservices, where it helped teams scale ownership. But it also added coordination overhead. When Kubernetes became the default operating layer, complexity shifted from infrastructure teams to developers, who were expected to deploy, monitor, and troubleshoot their own services. SRE followed the same arc: teams now own service-level objectives, or SLOs, carry on-call for the code they ship, and instrument their own systems while SRE builds the shared platform underneath.

Ian said on many teams, "shift left" has become "shove left." AI makes that distinction even more important, as faster code generation and more pull requests do not automatically increase organizational throughput. All that new code still mostly relies on the same fragile pipelines, with unclear ownership and weak observability downstream. He said that every shift redistributes complexity; it does not eliminate it.

Shift left works when capability shifts down through paved roads, internal developer platforms, policy-as-code, and self-service tooling. AI is not the answer by itself. It just amplifies the teams and platforms you already have.

When Agents Sound Right And Still Fail

In the session "Perfect Reasoning, Wrong Answer," Willem Pienaar, Co-Founder and CTO at Cleric, drew a clear distinction between coding agents and production agents and explained why it matters. In coding, the loop is mostly closed: write code, run tests, fail, adjust, run tests again. Production is different. It is open, moving, noisy, and transient. Unlike coding environments, it is hard to accurately reproduce real-world running conditions. The same issue may not fail the same way twice, and the environment around it is constantly changing.

That is where agents get dangerous, according to Willem. Agents can latch onto the first signal, mistake noise for cause, hallucinate a confident explanation, or anchor on a past incident that only looks somewhat similar. In production incidents, the wrong answer is worse than no answer because it burns time by pushing teams toward fixes that do not address the real problem.

The path forward is better grounding. Agents need log clustering, service and dependency context, temporal reasoning, and cited evidence so humans can review the chain of support. They also need a closed loop where production becomes the judge: alert, diagnosis, fix applied, return to baseline. An agent cannot reliably grade its own work. Independent verification matters because agents, like people under pressure, often do not know what they do not know.

The End Of Easy Confidence In Dashboards

The mood of SREday New York Q2 2026 was that the industry has entered a new phase of operational anxiety. Every talk seemed to begin from the same lived reality: teams are building faster than their operational models can absorb. AI has made that tension impossible to ignore, because it adds new failure modes, new costs, new latency patterns, new security concerns, and new ambiguity to systems that were already hard to reason about.

The room was certainly not anti-AI, but everyone was very aware that AI has removed the old comfort of slow change. The deeper message was that SRE is being pulled back to its original purpose: helping organizations move quickly without lying to themselves about reliability.

When “Working” Is No Longer Enough

“Working” has become too shallow a definition of reliability. A system can return a successful response, stay within average latency, pass basic health checks, and still fail the user, the business, or the operator. AI makes this more visible because failure can now look like a confident wrong answer, wasted tokens, unsafe output, poor retrieval, or a cost spike with no obvious code change.

But the same pattern applies outside AI as well. The surface can look green while the truth is hiding in the tail, in the dependency chain, in the backlog, in the queue, or in the human process behind the alert.

Speed Has Outrun Context

A deeper concern from many of the speakers was that speed has outrun context. Teams can ship more, generate more, instrument more, and automate more, but that does not mean they understand more. In fact, every new layer adds another place where meaning can get lost. More telemetry can become more noise. More services can become more coordination overhead. More AI assistance can become more unverified action.

Modern reliability problems are less about a lack of data and more about the lack of a trustworthy model for what the data means. That is why so much of the event pointed toward grounding. SRE teams need better ways to connect what changed, who owns it, what depends on it, how it behaved before, and whether the system returned to a healthy state after action was taken. AI agents, incident tools, observability platforms, and developer workflows only become useful when they are tied to evidence, baselines, history, and human judgment.

SRE Is The Discipline Of Understanding And Reacting To Reality

SREs are being asked to defend reality inside increasingly abstract systems. Cloud hid the machines, and Kubernetes hid the infrastructure. AI now hides intent, reasoning, cost, and quality behind fluent output. The next version of reliability work is about making those hidden layers visible again.

The future of SRE will belong to teams that can slow down just enough to understand what production is really saying, then use that understanding to move faster with confidence.

The Work Ahead Is All About Context

The largest takeaway from this SREday is that reliability work is moving into a phase where confidence, like trust, has to be earned, not assumed. The teams that will find the most success will connect signals to meaning, maintain clear lines of ownership, and verify that the system is actually healthy from the user’s perspective. Green lights still matter. They just cannot be the whole story anymore.

Any path forward we take must not slow innovation while we work to increase speed in a more accountable way. Your author addressed this head-on in a session about basing all authentication mechanisms on identity rather than on standing privilege. While this might feel like more work in the short term, the longer-term benefits enable the scale we need while eliminating an entire category of security risk.

We must build systems that preserve context, expose hidden handoffs, and make it easier for humans and tools to reason from evidence rather than instinct. SRE has always lived in that gap between what we think is happening and what production is actually doing. As systems get more abstract, that gap gets more expensive. Closing it is the work ahead.