A talk by Diamond Bishop at MCP Dev Summit North America 2026
TLDR: Getting an AI agent to work is a different problem from getting it to run in production without someone watching it. Datadog built its first hundred agents across SRE, code generation, and security investigation. The hardest parts had nothing to do with model quality.
Datadog’s whole business is watching systems fail in real time. Building agents it couldn’t observe was not a comfortable position to be in. Diamond Bishop, Director of Eng/AI, walked through how they got from 0 to 100 production agents and what has to be accomplished before the next thousand.
These are the five lessons that stuck.
1. Treat agents as your first customers
Datadog already has an MCP server that lets external agents query its platform directly. That’s table stakes now. The more interesting point was about what teams aren’t doing.
Most product and UX teams still design entirely for human users. Meanwhile, developers are quietly building workarounds to make those same interfaces machine-readable. The gap is real and it’s widening. The fix is organizational: get design teams thinking about agents as a user class before engineers have to compensate for them.
The framing Datadog uses internally is a riff on the old Bezos API mandate. Every interface inside your company should be something an agent can use. If a task can only be completed through a UI built for humans, that’s a gap in your platform, not a feature.
2. Run agents in the background, not on your laptop
The agents that are actually earning their keep at Datadog aren’t the chat-based ones. They’re the ones running quietly in the background on real events.
Datadog has three agents in GA today.
- Bits.ai SRE: autonomous alert investigator. It fires when something breaks and traces the issue before the engineer gets to their desk in the morning.
- Bits.ai Dev: watches for errors and latency problems in live services and proposes code fixes without waiting to be asked.
- Security Analyst: works through investigation checklists on concerning alerts automatically, handling the repetitive triage work that humans were doing by hand.
All three share the same architectural requirement: they run without a human in the loop. That means they need to be event-driven, containerized, and durable. Datadog runs these with Temporal for durability. Running long-lived agents on local machines is a fast path to fragile systems.
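To make "event-driven and durable" concrete, here is a minimal sketch of an event handler that checkpoints after every step, so a restarted worker resumes instead of starting over. It is not Temporal's API and not Datadog's implementation. The step names, the `handle_event` function, the `fail_at` switch, and the JSON state file are all assumptions made for illustration. A workflow engine such as Temporal provides this kind of durability as a managed service rather than through local files.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical investigation steps; a real agent would call models and tools here.
STEPS = ["collect_signals", "form_hypothesis", "write_summary"]


def run_step(name: str, event: dict) -> str:
    """Stand-in for real work; returns a short result string."""
    return f"{name} done for alert {event['alert_id']}"


def handle_event(event: dict, state_dir: Path, fail_at: Optional[str] = None) -> dict:
    """Process one alert event, persisting progress after every completed step."""
    state_file = state_dir / f"{event['alert_id']}.json"
    if state_file.exists():
        state = json.loads(state_file.read_text())  # resume from the last checkpoint
    else:
        state = {"done": {}}
    for step in STEPS:
        if step in state["done"]:
            continue  # completed in an earlier run, so skip it
        if step == fail_at:
            raise RuntimeError(f"simulated crash before {step}")  # test hook only
        state["done"][step] = run_step(step, event)
        state_file.write_text(json.dumps(state))  # checkpoint
    return state
```

If the worker dies partway through, calling `handle_event` again with the same event picks up at the first unfinished step. That property is what lets an agent run unattended on real events.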
3. Don’t ship an agent you can’t measure
If you don’t have an eval system before you launch, don’t launch.
Datadog runs offline eval, online eval, and a living eval system that updates as behavior drifts. Models change and data drifts, so an agent that performed well in testing will eventually diverge from production reality unless a feedback loop catches it.
The practical extension of this is making your eval system itself agent-accessible. Expose it through an MCP server, let an agent work the improvement loop, and you get a system that can get better on its own over time.
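The offline half of that setup can be sketched as a small launch gate. This is an illustration under stated assumptions, not Datadog's eval system: `EvalCase`, `pass_rate`, `ready_to_launch`, the substring check, and the 0.9 threshold are all hypothetical, and a real system would score answers far more carefully.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring the answer must contain (a deliberately crude check)


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("an empty eval set cannot justify a launch")
    passed = sum(1 for c in cases if c.expected.lower() in agent(c.prompt).lower())
    return passed / len(cases)


def ready_to_launch(agent: Callable[[str], str], cases: List[EvalCase],
                    threshold: float = 0.9) -> bool:
    """Gate the launch on the offline pass rate (threshold is an assumed value)."""
    return pass_rate(agent, cases) >= threshold


def toy_agent(prompt: str) -> str:
    """Toy agent that only understands disk alerts."""
    if "disk" in prompt:
        return "Root cause: disk full on host"
    return "Unknown"


CASES = [
    EvalCase("disk alert on db-1", "disk full"),
    EvalCase("latency spike on checkout", "slow query"),
]
```

The same `pass_rate` function could be run on sampled production traffic for the online half, and the case list is the part that has to keep growing as behavior drifts.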
4. Build to rewrite, not to preserve
Model rankings flip faster than most teams expect. A few months ago the conventional wisdom in some circles was that Anthropic’s models had plateaued. Then Claude came back. Now Codex is getting attention again. No one knows where it stabilizes.
The practical response is to build agent harnesses that don’t assume a specific model or framework. Keep them simple enough that swapping a model out isn’t a refactor project. The thing worth preserving is not the harness itself but the memory system that holds the knowledge your agents have accumulated. That’s what lets you carry learnings forward when the underlying model changes.
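One way to picture that separation is a thin harness that depends only on a one-method model interface, with memory living in its own object. The sketch below is an assumption-laden illustration: `Model`, `Memory`, `Harness`, and the two toy models are hypothetical names, and the "models" simply count prompt lines so the example runs without any provider SDK.

```python
from typing import List, Protocol


class Model(Protocol):
    """Anything with a complete() method can be plugged into the harness."""
    def complete(self, prompt: str) -> str: ...


class Memory:
    """Accumulated notes, stored independently of any particular model."""
    def __init__(self) -> None:
        self.notes: List[str] = []

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def context(self) -> str:
        return "\n".join(self.notes)


class Harness:
    """Deliberately thin: build a prompt, call the model, record the result."""
    def __init__(self, model: Model, memory: Memory) -> None:
        self.model = model
        self.memory = memory

    def run(self, task: str) -> str:
        prompt = f"{self.memory.context()}\n{task}".strip()
        answer = self.model.complete(prompt)
        self.memory.remember(f"{task} -> {answer}")
        return answer


class ModelA:
    """Toy model: reports how many lines of prompt it received."""
    def complete(self, prompt: str) -> str:
        return f"A:{len(prompt.splitlines())}"


class ModelB:
    """A second toy model with the same interface."""
    def complete(self, prompt: str) -> str:
        return f"B:{len(prompt.splitlines())}"
```

Swapping models is then a one-line change, `Harness(ModelB(), memory)`, and the new model sees everything the old one recorded because the `Memory` object outlives the harness.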
5. Multiplayer means more than it used to
“Multiplayer” used to mean multiple humans working in the same space at the same time. That’s no longer the only configuration that matters. At Datadog, production agent systems run three distinct pairings:
- Human working with human
- Human working with agent
- Agent working with agent
Each one needs its own communication patterns. You can’t design for one and assume the others work.
The “who watches the watchman” problem is real. An agent that monitors another agent still needs oversight somewhere. Designing that layer in explicitly, rather than hoping it works out, is what separates a proof of concept from a production system.
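An explicit oversight layer can be as small as a rule for when a reviewing agent's verdict is not enough and a human has to look. The sketch below assumes a reviewer that returns an approval flag plus a self-reported confidence; `Verdict`, `decide`, the 0.8 cutoff, and the list used as a human queue are all hypothetical stand-ins, not something described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    approved: bool
    confidence: float  # 0.0 to 1.0, as reported by the reviewing agent


def decide(action: str,
           reviewer: Callable[[str], Verdict],
           human_queue: List[str],
           min_confidence: float = 0.8) -> str:
    """Route an agent's proposed action through a reviewer, escalating when unsure."""
    verdict = reviewer(action)
    if verdict.confidence < min_confidence:
        human_queue.append(action)  # the reviewer is unsure, so a human decides
        return "escalated"
    return "approved" if verdict.approved else "rejected"


def toy_reviewer(action: str) -> Verdict:
    """Toy reviewer: confident about restarts, unsure about anything else."""
    if action.startswith("restart"):
        return Verdict(approved=True, confidence=0.95)
    return Verdict(approved=True, confidence=0.4)
```

The point of writing the rule down is that the escalation path exists in code and can be tested, rather than depending on someone noticing that the watcher itself went wrong.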
Diamond Bishop is Director of Eng/AI at Datadog. The Agentic AI Foundation is the home of open agentic standards and open source infrastructure. To learn more about MCP and connect with engineers thinking through these problems, visit aaif.io, join the conversation in the AAIF Discord, or join us at an upcoming AAIF event.