There's a meeting that happens about ninety days after a firm ships its first LLM-powered agent. Someone asks: is it actually working? And the room goes quiet, because nobody has the data to answer.
This is the performance analytics gap. Shipping an agent is now a one-week problem. Knowing whether the agent is getting better — week over week, against a stable benchmark — is still a discipline most teams haven't built.
Why the gap exists
Traditional software has a clean unit of measurement: it either does the thing it was specced to do, or it doesn't. Tests pass. Bugs file. Velocity ships.
Agents don't fit. Their output is probabilistic, the surface area is enormous, and the failure modes are subtle — wrong tone, missed context, plausible-sounding hallucinations. The classical telemetry stack (logs, error rates, p99 latency) tells you if the agent ran. It doesn't tell you whether it ran well.
The gap is methodological. We've been treating agents as services and asking service-shaped questions. The right question is closer to: is this agent improving as a worker?
What an agent performance program actually contains
An honest agent performance program has four components, and most teams have one or two:
Telemetry. Structured event logging at the level of every prompt, every retrieval, every tool call, every model response. Not "the agent ran" — every step the agent took, with inputs and outputs preserved.
Eval harness. A golden dataset of inputs and ideal-or-acceptable outputs, run against every meaningful change. The harness catches regressions before they ship and quantifies improvements that would otherwise be invisible.
Cost / latency / quality dashboards. Per agent, per task, over time. Cost per resolved ticket. Median time-to-first-token. Win rate against a baseline. The point isn't the dashboard; it's the conversation it forces.
Human-in-the-loop review. A labelling workflow where humans grade outputs the harness can't. The labels feed back into the golden dataset. The golden dataset gets sharper. The harness gets more useful. The loop closes.
The compounding part — again
The reason this matters is the same reason the metasystem matters: it compounds. A team running this loop ships better agents quarter over quarter, without that improvement depending on heroics. A team without it ships an agent, watches it drift, and ships a different agent.
The discipline isn't sexy. It's instrumentation, dashboards, weekly review, slow incremental improvement. It looks like the boring engineering practice it is. But the gap between teams that have it and teams that don't will, in eighteen months, be the gap between teams that compound and teams that don't.
That's the part worth building now.