AI Agent Observability: Tracing, Spans and Evals

TL;DR / Key takeaways

Tracing, spans, and evals are three different jobs. Traces and spans tell you what the agent did; evals tell you whether it was correct. Production agents need both.
Standardise on OpenTelemetry. The Microsoft Agent Framework, which reached General Availability on 3 April 2026, emits OpenTelemetry-compatible traces, so you avoid vendor lock-in and can route telemetry to Azure Monitor or any OTLP backend.
The 2026 shift is pilots-to-production. With 400,000+ custom agents deployed across 160,000+ organisations on Copilot Studio alone, the hard part is now reliability, observability, security, evaluation, and cost governance, not building the agent.
Instrument the full causal tree. Capture the user turn, every model call, every tool invocation, retrieval steps, and sub-agent handoffs as parent-child spans, with token and cost attributes on each.
Gate deployments on eval scores, not vibes. Continuous online evals make quality a monitorable metric you can alert on and roll back against.

The problem: agents that pass the demo and fail the audit

Almost every enterprise we work with has the same story. A team builds an agent that demos beautifully. It answers questions, calls a tool or two, and looks production-ready. Then it ships, and three weeks later someone asks a question that should be simple: why did the agent refund that customer twice? Nobody can answer, because nobody can see what the agent actually did.

This is the defining problem of 2026. The industry has moved decisively from pilots to production — over 160,000 organisations have deployed more than 400,000 custom agents on Copilot Studio alone — and the binding constraint is no longer "can we build an agent" but "can we run one we trust." The hard parts are reliability, observability, security, evaluation, and cost governance. Observability sits underneath all of them, because you cannot secure, evaluate, or cost-govern what you cannot see.

Agents break the assumptions classic monitoring is built on. A traditional service is deterministic: the same request runs the same code path. An agent is non-deterministic: it reasons, picks tools, retries, and delegates, and the same input can take a different route every time. So a dashboard showing 200 ms latency and zero HTTP 500 errors tells you almost nothing about whether the agent did the right thing.

Three layers: traces, spans, and evals

It helps to separate the concerns precisely, because teams routinely conflate them and then wonder why "we have logging" did not save them.

Layer	Question it answers	Primary signal	Tooling on Azure
Traces	What was the full sequence of steps for this run?	A tree of spans per agent run	Azure AI Foundry tracing, Application Insights
Spans	What happened inside one step (model call, tool, retrieval)?	Latency, tokens, input/output, status	OpenTelemetry GenAI semantic conventions
Evals	Was the result correct, grounded, safe, complete?	Scores per dimension	Azure AI Foundry evaluation

Traces: the causal tree of a run

A trace is the complete record of a single agent invocation, from the user's request to the final response. It is a tree, not a list: the root span is the agent turn, and children include each model call, each tool invocation, each retrieval, and each handoff to a sub-agent. When an agent delegates work to another agent over the A2A protocol, that delegation should appear as a linked span so the causal chain survives the network hop.
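To make the tree shape concrete, here is a minimal sketch in plain Python. The `Span` class, the span names, and `build_example_trace` are hypothetical illustrations, not the Agent Framework's or OpenTelemetry's own types; the point is only that every child carries the root's trace ID and a pointer to its parent.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass(eq=False)
class Span:
    """One node in the trace tree (hypothetical model, not an OpenTelemetry class)."""
    name: str
    trace_id: str
    parent: Span | None = field(default=None, repr=False)
    children: list[Span] = field(default_factory=list)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, name: str) -> Span:
        # A child inherits the trace ID and records its parent, which is
        # what keeps the run a tree rather than a flat list of events.
        span = Span(name=name, trace_id=self.trace_id, parent=self)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, depth-first."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


def build_example_trace() -> Span:
    """Build one illustrative agent run with a sub-agent handoff."""
    root = Span(name="agent_turn", trace_id=uuid.uuid4().hex)
    root.child("model_call:plan")
    root.child("retrieval:policy_docs")
    root.child("tool:lookup_claim")
    handoff = root.child("handoff:refund_agent")
    handoff.child("model_call:decide_refund")
    handoff.child("tool:issue_refund")
    return root


if __name__ == "__main__":
    print("\n".join(render(build_example_trace())))
```

The sub-agent's spans hang under the handoff span, so the refund tool call can be followed back to the user turn that caused it.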

Spans: where cost and latency actually live

A span is one unit of work. The discipline that matters in practice is attribute hygiene — putting the right semantic data on each span. The OpenTelemetry GenAI semantic conventions define standard attributes for this: model name, prompt and completion tokens, tool name, and operation type. Get this right and you can answer questions like "which tool call is burning 60% of our tokens" by grouping spans, instead of grepping logs.
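As a rough sketch of that grouping, assume spans have been exported as plain dictionaries of attributes. The keys `gen_ai.tool.name`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens` reflect our reading of the GenAI semantic conventions and should be checked against the current spec. Attributing token usage to the tool step that triggered it is an assumption of this example, not something the conventions prescribe.

```python
from collections import defaultdict


def token_share_by(spans, key):
    """Return each group's share of total tokens, largest first.

    `spans` is an iterable of attribute dicts; `key` is the attribute to
    group by. Spans without that attribute land in the "(none)" group.
    """
    totals = defaultdict(int)
    for attrs in spans:
        tokens = (attrs.get("gen_ai.usage.input_tokens", 0)
                  + attrs.get("gen_ai.usage.output_tokens", 0))
        totals[attrs.get(key, "(none)")] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {group: count / grand_total for group, count in ranked}


# Illustrative data only; the tool names and counts are made up.
EXAMPLE_SPANS = [
    {"gen_ai.tool.name": "search_kb",
     "gen_ai.usage.input_tokens": 5000, "gen_ai.usage.output_tokens": 1000},
    {"gen_ai.tool.name": "lookup_claim",
     "gen_ai.usage.input_tokens": 2500, "gen_ai.usage.output_tokens": 500},
    {"gen_ai.tool.name": "issue_refund",
     "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
]

if __name__ == "__main__":
    for tool, share in token_share_by(EXAMPLE_SPANS, "gen_ai.tool.name").items():
        print(f"{tool}: {share:.0%}")
```

In practice the same grouping would usually be a query in your telemetry backend. The sketch only shows that the question becomes trivial once the attributes exist.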

Evals: the only layer that measures correctness

Tracing and spans are necessary but not sufficient. A trace can show a fast, error-free run that produced a confidently wrong answer. Evals are automated judgements of output quality across dimensions such as groundedness (did it stick to the retrieved context), correctness, safety, and task completion. Run offline against a golden dataset before deployment, and continuously online against a sample of live traffic, evals convert "the agent feels good" into a number you can alert on.
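To show what a number for groundedness can look like, here is a deliberately crude rule-based scorer: the fraction of distinct answer tokens that also occur in the retrieved context. It is a stand-in we made up for illustration, not the evaluator Azure AI Foundry uses, and a production setup would more likely rely on an LLM-as-judge or a purpose-built metric.

```python
import re


def _tokens(text):
    """Lower-cased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, context):
    """Share of distinct answer tokens that also appear in the context.

    Returns a float between 0.0 and 1.0. An empty answer scores 0.0.
    """
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


if __name__ == "__main__":
    context = "Refunds are allowed within 30 days of purchase."
    print(groundedness("Refunds are allowed within 30 days.", context))
    print(groundedness("Refunds are allowed within 90 days.", context))
```

The second answer differs by a single invented number and scores lower, which is exactly the kind of fast, error-free, wrong run that a trace alone would not flag.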

Instrumenting agents with the Microsoft Agent Framework

The Microsoft Agent Framework 1.0, generally available since 3 April 2026, is an open-source framework for .NET and Python that bakes observability in rather than bolting it on. It emits OpenTelemetry-compatible traces natively, which is the single most important design decision for avoiding lock-in: the same spans flow to Azure Monitor, Application Insights, or any OTLP-compatible backend you already run. For the architectural context of how the framework structures agents, runtimes, and the A2A and MCP protocols, see our breakdown of the Agent Framework 1.0 architecture.


A practical instrumentation checklist we apply on engagements:

1. Enable framework tracing and set the OTLP exporter to your Application Insights or Azure Monitor workspace. Do not write a custom tracer; use the conventions you get for free.
2. Propagate trace context across every boundary (tool calls, MCP servers, and A2A handoffs) so a single trace ID stitches the whole distributed run together. Context propagation across MCP servers is the most common place we see traces fragment.
3. Record token and cost attributes on every model span. Cost governance is impossible after the fact if the data was never captured.
4. Add business correlation IDs (tenant, case, user) as span attributes so you can trace from a support ticket straight to the run that caused it.
5. Sample intelligently. Trace 100% of errors and a representative sample of successes; full-fidelity tracing of every token in every run gets expensive fast.

When we deployed an internal claims-triage agent for a European insurer, it was the fourth item on that checklist, business correlation IDs, that turned a two-day incident investigation into a two-minute trace lookup. The agent had double-actioned a case; the trace showed a tool call that was retried after a timeout even though the first attempt had in fact succeeded. No latency dashboard would have surfaced that.
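The lookup in that incident amounts to filtering spans by a business ID and looking for the same tool call appearing twice. A minimal sketch follows, with `app.case_id` and `app.args_hash` as hypothetical custom attributes and spans again represented as plain dictionaries.

```python
from collections import Counter


def find_duplicate_actions(spans, case_id):
    """Names of tools called more than once with identical arguments
    for one business case, sorted alphabetically."""
    calls = Counter(
        (s.get("gen_ai.tool.name"), s.get("app.args_hash"))
        for s in spans
        if s.get("app.case_id") == case_id and "gen_ai.tool.name" in s
    )
    return sorted(tool for (tool, _args), count in calls.items() if count > 1)


# Illustrative spans only: the first refund call timed out on the client
# side, was retried, and both attempts actually executed.
EXAMPLE_CASE_SPANS = [
    {"app.case_id": "C-1", "gen_ai.tool.name": "lookup_claim", "app.args_hash": "a1"},
    {"app.case_id": "C-1", "gen_ai.tool.name": "issue_refund", "app.args_hash": "b2"},
    {"app.case_id": "C-1", "gen_ai.tool.name": "issue_refund", "app.args_hash": "b2"},
    {"app.case_id": "C-2", "gen_ai.tool.name": "issue_refund", "app.args_hash": "c3"},
]

if __name__ == "__main__":
    print(find_duplicate_actions(EXAMPLE_CASE_SPANS, "C-1"))
```

Without the case ID on every span, the two refund calls would be two unrelated rows in a log.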

From traces to evals: closing the quality loop

Once telemetry flows, the next maturity step is making quality measurable. Azure AI Foundry — the central platform for building, deploying, and governing agents — supports both offline evaluation against curated datasets and continuous online evaluation against live traffic. The pattern we recommend:

A pragmatic eval pipeline

Build a golden dataset of representative inputs with known-good outputs or rubrics. Keep it version-controlled and growing from real incidents.
Define eval dimensions that map to your risk: groundedness and correctness for accuracy, safety and harmful-content checks for compliance, and task completion for usefulness.
Gate the deployment. No agent ships if eval scores regress below threshold against the golden set. Treat this exactly like a failing test suite.
Run continuous online evals on a sample of production traces, using an LLM-as-judge or rule-based scorers, and emit the scores as metrics.
Alert and roll back on score drift. A groundedness score sliding from 0.92 to 0.78 over a week is an incident, even if latency and error rates look perfect.
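The gate and the drift alert in the steps above can be sketched as two small functions. The thresholds, the window size, and the function names are assumptions chosen for illustration; none of them comes from Azure AI Foundry, and the right values depend on your own risk tolerance.

```python
from statistics import mean


def gate_failures(candidate, baseline, floors, max_regression=0.02):
    """Return the reasons a candidate build must not ship (empty list = pass).

    `candidate` and `baseline` map eval dimension -> score on the golden set;
    `floors` maps dimension -> minimum acceptable score.
    """
    failures = []
    for dim, floor in floors.items():
        score = candidate.get(dim, 0.0)
        drop = baseline.get(dim, score) - score
        if score < floor:
            failures.append(f"{dim}: {score:.2f} is below the floor {floor:.2f}")
        elif drop > max_regression:
            failures.append(f"{dim}: dropped {drop:.2f} against the baseline")
    return failures


def score_drifted(daily_scores, window=3, max_drop=0.05):
    """True if the mean of the last `window` days sits more than `max_drop`
    below the mean of the first `window` days."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two windows
    return mean(daily_scores[:window]) - mean(daily_scores[-window:]) > max_drop


if __name__ == "__main__":
    floors = {"groundedness": 0.85, "safety": 0.95}
    baseline = {"groundedness": 0.92, "safety": 0.99}
    print(gate_failures({"groundedness": 0.80, "safety": 0.99}, baseline, floors))
    print(score_drifted([0.92, 0.91, 0.90, 0.85, 0.80, 0.78]))
```

Wired into CI, a non-empty failure list would fail the build in the same way a red test suite does, and the drift check would feed an alert rule on the emitted score metric.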

This is where observability stops being a debugging convenience and becomes a control. For regulated European workloads, it is also where the compliance story lives: the EU AI Act expects providers and deployers of higher-risk systems to maintain logging, traceability, and human oversight, and DORA and NIS2 raise the bar on operational resilience and incident evidence. Eval scores and traces are precisely the evidence auditors will ask for when you have to meet these documentation obligations (Nachweispflichten). Designing observability in from day one is far cheaper than retrofitting it under an audit deadline.

Common anti-patterns

A few failure modes we see repeatedly:

Logging instead of tracing. Unstructured log lines cannot reconstruct a causal tree. You need spans with parent-child relationships.
Tracing without evals. You will have perfect visibility into a confidently wrong agent.
Capturing prompts but not redacting PII. Trace payloads often contain personal data; apply redaction at the exporter, not as an afterthought, to stay GDPR-aligned.
Per-vendor instrumentation. Wiring an agent to one proprietary observability SDK locks the telemetry in. OpenTelemetry keeps it portable.
No cost attribution. Without token-and-cost attributes on spans, your first surprise will be the bill.
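For the redaction anti-pattern, the idea of scrubbing payloads on their way out can be sketched as a function applied to span attributes just before export. The two patterns (email addresses and IBAN-like strings) and the attribute key `app.prompt` are assumptions for this example. Real redaction needs a much broader rule set or a dedicated PII detection service, and the hook where you call it depends on your exporter.

```python
import re

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_IBAN_LIKE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def redact(text):
    """Replace email addresses and IBAN-like strings with placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _IBAN_LIKE.sub("[IBAN]", text)


def redact_attributes(attrs):
    """Return a copy of span attributes with every string value redacted.

    Non-string values (token counts, durations) pass through unchanged.
    """
    return {
        key: redact(value) if isinstance(value, str) else value
        for key, value in attrs.items()
    }


if __name__ == "__main__":
    span_attrs = {
        "app.prompt": "Refund anna@example.com to DE89370400440532013000",
        "gen_ai.usage.input_tokens": 42,
    }
    print(redact_attributes(span_attrs))
```

Doing this in one place on the export path means no individual tool or prompt author has to remember it.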

Where to start

If you are moving agents from pilot to production, do these three things first: turn on OpenTelemetry-based tracing in the Agent Framework, attach token and business correlation attributes to every span, and stand up a small golden-dataset eval gate before you ship. Everything else builds on that foundation.

We help European enterprises design and operate this stack end to end — tracing, evals, cost governance, and the regulatory evidence layer on Azure AI Foundry. If you are wrestling with agents that demo well but cannot be trusted in production, our AI and data platform engineering team can help you make them observable, evaluable, and audit-ready.

FAQ

What is AI agent observability?

AI agent observability is the practice of capturing structured telemetry — traces, spans, metrics, and logs — across every step an agent takes, from the initial user request through tool calls, model invocations, and sub-agent handoffs. Unlike classic application monitoring, it must record non-deterministic reasoning, token usage, and the quality of outputs, not just latency and errors. The goal is to answer not only whether an agent ran, but whether it reasoned and acted correctly.

How is observability for AI agents different from traditional APM?

Traditional APM assumes deterministic code paths and measures latency, throughput, and error rates. Agents are non-deterministic: the same input can produce different tool calls and outputs. Agent observability adds semantic dimensions — prompt and completion content, token cost, tool selection, retrieval relevance, and eval scores — and treats correctness as a first-class signal alongside performance. You need both the operational view and the quality view.

What role does OpenTelemetry play in agent tracing?

OpenTelemetry provides the vendor-neutral standard for emitting traces and spans, and its GenAI semantic conventions define how to record model, token, and tool attributes consistently. The Microsoft Agent Framework emits OpenTelemetry-compatible traces out of the box, so you can route the same telemetry to Azure Monitor, Application Insights, or any OTLP backend without lock-in. Standardising on OpenTelemetry keeps your observability portable across clouds and tools.

What are spans in the context of an AI agent?

A span represents a single unit of work within a trace — for example one model call, one tool invocation, or one retrieval step. A trace is the full tree of spans for one agent run, showing the parent-child relationships between reasoning steps, tool calls, and sub-agent delegations. Inspecting spans lets you see exactly where latency, cost, or incorrect decisions originated inside a multi-step agent.

Why are evals necessary if I already have tracing?

Tracing tells you what happened; evals tell you whether it was good. A trace can show a perfectly fast, error-free run that nonetheless produced a wrong, unsafe, or non-compliant answer. Evals — automated scoring of correctness, groundedness, safety, and task completion — turn quality into a measurable, monitorable signal that you can gate deployments on and track over time.

How does Azure AI Foundry support agent observability?

Azure AI Foundry is the central platform for building, deploying, and governing agents, with built-in tracing, evaluation, and monitoring that integrate with Azure Monitor and Application Insights. It captures end-to-end traces of agent runs, supports both offline and continuous online evaluation, and provides cost and token telemetry for governance. Combined with the Microsoft Agent Framework, it gives a production-grade observability stack on Azure.