AI Agent Evaluation: Building a Testing Harness
How to build an evaluation and testing harness for AI agents — LLM evals, regression suites, golden datasets, tracing and CI gates that move agents from pilot to production.
Every enterprise we talk to in 2026 has the same problem. They have a dozen agent pilots that demo beautifully and a board that wants them in production. The gap between those two states is not model quality. It is the absence of a way to answer one deceptively simple question: is this agent actually good enough, and will it stay good enough after the next change?
That question is what an AI agent evaluation harness exists to answer. This post is the playbook we use at CC Conceptualise to build one: the architecture, the scoring strategy, the CI integration, and the governance evidence it produces. It assumes you have read our companion pieces on the Microsoft Agent Framework 1.0 architecture and agent-to-agent (A2A) protocol patterns; evaluation is the discipline that makes those patterns safe to ship.
TL;DR / Key takeaways
- An agent testing harness is automated machinery that scores agent behaviour against curated datasets and gates deployment in CI — it is the single biggest difference between a pilot and a production system.
- Evaluate the trajectory (tool calls, reasoning steps, retrievals), not just the final answer. Most agent failures are bad tool selection or context, not bad prose.
- Combine three scorer types: deterministic assertions, LLM-as-judge with calibration, and human review for high-risk cases. Never rely on a single one.
- Treat your golden dataset as a versioned, growing asset. Every production incident becomes a permanent regression test case.
- A harness is also a compliance instrument: it produces the documented testing and monitoring evidence the EU AI Act and DORA expect.
Why agents break the old testing model
Traditional software testing rests on determinism. Given input X, assert output Y. Agents demolish that assumption. The same prompt can yield different wording, a different tool order, or a different retrieval set on every run, and several of those outcomes may be equally correct. A test that asserts byte-for-byte equality is either permanently red or so loose it catches nothing.
The defining shift of 2026 is pilots-to-production, and the work that shift demands is unglamorous: reliability, observability, security, cost governance, and above all LLM evals. Microsoft's Agent Framework reaching general availability in April 2026, with Azure AI Foundry as the platform for building and governing agents, removed most of the runtime excuses. What remains is the engineering discipline of proving an agent works and keeping it working.
There are four failure surfaces a harness has to cover, and final-answer quality is only one of them:
| Failure surface | What goes wrong | What the harness checks |
|---|---|---|
| Final output | Wrong, unfaithful, or unsafe answer | Correctness, faithfulness, groundedness, safety |
| Trajectory | Wrong tool, wrong order, missing step | Tool-selection accuracy, step assertions, loop detection |
| Retrieval / context | Irrelevant or stale context fed to the model | Context relevance, recall, citation correctness |
| Operational | Too slow, too expensive, flaky | Latency, token cost, error rate, pass-rate stability |
Most teams instrument only the first row. In our delivery experience the costly production incidents almost always originate in the middle two — an agent that quietly picks the wrong tool or grounds its answer in the wrong document.
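To make the trajectory row concrete, here is a minimal sketch of two trajectory checks: position-wise tool-selection accuracy and a simple loop detector. The function names, the `(tool, arguments)` tuple shape, and the `max_repeats` default are our illustrative assumptions, not part of any framework.

```python
from typing import List, Tuple


def tool_selection_accuracy(expected: List[str], actual: List[str]) -> float:
    """Fraction of expected steps where the agent picked the expected tool
    at the same position. Missing steps count as misses."""
    if not expected:
        return 1.0
    matches = sum(1 for want, got in zip(expected, actual) if want == got)
    return matches / len(expected)


def has_loop(calls: List[Tuple[str, str]], max_repeats: int = 3) -> bool:
    """True if the same (tool, serialized arguments) call appears more than
    max_repeats times in a row. The threshold is an assumption to tune."""
    run_length = 0
    previous = None
    for call in calls:
        run_length = run_length + 1 if call == previous else 1
        if run_length > max_repeats:
            return True
        previous = call
    return False


# Hypothetical tool names, for illustration only.
expected_tools = ["lookup_order", "refund_tool"]
actual_tools = ["lookup_order", "send_email"]
print(tool_selection_accuracy(expected_tools, actual_tools))  # 0.5
```

Position-wise matching is deliberately strict. Where several tool orders are equally valid, a set-based comparison or a rubric is probably the better fit.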
Anatomy of the harness
A good harness has five components. Keep them decoupled so you can swap models, scorers, or runtimes without rewriting everything.
1. The dataset layer
This is the foundation. A golden dataset is a versioned collection of test cases, each pairing an input (and any required state) with either an expected outcome or a rubric. Seed it from four sources:
- Representative traffic — real, anonymised inputs that reflect normal use.
- Hard edge cases — ambiguous, multi-step, or long-context inputs.
- Adversarial cases — prompt injection, jailbreaks, out-of-scope requests.
- Regression cases — every past production failure, captured verbatim.
Store it as data, not code (JSONL or a table), and version it in the repository alongside the agent. The discipline that matters most: every time production surfaces a new failure mode, you add a case. The suite hardens monotonically.
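As an illustration of "data, not code", the sketch below parses a JSONL golden dataset and rejects malformed entries. The field names (`case_id`, `source`, `input`, `expected`, `rubric`) and the sample cases are assumptions we chose for this example; adapt them to your own schema.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import List, Optional

# The four seed sources described above (labels are our own shorthand).
VALID_SOURCES = {"traffic", "edge", "adversarial", "regression"}


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    source: str
    input: str
    expected: Optional[str] = None  # exact expected outcome, if any
    rubric: Optional[str] = None    # rubric text for judge or human scoring


def parse_golden_cases(jsonl_text: str) -> List[GoldenCase]:
    """Parse one JSON object per line and validate each case."""
    cases: List[GoldenCase] = []
    seen_ids = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = GoldenCase(**json.loads(line))
        if case.source not in VALID_SOURCES:
            raise ValueError(f"line {line_no}: unknown source {case.source!r}")
        if case.expected is None and case.rubric is None:
            raise ValueError(f"line {line_no}: needs an expected outcome or a rubric")
        if case.case_id in seen_ids:
            raise ValueError(f"line {line_no}: duplicate case_id {case.case_id!r}")
        seen_ids.add(case.case_id)
        cases.append(case)
    return cases


# Hypothetical sample data, for illustration only.
SAMPLE_JSONL = """\
{"case_id": "traffic-001", "source": "traffic", "input": "Where is my order?", "expected": "order_status"}
{"case_id": "regression-014", "source": "regression", "input": "Refund order A-1 twice", "rubric": "Must call refund_tool at most once."}
"""

for case in parse_golden_cases(SAMPLE_JSONL):
    print(case.case_id, case.source)
```

Validating at load time means a broken dataset commit fails fast in CI instead of silently shrinking the suite.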
2. The runner
The runner executes the agent against each dataset entry in an isolated, reproducible environment. Pin model versions, temperatures, system prompts, and tool definitions. Capture the full trace of every run — inputs, intermediate reasoning, every tool call and result, the final output, latency and token counts. This trace is what you score, and later what you hand to auditors.
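A runner along these lines might look like the sketch below. Everything here is hypothetical scaffolding: `RunConfig`, `Trace`, the `agent` callable signature, and the stub agent stand in for whatever runtime you actually use, and intermediate reasoning capture is omitted for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Everything that must be pinned for a run to be reproducible."""
    model: str
    temperature: float
    prompt_version: str
    tools_version: str


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    result: Any


@dataclass
class Trace:
    case_id: str
    config: RunConfig
    input: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    output: str = ""
    latency_s: float = 0.0
    total_tokens: int = 0


# Assumed agent interface: (input, config, tool recorder) -> (output, tokens).
Agent = Callable[[str, RunConfig, Callable[[str, Dict[str, Any], Any], None]], Tuple[str, int]]


def run_case(case_id: str, user_input: str, agent: Agent, config: RunConfig) -> Trace:
    """Execute one dataset entry and capture the full trace."""
    trace = Trace(case_id=case_id, config=config, input=user_input)

    def record_tool(name: str, arguments: Dict[str, Any], result: Any) -> None:
        trace.tool_calls.append(ToolCall(name, arguments, result))

    started = time.perf_counter()
    trace.output, trace.total_tokens = agent(user_input, config, record_tool)
    trace.latency_s = time.perf_counter() - started
    return trace


def stub_agent(user_input, config, record_tool):
    """Stand-in agent so the sketch runs without a model behind it."""
    record_tool("lookup_order", {"order_id": "A-1"}, {"status": "delivered"})
    return "Your order was delivered.", 42


config = RunConfig(model="pinned-model-id", temperature=0.0, prompt_version="v1", tools_version="v1")
trace = run_case("traffic-001", "Where is my order?", stub_agent, config)
print(trace.tool_calls[0].name, trace.total_tokens)
```

Storing the frozen `RunConfig` inside each trace ties every score to the exact model and prompt version that produced it.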
3. The scorers
This is where judgement lives. Use three layers:
- Deterministic checks: schema validation, regex, "did it call `refund_tool` exactly once", "did the SQL it generated parse and run read-only". Cheap, fast, unambiguous. Use these for anything safety- or compliance-critical.
- LLM-as-judge: a model scoring faithfulness, helpfulness, tone, or rubric adherence at scale. Powerful but fallible; calibrate it (see below).
- Human-in-the-loop: periodic expert review of a sample, both to spot-check the judge and to label new ground truth.
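The deterministic layer is the easiest to sketch. The two checks below assume a trace has already been reduced to a list of tool names and a raw output string; the helper names and the required field set are our own illustrative choices.

```python
import json
from collections import Counter
from typing import Iterable, Set


def called_exactly_once(tool_names: Iterable[str], tool: str) -> bool:
    """True if the named tool appears exactly once in the trajectory."""
    return Counter(tool_names)[tool] == 1


def output_has_fields(raw_output: str, required: Set[str]) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required.issubset(payload.keys())


# Hypothetical trajectory and output, for illustration only.
trajectory = ["lookup_order", "refund_tool"]
print(called_exactly_once(trajectory, "refund_tool"))  # True
print(output_has_fields('{"refund_id": "r-1", "amount": 10}', {"refund_id", "amount"}))  # True
```

Checks like these return plain booleans, which keeps them trivially auditable and cheap enough to run on every pull request.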
4. The gate
Scores feed a policy that decides pass or fail. Define thresholds per metric and per environment — staging may accept 92% correctness, production demands 98% with zero safety violations. The gate runs in CI so no agent change merges without passing agent quality assurance.
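A gate policy can be as small as the sketch below. The 92% and 98% correctness thresholds come from the example above; the zero-violation limit for staging is our assumption, as are the `GatePolicy` and `evaluate_gate` names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GatePolicy:
    min_correctness: float
    max_safety_violations: int


POLICIES = {
    # Staging safety limit is an assumption; the text only fixes it for production.
    "staging": GatePolicy(min_correctness=0.92, max_safety_violations=0),
    "production": GatePolicy(min_correctness=0.98, max_safety_violations=0),
}


def evaluate_gate(environment: str, correctness: float, safety_violations: int) -> Tuple[bool, List[str]]:
    """Return (passed, reasons). Reasons are empty when the gate passes."""
    policy = POLICIES[environment]
    failures: List[str] = []
    if correctness < policy.min_correctness:
        failures.append(
            f"correctness {correctness:.2%} below {policy.min_correctness:.2%}"
        )
    if safety_violations > policy.max_safety_violations:
        failures.append(
            f"{safety_violations} safety violation(s), limit is {policy.max_safety_violations}"
        )
    return (not failures, failures)


passed, reasons = evaluate_gate("production", correctness=0.95, safety_violations=0)
print(passed, reasons)
```

In a CI job, a failing gate would typically end the step with a non-zero exit code so the merge is blocked, and the reasons list becomes the message the developer sees.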
5. Observability and tracing
The same instrumentation you use offline must run in production. Distributed tracing of agent runs lets you sample live traffic, detect drift, and feed fresh failures straight back into the dataset. Azure AI Foundry's tracing and evaluation SDK make this loop concrete on Azure; the pattern itself is platform-agnostic.
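One platform-neutral way to sketch the sampling half of this loop is hash-based sampling, which makes the same trace ID always yield the same decision. The trace dictionary keys and both helper names below are hypothetical; a real tracing backend will have its own schema and sampling controls.

```python
import hashlib
from typing import Dict


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically sample roughly `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def to_regression_case(trace: Dict[str, str]) -> Dict[str, str]:
    """Turn a failed production trace into a golden dataset entry."""
    return {
        "case_id": f"regression-{trace['trace_id']}",
        "source": "regression",
        "input": trace["input"],        # captured verbatim
        "rubric": trace["failure_note"],  # what a reviewer said went wrong
    }


# Hypothetical failed trace, for illustration only.
failed = {
    "trace_id": "t-123",
    "input": "Refund order A-1 twice",
    "failure_note": "Must call refund_tool at most once.",
}
print(to_regression_case(failed)["case_id"])  # regression-t-123
```

Deterministic sampling is useful when several services see the same trace: each makes the same keep-or-drop decision without coordination.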
Calibrating LLM-as-judge
LLM-as-judge is the only economically viable way to score thousands of open-ended outputs, but an uncalibrated judge is a confident liar. Our checklist:
- Hand-label a calibration set of 100–300 outputs with human experts.
- Run the judge over the same set and measure agreement (e.g. Cohen's kappa).
- Iterate the judge prompt and rubric until agreement is acceptable for the risk level.
- Pin the judge model and prompt version; a judge that silently changes invalidates your trend lines.
- Re-validate agreement whenever you change the judge or upgrade the model.
For high-risk decisions, never let the judge be the sole gate. Pair it with deterministic checks and route low-confidence scores to human review.
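For the agreement step, Cohen's kappa compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from scratch for categorical labels; the tiny label lists are made-up illustration data, far smaller than a real calibration set.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: product of each rater's label frequencies, summed.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

Raw agreement here is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance. That gap is why we measure kappa rather than simple accuracy against the human labels.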
Wiring it into CI/CD
A harness that runs manually runs rarely. Make evaluation a first-class CI stage with two tiers:
| Tier | When it runs | Scope | Budget |
|---|---|---|---|
| Smoke evals | Every pull request | 30–80 fast, deterministic cases | Seconds–minutes, low cost |
| Full regression | Pre-release / nightly | Entire golden dataset incl. judge scoring | Minutes, metered token spend |
The smoke tier keeps the inner loop fast; the full agent regression testing suite is the release gate. Crucially, track scores over time, not just pass/fail. A correctness metric drifting from 97% to 94% across three releases is a signal you want long before it becomes an incident. Tools and orchestration matter here too — agents that depend on external tools should be tested against the contracts described in our note on MCP server design for enterprise, with mocked and live variants.
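Trend tracking can start as a simple check over the stored score history. In this sketch the window of three releases mirrors the example above, while the 2-point `max_drop` threshold and the intermediate score are assumptions to tune for your own risk level.

```python
from typing import List


def score_drop(history: List[float], window: int = 3) -> float:
    """Drop from the best score in the recent window to the latest score."""
    recent = history[-window:]
    return max(recent) - recent[-1]


def is_drifting(history: List[float], window: int = 3, max_drop: float = 0.02) -> bool:
    """Flag a metric whose latest value fell more than max_drop within the window."""
    if len(history) < 2:
        return False  # not enough releases to establish a trend
    return score_drop(history, window) > max_drop


# Correctness per release, oldest first. The middle value is invented.
correctness_history = [0.97, 0.955, 0.94]
print(is_drifting(correctness_history))  # True
```

Each release in this history could pass a fixed threshold on its own; only the trend check surfaces the slide.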
The harness as a compliance instrument
For European enterprises this is not optional hygiene; it is regulatory evidence. Higher-risk AI systems under the EU AI Act are expected to demonstrate documented testing, accuracy, robustness, and ongoing post-market monitoring. Financial entities under DORA must show resilience testing and traceability. A versioned harness with stored datasets, traces, and scores answers the auditor's three questions directly: what did you test, when, and with what result?
We have delivered this for regulated clients where the evaluation repository — not a slide deck — became the primary artefact in the conformity conversation. The lesson: design the harness so its outputs are exportable, time-stamped, and tied to a specific agent and model version. Evidence you have to reconstruct after the fact is evidence you will get wrong.
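A minimal sketch of such an evidence record follows. The record fields and function names are our assumptions, not a regulatory format; the point is that the dataset hash, versions, scores, and timestamp are captured together at run time.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_evidence_record(
    agent_version: str,
    model_version: str,
    dataset_bytes: bytes,
    scores: Dict[str, float],
    now: Optional[datetime] = None,
) -> Dict[str, object]:
    """Bundle what was tested, when, and with what result."""
    timestamp = now or datetime.now(timezone.utc)
    return {
        "agent_version": agent_version,
        "model_version": model_version,
        # Hash pins the exact dataset contents used for this run.
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "scores": scores,
        "recorded_at": timestamp.isoformat(),
    }


def export_record(record: Dict[str, object]) -> str:
    """Serialize with sorted keys so exports are stable and diffable."""
    return json.dumps(record, sort_keys=True)


# Hypothetical versions and scores, for illustration only.
record = build_evidence_record(
    agent_version="agent-1.4.0",
    model_version="pinned-model-id",
    dataset_bytes=b'{"case_id": "traffic-001"}\n',
    scores={"correctness": 0.98},
)
print(export_record(record))
```

Writing one such record per evaluation run, in append-only storage, gives an audit trail that does not need to be reconstructed later.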
A pragmatic rollout in five steps
- Capture 50 golden cases from real traffic and known pain points. Do not wait for a perfect dataset; start small and grow.
- Build the runner and trace capture with pinned model and prompt versions.
- Add deterministic scorers first, then introduce a calibrated LLM judge for the open-ended dimensions.
- Set CI gates with explicit per-environment thresholds and trend tracking.
- Close the loop — pipe production traces back into the dataset weekly so the suite reflects reality.
This sequence delivers a usable gate within days and a mature, drift-aware harness within a quarter.
The bottom line
An agent without an evaluation harness is a prototype, regardless of how impressive its demo is. The harness — datasets, runner, layered scorers, CI gates, and live tracing — is what converts a non-deterministic model into software you can govern, ship, and defend to a regulator. The tooling in 2026 is mature enough that there is no excuse left; what remains is the engineering discipline to build the loop and feed it.
If you are moving agents from pilot to production and want a harness that doubles as your compliance evidence, our AI and data platform engineering team can help you design and stand it up. No body shop — a strategic engineering partner that has delivered this.
FAQ
What is an AI agent evaluation harness?
An evaluation harness is the automated machinery that scores an agent's behaviour against a curated dataset of inputs and expected outcomes. It combines deterministic checks, LLM-as-judge scoring, trajectory analysis and tool-call assertions, then gates deployments in CI. Without one, you are shipping non-deterministic software on vibes.
How is agent evaluation different from traditional software testing?
Traditional tests assert exact outputs; agents are non-deterministic, so the same input can produce different but equally valid responses. Evaluation therefore scores quality across many dimensions — correctness, faithfulness, tool selection, latency and cost — and reasons about acceptable ranges and pass rates rather than byte-for-byte equality. You also evaluate the trajectory, not just the final answer.
What should a golden dataset for agent evaluation contain?
It should contain representative real-world inputs, known-hard edge cases, adversarial prompts, and regression cases captured from past incidents. Each entry pairs an input with expected outcomes or rubric criteria. We version it alongside code and grow it every time production surfaces a failure, so the suite hardens over time.
Can I trust LLM-as-judge scoring?
LLM-as-judge is useful and scalable but imperfect. Calibrate judges against human-labelled samples, measure agreement, pin model and prompt versions, and reserve deterministic checks for anything safety- or compliance-critical. Treat judge scores as signals within a broader harness, never as the sole gate for high-risk decisions.
How does agent evaluation support EU AI Act and DORA compliance?
For higher-risk systems the EU AI Act expects documented testing, accuracy and robustness evidence, and ongoing monitoring; DORA expects resilience testing and traceability for financial entities. A versioned harness with stored traces and scores produces exactly this evidence — a defensible audit trail of what was tested, when, and with what result.
Where does Azure AI Foundry fit into an evaluation harness?
Azure AI Foundry provides built-in evaluators, tracing and an evaluation SDK that you can wire into CI, and it integrates with the Microsoft Agent Framework that reached GA in April 2026. It is a strong default for Microsoft-centric estates, but the harness pattern — datasets, scorers, gates, observability — is platform-agnostic and should outlive any single tool.