Azure AI Foundry Production Readiness Checklist

Building an agent that works in a demo takes an afternoon. Building one you can put in front of customers, auditors, and the board takes considerably more. The defining theme of 2026 is not whether enterprises can build AI agents (over 160,000 organisations have already deployed more than 400,000 custom agents on Copilot Studio alone). It is whether those agents survive contact with production: reliability, observability, security, evaluation, and cost governance.

This is the checklist we use at CC Conceptualise when we take an Azure AI Foundry agent from a promising pilot to a service the organisation can actually depend on. It is deliberately opinionated and grounded in what we have shipped, not in marketing decks.

TL;DR / Key takeaways

Production readiness is an operations problem, not a model problem. The blockers are observability, authorization, evaluation, and cost — not the LLM.
Azure AI Foundry is the governance plane; the Microsoft Agent Framework 1.0 is the runtime. Foundry gives you identity, tracing, evaluation, and policy around agents built on the GA framework.
No evaluation gate, no go-live. Without an automated eval suite you cannot detect quality, safety, or prompt-injection regressions before users do.
Least privilege is your primary defence against prompt injection. A scoped tool turns an injection from a breach into a non-event.
Cost governance is a launch requirement, not a clean-up task. Cap tokens, pick model tiers deliberately, and track cost-per-resolved-task.

Why pilots stall before production

The pattern is consistent across the engagements we see. A team builds an agent in Azure AI Foundry, wires up a couple of tools and a data source, demonstrates it answering questions impressively, and then stalls. The stall is rarely about the model. It is about the operational questions nobody asked during the demo: What happens when a tool times out? How do we know why the agent did what it did last Tuesday? What can this agent reach if someone smuggles instructions into a document it reads? How much is it costing per day, and is that proportionate to the value it delivers?

These are the same questions you would ask of any production service. The difference with agents is that the failure modes are less obvious and the blast radius can be larger, because an agent acts: it calls tools, writes data, and increasingly coordinates with other agents over the agent-to-agent (A2A) protocol. A readiness checklist exists to make these implicit questions explicit and to force an answer before go-live.

The five production gates

We organise the checklist into five gates. An agent does not ship until all five are green. The table below is the executive summary; the sections that follow are the detail.

Gate	Core question	Primary Azure tooling	Go/no-go signal
Reliability	Does it fail safe and predictably?	Foundry deployment configs, retry policy	SLOs defined and met under load
Observability	Can we reconstruct any decision?	Azure Monitor, App Insights, OpenTelemetry tracing	Full trace per run, alerting live
Security	What can it reach, and is input trusted?	Entra ID, managed identity, Key Vault, RBAC	Least privilege enforced, injection tested
Evaluation	Is quality measured and gated?	Azure AI Foundry evaluations	Eval suite blocks regressions in CI
Cost	Is spend bounded and proportionate?	Budgets, alerts, token metrics	Per-agent budget and alerting active

Gate 1: Reliability

An agent is a distributed system in disguise. Every model call, every tool invocation, and every A2A hop is a network call that can be slow, fail, or return garbage. Production reliability starts with treating it that way.

Define SLOs for latency and task success rate, and load-test against them — agents behave very differently under concurrency than in a single-user demo.
Set explicit timeouts and bounded retries with backoff on every tool and model call. Unbounded retries are how a transient blip becomes a cost incident.
Specify deterministic fallback behaviour. When a tool is unavailable, the agent should degrade gracefully and tell the user, not hallucinate a plausible-sounding answer.
Cap agent loops. Any autonomous planning loop needs a maximum step count so a confused agent cannot run indefinitely.
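
As a minimal sketch of the retry, fallback, and loop-cap points above, assuming plain Python rather than any Agent Framework API: the names call_with_retry, run_agent_loop, and ToolUnavailable are our own illustrations, and the wrapped function is assumed to enforce its own timeout (for example through its HTTP client).

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class ToolUnavailable(Exception):
    """Raised when a tool still fails after all attempts are used up."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn with a bounded number of attempts and exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_attempts:
                # Give up loudly instead of retrying forever.
                raise ToolUnavailable(str(exc)) from exc
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


def run_agent_loop(step: Callable[[int], Optional[str]], max_steps: int = 8) -> str:
    """Run a planning loop; step returns a final answer, or None to continue."""
    for i in range(max_steps):
        try:
            answer = step(i)
        except ToolUnavailable:
            # Deterministic fallback: say so rather than invent an answer.
            return "I could not reach a required tool, so I cannot answer right now."
        if answer is not None:
            return answer
    return "I stopped after reaching the step limit without a confident answer."
```

Only errors listed in retry_on are retried, so a bad request or a permission error fails immediately rather than burning attempts and tokens.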

Gate 2: Observability

This is the gate teams most often skip and most often regret. If you cannot answer "why did the agent do that?" you cannot operate it, debug it, or defend it to an auditor.

Instrument distributed tracing across the entire run: the prompt and its inputs, every tool call and result, each A2A and Model Context Protocol (MCP) server interaction, and token consumption. The Microsoft Agent Framework emits OpenTelemetry-compatible traces; export them to Azure Monitor and Application Insights alongside structured logs and metrics. Build dashboards for latency, error rate, token spend, and evaluation scores, and wire alerts to the on-call rotation. On one delivery, retrofitting tracing onto an already-live agent cost the client more than building it correctly would have from the start. Observability is cheap before launch and expensive afterwards.
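
To make "full trace per run" concrete, here is a rough sketch of what one record per step can look like. It is a standard-library stand-in, not the Agent Framework's OpenTelemetry output and not the OpenTelemetry API; RunTrace, span, and the attribute names are hypothetical.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class RunTrace:
    """Collects one record per step of a single agent run."""

    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Dict[str, Any]] = field(default_factory=list)

    @contextmanager
    def span(self, kind: str, name: str, **attrs: Any) -> Iterator[Dict[str, Any]]:
        record: Dict[str, Any] = {"run_id": self.run_id, "kind": kind, "name": name, **attrs}
        start = time.perf_counter()
        try:
            yield record  # the caller may attach results, e.g. token counts
            record["status"] = "ok"
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Recorded on success and on failure, so failed steps stay visible.
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self.spans.append(record)

    def total_tokens(self) -> int:
        return sum(s.get("tokens", 0) for s in self.spans)

    def to_json_lines(self) -> str:
        """One JSON object per line, ready for a structured-log pipeline."""
        return "\n".join(json.dumps(s, sort_keys=True) for s in self.spans)


trace = RunTrace()
with trace.span("model", "plan", prompt_chars=120) as s:
    s["tokens"] = 350
with trace.span("tool", "search_kb", query="refund policy") as s:
    s["result_count"] = 3
```

Every record carries the same run_id, which is what lets you reconstruct last Tuesday's decision from the logs rather than from memory.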

Gate 3: Security

Agents act on the world, which makes their permissions the centre of gravity for security.

Identity: authenticate the agent with a managed identity through Entra ID. No API keys in config, no shared secrets.
Least privilege: scope every tool and data connection to the minimum it needs. An agent that only reads a knowledge base should not hold write access to anything.
Treat all model input as untrusted: prompt injection is not theoretical. Any document, email, or web page the agent ingests can carry instructions. Least-privilege scoping is what turns a successful injection from a breach into a non-event.
Secrets: store them in Key Vault, reference them at runtime, and keep no standing credentials in the agent definition.
Tool allow-listing: the agent should only be able to invoke an explicitly approved set of tools, validated server-side.
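
The allow-listing and least-privilege items above can be sketched as a server-side gateway that sits between the model's proposed tool call and the real function. This is an illustration under our own assumptions: ToolGateway, ToolNotAllowed, and the tool names are hypothetical and not part of any Azure SDK.

```python
from typing import Any, Callable, Dict, FrozenSet


class ToolNotAllowed(PermissionError):
    """The agent asked for a tool outside its approved set."""


class ToolGateway:
    """Server-side dispatcher: the model proposes a call, the gateway decides."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: FrozenSet[str]):
        unknown = allowed - set(tools)
        if unknown:
            raise ValueError(f"allow-list names unregistered tools: {sorted(unknown)}")
        self._tools = tools
        self._allowed = allowed

    def invoke(self, tool_name: str, **kwargs: Any) -> Any:
        # The check runs here, never in the prompt, so injected text cannot bypass it.
        if tool_name not in self._allowed:
            raise ToolNotAllowed(tool_name)
        return self._tools[tool_name](**kwargs)


def search_kb(query: str) -> str:
    return f"results for {query!r}"


def delete_record(record_id: str) -> str:
    return f"deleted {record_id}"


# A read-only knowledge-base agent: delete_record exists but is not approved for it.
gateway = ToolGateway(
    tools={"search_kb": search_kb, "delete_record": delete_record},
    allowed=frozenset({"search_kb"}),
)
```

If a poisoned document convinces the model to request delete_record, the call fails at the gateway with ToolNotAllowed, which is the "non-event" outcome described above.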

Gate 4: Evaluation

Without evaluation you are flying blind on quality. Every prompt tweak, model version bump, or new tool can silently change behaviour.

Build a dataset of representative and adversarial cases and run it through Azure AI Foundry evaluations as part of CI. Score the dimensions that matter for your use case: groundedness against sources, factual correctness, safety, and resistance to injection. Establish a baseline and make the eval suite a hard release gate — any regression blocks the deployment. This is the single most effective control for keeping a production agent honest over time, and it is the one most pilots lack entirely.
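
The gate logic itself is small. The sketch below assumes the evaluation run has already produced one score per dimension, with higher meaning better; it does not call the Azure AI Foundry evaluation SDK, and find_regressions, gate_exit_code, the metric names, and the scores are illustrative only.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return one message per metric that is missing or below its baseline."""
    regressions: List[str] = []
    for metric, base_score in sorted(baseline.items()):
        score = current.get(metric)
        if score is None:
            regressions.append(f"{metric}: missing from current run")
        elif score < base_score - tolerance:
            regressions.append(f"{metric}: {score:.2f} < baseline {base_score:.2f}")
    return regressions


def gate_exit_code(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """0 lets the pipeline continue; 1 blocks the deployment."""
    failures = find_regressions(baseline, current)
    for line in failures:
        print(f"EVAL REGRESSION {line}")
    return 1 if failures else 0


# Made-up scores for illustration, not measured results.
baseline = {"groundedness": 0.92, "correctness": 0.88, "safety": 0.99, "injection_resistance": 0.95}
current = {"groundedness": 0.93, "correctness": 0.88, "safety": 0.99, "injection_resistance": 0.90}
```

In CI the final step would pass gate_exit_code(baseline, current) to sys.exit. A non-zero tolerance is an assumption you may want if your scores are noisy between runs; the default of zero matches the "any regression blocks" rule.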

Gate 5: Cost governance

Token spend scales with usage in ways that are easy to underestimate and hard to claw back after launch.

Set per-agent token budgets with alerts before, not after, the bill arrives.
Choose model tiers deliberately — route simple steps to smaller models and reserve frontier models for steps that genuinely need them.
Cache deterministic or repeated calls where it is safe to do so.
Track cost-per-resolved-task as your north-star economic metric. Spend that is not tied to delivered value is the first thing to question.
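
A rough sketch of the budget and metric items above, assuming you already collect token counts and spend per agent. AgentBudget, its 80% alert threshold, and cost_per_resolved_task are our own illustrative names and defaults, not Azure features.

```python
from dataclasses import dataclass


@dataclass
class AgentBudget:
    """Tracks token use for one agent against a hard limit."""

    token_limit: int
    alert_fraction: float = 0.8  # assumed threshold: warn at 80% of the limit
    tokens_used: int = 0

    def record(self, tokens: int) -> str:
        """Add usage and return 'ok', 'alert', or 'exceeded'."""
        self.tokens_used += tokens
        if self.tokens_used >= self.token_limit:
            return "exceeded"
        if self.tokens_used >= self.token_limit * self.alert_fraction:
            return "alert"
        return "ok"


def cost_per_resolved_task(total_cost: float, resolved_tasks: int) -> float:
    """Spend divided by tasks actually resolved, not by requests served."""
    if resolved_tasks <= 0:
        return float("inf")  # spend with nothing resolved is the worst case
    return total_cost / resolved_tasks
```

Dividing by resolved tasks rather than by requests is the design choice that matters: an agent that answers cheaply but resolves nothing looks fine on cost-per-call and terrible on this metric.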

Governance and the regulatory layer

For European enterprises the checklist does double duty. The controls above are not only good engineering — they are the evidence base for compliance. An agent that touches personal data, supports decisions, or runs inside an essential entity brings the EU AI Act, GDPR, NIS2, and DORA into scope. The observability gate produces the logging and traceability auditors expect. The security gate is your access-governance evidence. The evaluation gate produces documented proof that the system performs as claimed. Human oversight — defining which actions require approval and retaining immutable audit logs — is both a safety control and a regulatory expectation.

Our consistent advice: build this evidence in from day one. Retrofitting documentation and traceability onto a live, undocumented agent is one of the most expensive forms of technical debt we encounter, and it tends to surface at exactly the wrong moment, during an audit or an incident.
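
The human-oversight control described above can be sketched as an approval check plus a hash-chained log. This is a sketch under stated assumptions: the action names are hypothetical, and a hash chain only makes tampering detectable. Genuine immutability still needs write-once storage underneath.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

APPROVAL_REQUIRED = frozenset({"issue_refund", "delete_record"})  # hypothetical actions


class AuditLog:
    """Append-only log in which each entry's hash covers the previous entry's hash."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
        payload = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, event: Dict[str, Any]) -> None:
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        stored = dict(event)  # copy so later caller mutations do not alter the log
        self._entries.append(
            {"event": stored, "prev": prev_hash, "hash": self._digest(prev_hash, stored)}
        )

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = ""
        for entry in self._entries:
            if entry["prev"] != prev_hash or entry["hash"] != self._digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True


def execute(action: str, approved_by: Optional[str], log: AuditLog) -> bool:
    """Allow the action only if it needs no approval or a human has approved it."""
    allowed = action not in APPROVAL_REQUIRED or approved_by is not None
    log.append({"action": action, "approved_by": approved_by, "executed": allowed})
    return allowed
```

Refused attempts are logged as well as executed ones, because an auditor will usually want to see what the agent tried to do, not only what it was allowed to do.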

Putting it together

Going from pilot to production is mostly unglamorous engineering: tracing, retries, scoped permissions, an eval harness, and a budget alert. None of it is novel, and that is the point. The organisations succeeding with agents in 2026 are not the ones with the cleverest prompts; they are the ones that applied ordinary operational rigour to an extraordinary new capability. Run an agent through these five gates and you will know — with evidence, not optimism — whether it is ready.

If you want a second pair of senior eyes on a Foundry agent before go-live, our AI and data platform engineering team does exactly this kind of production hardening. No body shop, no fluff — practitioners who have shipped it.

FAQ

What does it mean for an Azure AI Foundry agent to be production-ready?

Production-ready means the agent meets the same operational bar as any other critical service: it has measurable reliability targets, end-to-end tracing, enforced authorization on every tool and data call, an automated evaluation suite that gates releases, and cost controls with budget alerts. A demo that works on a happy path is not production-ready. The 2026 shift is precisely this move from pilots to governed, observable, accountable production systems.

How is Azure AI Foundry related to the Microsoft Agent Framework?

Azure AI Foundry is the central platform for building, deploying, and governing agents, while the Microsoft Agent Framework 1.0 (generally available since 3 April 2026) is the open-source runtime for .NET and Python that you build agents with. The framework provides the A2A agent-to-agent protocol and Model Context Protocol integration; Foundry provides the hosting, identity, observability, evaluation, and governance plane around it.

Do I need evaluations before going live, or can I add them later?

You need them before go-live. Without an automated evaluation suite you have no objective way to detect quality regressions, prompt-injection susceptibility, or hallucination rates when you change a prompt, model version, or tool. Treat evaluation as a release gate, not a post-launch nice-to-have. We have repeatedly seen teams ship without it and discover regressions only through user complaints.

What are the most common reasons Foundry agents fail in production?

The recurring failures are missing observability (you cannot see why an agent did something), over-broad tool permissions that turn a prompt injection into a real breach, no evaluation gate so quality silently drifts, and uncapped token spend. None of these are model problems; they are operations and governance gaps that a readiness checklist closes.

How does this checklist relate to EU regulation such as the EU AI Act and NIS2?

Many enterprise agents touch personal data, make or support decisions, or run in essential-entity environments, which brings the EU AI Act, GDPR, NIS2, and DORA into scope. The checklist's controls — logging and traceability, human oversight, evaluation evidence, and access governance — map directly onto the documentation and risk-management obligations these regimes impose. Building them in from day one is far cheaper than retrofitting evidence later.

How long does it take to get a pilot agent to production?

For a scoped agent with a clear use case, hardening a working pilot into a governed production service typically takes a few focused weeks, dominated by observability wiring, the evaluation suite, security review, and cost controls rather than model work. The timeline grows with the number of tools, data sources, and the regulatory classification of the use case.