AI Gateway Pattern on Azure: Centralized LLM Access, Rate Limiting, and Cost Control
How to implement an AI Gateway using Azure API Management to centralize LLM access, enforce rate limits, allocate costs per team, and maintain compliance across enterprise AI workloads.
When one team experiments with Azure OpenAI, governance is simple. When ten teams build production AI features simultaneously, chaos follows: unpredictable costs, no usage visibility, inconsistent content safety policies, and no way to trace which application generated which tokens.
The AI Gateway pattern solves this by centralising LLM access through a managed proxy — typically Azure API Management. Every LLM call routes through the gateway, which enforces rate limits, tracks costs, logs requests, and balances load across deployments.
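From the application's perspective, adopting the gateway is mostly a configuration change: point the existing Azure OpenAI SDK at the APIM endpoint and attach the team's gateway credentials. Below is a minimal Python sketch under that assumption; the gateway URL, token scope, subscription-key header, and app name are illustrative placeholders that line up with the authentication and cost-allocation policies shown later in this article.
# Client-side sketch (assumed values throughout): the app calls the APIM gateway,
# never the Azure OpenAI resource directly.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Bearer token for the gateway's validate-jwt policy; the scope is a placeholder
# for whatever app registration exposes the AI.Consumer role.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "api://ai-gateway/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://ai-gateway.contoso.azure-api.net",  # APIM, not *.openai.azure.com
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
    default_headers={
        "Ocp-Apim-Subscription-Key": "<team-subscription-key>",  # per-team APIM subscription
        "X-App-Name": "billing-portal",                          # cost-allocation dimension
    },
)

response = client.chat.completions.create(
    model="gpt-4o",  # maps to the deployment-id path parameter the gateway routes on
    messages=[{"role": "user", "content": "Hello from behind the gateway"}],
)
print(response.choices[0].message.content)
Because the request shape is unchanged, teams can adopt the gateway through endpoint and credential configuration alone.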
Why You Need an AI Gateway
Problem 1: Cost Visibility
Without a gateway, Azure OpenAI costs appear as a single line item. You cannot answer:
- Which team consumed how many tokens?
- Which application has the highest cost?
- Is that spike in spending a legitimate workload or a runaway loop?
Problem 2: Rate Limiting
Azure OpenAI has per-deployment rate limits (tokens per minute). Without a gateway:
- One team's batch job can exhaust the quota, blocking other teams
- No fair-share allocation between teams
- No protection against runaway agents consuming the entire quota
Problem 3: Compliance
Different workloads may require different content safety settings, different models, or different logging policies. Without a gateway, each team configures these independently — if at all.
Architecture
Request Flow Through the AI Gateway
A calling application sends its request to the APIM gateway instead of to Azure OpenAI directly. On the inbound path, the gateway authenticates the caller, enforces the per-team token limit, and emits cost-allocation metrics before forwarding the call to a load-balanced pool of Azure OpenAI backends; on the outbound path, it logs token consumption from the response.
APIM Policy: Token-Based Rate Limiting
<policies>
<inbound>
<!-- Authenticate the calling application -->
<validate-jwt header-name="Authorization"
failed-validation-httpcode="401">
<openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
<required-claims>
<claim name="roles" match="any">
<value>AI.Consumer</value>
</claim>
</required-claims>
</validate-jwt>
<!-- Rate limit by subscription: 100K tokens per minute -->
<azure-openai-token-limit
counter-key="@(context.Subscription.Id)"
tokens-per-minute="100000"
estimate-prompt-tokens="true"
remaining-tokens-variable-name="remainingTokens" />
<!-- Track token usage for cost allocation -->
<azure-openai-emit-token-metric
namespace="AIGateway">
<dimension name="Team" value="@(context.Subscription.Name)" />
<dimension name="Application" value="@(context.Request.Headers.GetValueOrDefault("X-App-Name", "unknown"))" />
<dimension name="Model" value="@(context.Request.MatchedParameters["deployment-id"])" />
</azure-openai-emit-token-metric>
<!-- Route to the appropriate backend -->
<set-backend-service backend-id="openai-load-balancer" />
</inbound>
<backend>
<forward-request timeout="120" />
</backend>
<outbound>
<!-- Log token consumption -->
<log-to-eventhub logger-id="ai-gateway-logger">@{
var body = context.Response.Body.As<JObject>(preserveContent: true);
return new JObject(
new JProperty("timestamp", DateTime.UtcNow),
new JProperty("team", context.Subscription.Name),
new JProperty("model", context.Request.MatchedParameters["deployment-id"]),
new JProperty("promptTokens", body?["usage"]?["prompt_tokens"]),
new JProperty("completionTokens", body?["usage"]?["completion_tokens"]),
new JProperty("totalTokens", body?["usage"]?["total_tokens"])
).ToString();
}</log-to-eventhub>
</outbound>
</policies>
Load Balancing Across Azure OpenAI Instances
<backend id="openai-load-balancer">
<load-balancer>
<backend-pool>
<backend id="openai-eastus" priority="1" weight="50" />
<backend id="openai-westeurope" priority="1" weight="50" />
<backend id="openai-swedencentral" priority="2" weight="100" />
</backend-pool>
</load-balancer>
</backend>
Priority-based routing with weight distribution. If primary backends (East US, West Europe) are saturated, traffic overflows to Sweden Central.
Circuit Breaker for Resilience
<backend id="openai-eastus">
<circuit-breaker>
<rule name="openai-breaker"
accept-retry-after="true"
trip-duration="PT30S"
min-throughput="10"
failure-condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
failure-threshold="0.5" />
</circuit-breaker>
</backend>
When 50% of requests to a backend fail (429 rate limit or 5xx errors), the circuit breaker opens for 30 seconds, routing traffic to other backends. The accept-retry-after parameter respects the Retry-After header from Azure OpenAI.
Cost Allocation Dashboard
With token metrics flowing to Log Analytics, build a cost dashboard:
// KQL: Daily token consumption per team
ApiManagementGatewayLogs
| where TimeGenerated > ago(30d)
| extend promptTokens = toint(ResponseBody.usage.prompt_tokens)
| extend completionTokens = toint(ResponseBody.usage.completion_tokens)
| summarize
TotalPromptTokens = sum(promptTokens),
TotalCompletionTokens = sum(completionTokens),
RequestCount = count()
by bin(TimeGenerated, 1d), SubscriptionName
| extend EstimatedCostEUR =
(TotalPromptTokens / 1000000.0 * 2.50) + // GPT-4o prompt rate
(TotalCompletionTokens / 1000000.0 * 10.00) // GPT-4o completion rate
| order by TimeGenerated desc, EstimatedCostEUR desc
Prompt Logging and Compliance
For regulated industries, log prompts and responses for audit:
<log-to-eventhub logger-id="prompt-audit-logger">@{
var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
var responseBody = context.Response.Body.As<JObject>(preserveContent: true);
return new JObject(
new JProperty("timestamp", DateTime.UtcNow),
new JProperty("team", context.Subscription.Name),
new JProperty("prompt", requestBody?["messages"]?.ToString()),
new JProperty("response", responseBody?["choices"]?[0]?["message"]?["content"]),
new JProperty("model", context.Request.MatchedParameters["deployment-id"])
).ToString();
}</log-to-eventhub>
Important: Prompt logging may contain PII. Ensure:
- Logs are stored in a compliant location (EU region for GDPR)
- Access to prompt logs is restricted to authorised personnel
- Retention policies align with your data governance framework
- PII detection runs on logged prompts (Azure AI Content Safety)
Semantic Caching
For repeated queries (FAQ-style), add a semantic cache to reduce token consumption:
<inbound>
<azure-openai-semantic-cache-lookup
score-threshold="0.95"
embeddings-backend-id="embedding-backend"
embeddings-backend-auth="system-assigned" />
</inbound>
<outbound>
<azure-openai-semantic-cache-store duration="3600" />
</outbound>
Semantic caching compares the embedding similarity of incoming prompts against cached responses. At a 0.95 threshold, only near-identical questions return cached results — reducing false cache hits while saving tokens on repeated queries.
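The lookup policy references an embeddings-backend-id that must point at an embeddings deployment. In the same simplified backend notation used above, a sketch of that backend might look as follows; the resource name and embedding deployment are placeholders.
<!-- Sketch (placeholder names): embeddings backend used by the semantic cache -->
<backend id="embedding-backend">
<url>https://contoso-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings</url>
</backend>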
Implementation Recommendations
- Start with APIM — Do not build a custom gateway. APIM's AI-specific policies (token limiting, semantic caching, emit metrics) handle 90% of requirements.
- One subscription per team — Clean cost allocation from day one.
- Deploy multiple Azure OpenAI instances — In different regions for resilience and to aggregate rate limits.
- Enable circuit breakers — Protect against cascade failures when one region is throttled.
- Log everything, restrict access — Full prompt logging for compliance, strict RBAC on log access.
- Set budget alerts — Configure Azure Monitor alerts when token consumption exceeds thresholds per team; a query sketch follows this list.
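As a starting point for such an alert, here is a log-query sketch that reuses the ApiManagementGatewayLogs columns from the dashboard query above; the 5-million-token daily budget is purely illustrative, and the query is meant to back an Azure Monitor log alert that fires whenever it returns rows.
// KQL sketch: teams exceeding an illustrative daily token budget
ApiManagementGatewayLogs
| where TimeGenerated > ago(1d)
| extend totalTokens = toint(ResponseBody.usage.total_tokens)
| summarize DailyTokens = sum(totalTokens) by SubscriptionName
| where DailyTokens > 5000000 // per-team daily budget, adjust to your allocation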
Need to implement an AI Gateway for your enterprise? Contact us — we help organisations centralise LLM access with cost control, compliance, and resilience built in.