AI & Data · 5 min read

AI Gateway Pattern on Azure: Centralized LLM Access, Rate Limiting, and Cost Control

How to implement an AI Gateway using Azure API Management to centralize LLM access, enforce rate limits, allocate costs per team, and maintain compliance across enterprise AI workloads.


When one team experiments with Azure OpenAI, governance is simple. When ten teams build production AI features simultaneously, chaos follows: unpredictable costs, no usage visibility, inconsistent content safety policies, and no way to trace which application generated which tokens.

The AI Gateway pattern solves this by centralising LLM access through a managed proxy — typically Azure API Management. Every LLM call routes through the gateway, which enforces rate limits, tracks costs, logs requests, and balances load across deployments.

Why You Need an AI Gateway

Problem 1: Cost Visibility

Without a gateway, Azure OpenAI costs appear as a single line item. You cannot answer:

  • Which team consumed how many tokens?
  • Which application has the highest cost?
  • Is that spike in spending a legitimate workload or a runaway loop?

Problem 2: Rate Limiting

Azure OpenAI has per-deployment rate limits (tokens per minute). Without a gateway:

  • One team's batch job can exhaust the quota, blocking other teams
  • No fair-share allocation between teams
  • No protection against runaway agents consuming the entire quota

Problem 3: Compliance

Different workloads may require different content safety settings, different models, or different logging policies. Without a gateway, each team configures these independently — if at all.
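With a gateway, these requirements become per-product policy rather than per-team convention. A minimal sketch, assuming two APIM products and reusing the azure-openai-token-limit policy from this post (the product split, the limits, and the gpt-4o-eu deployment name are illustrative):

XML
<!-- Policy scoped to an "ai-regulated" product: tighter limits, approved models only -->
<inbound>
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="20000"
        estimate-prompt-tokens="true" />
    <!-- Reject calls to deployments not approved for this workload -->
    <choose>
        <when condition="@(context.Request.MatchedParameters["deployment-id"] != "gpt-4o-eu")">
            <return-response>
                <set-status code="403" reason="Model not approved for this workload" />
            </return-response>
        </when>
    </choose>
</inbound>

Experimental teams get a separate product with looser limits, while the gateway keeps both under one audit trail.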

Architecture

(Diagram: AI Gateway architecture)

Request Flow Through the AI Gateway

(Diagram: request flow through the AI Gateway)

APIM Policy: Token-Based Rate Limiting

XML
<policies>
    <inbound>
        <!-- Authenticate the calling application -->
        <validate-jwt header-name="Authorization" 
                      failed-validation-httpcode="401">
            <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
            <required-claims>
                <claim name="roles" match="any">
                    <value>AI.Consumer</value>
                </claim>
            </required-claims>
        </validate-jwt>
        
        <!-- Rate limit by subscription: 100K tokens per minute -->
        <azure-openai-token-limit 
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="100000"
            estimate-prompt-tokens="true"
            remaining-tokens-variable-name="remainingTokens" />
        
        <!-- Track token usage for cost allocation -->
        <azure-openai-emit-token-metric 
            namespace="AIGateway">
            <dimension name="Team" value="@(context.Subscription.Name)" />
            <dimension name="Application" value="@(context.Request.Headers.GetValueOrDefault("X-App-Name", "unknown"))" />
            <dimension name="Model" value="@(context.Request.MatchedParameters["deployment-id"])" />
        </azure-openai-emit-token-metric>
        
        <!-- Route to the appropriate backend -->
        <set-backend-service backend-id="openai-load-balancer" />
    </inbound>
    
    <backend>
        <forward-request timeout="120" />
    </backend>
    
    <outbound>
        <!-- Log token consumption -->
        <log-to-eventhub logger-id="ai-gateway-logger">@{
            var body = context.Response.Body.As<JObject>(preserveContent: true);
            return new JObject(
                new JProperty("timestamp", DateTime.UtcNow),
                new JProperty("team", context.Subscription.Name),
                new JProperty("model", context.Request.MatchedParameters["deployment-id"]),
                new JProperty("promptTokens", body?["usage"]?["prompt_tokens"]),
                new JProperty("completionTokens", body?["usage"]?["completion_tokens"]),
                new JProperty("totalTokens", body?["usage"]?["total_tokens"])
            ).ToString();
        }</log-to-eventhub>
    </outbound>
</policies>

Load Balancing Across Azure OpenAI Instances

XML
<backend id="openai-load-balancer">
    <load-balancer>
        <backend-pool>
            <backend id="openai-eastus" priority="1" weight="50" />
            <backend id="openai-westeurope" priority="1" weight="50" />
            <backend id="openai-swedencentral" priority="2" weight="100" />
        </backend-pool>
    </load-balancer>
</backend>

Priority-based routing with weight distribution: requests split 50/50 across the two priority-1 backends. If both primary backends (East US, West Europe) are saturated or unavailable, traffic overflows to the priority-2 backend in Sweden Central.
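One detail the pool does not show is how the gateway authenticates to each Azure OpenAI instance. Rather than distributing API keys to applications, APIM can attach its own Entra ID token using the authentication-managed-identity policy in the inbound section (a sketch; it sits alongside the set-backend-service policy shown earlier):

XML
<inbound>
    <!-- Acquire an Entra ID token for Azure OpenAI with APIM's managed identity -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
    <set-backend-service backend-id="openai-load-balancer" />
</inbound>

This keeps Azure OpenAI keys out of application code entirely; the gateway's managed identity needs the Cognitive Services OpenAI User role on each instance.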

Circuit Breaker for Resilience

XML
<backend id="openai-eastus">
    <circuit-breaker>
        <rule name="openai-breaker"
              accept-retry-after="true"
              trip-duration="PT30S"
              min-throughput="10"
              failure-condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
              failure-threshold="0.5" />
    </circuit-breaker>
</backend>

When at least 50% of requests to a backend fail (429 rate limit or 5xx errors), measured once the backend has seen the min-throughput of 10 requests, the circuit breaker opens for 30 seconds and traffic routes to the other backends. The accept-retry-after parameter respects the Retry-After header from Azure OpenAI.
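A circuit breaker reacts only after failures accumulate. To let an individual request fail over immediately, pair it with a retry policy in the backend section, which re-forwards the request when a backend answers 429 or 5xx (a hedged sketch; the count and interval are illustrative):

XML
<backend>
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
           count="2" interval="1" first-fast-retry="true">
        <!-- Buffer the body so it can be re-sent on retry -->
        <forward-request buffer-request-body="true" timeout="120" />
    </retry>
</backend>

With the load balancer in place, a retried request can land on a different healthy backend in the pool.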

Cost Allocation Dashboard

With token metrics flowing to Log Analytics, build a cost dashboard:

Kusto
// KQL: Daily token consumption per team
ApiManagementGatewayLogs
| where TimeGenerated > ago(30d)
| extend promptTokens = toint(ResponseBody.usage.prompt_tokens)
| extend completionTokens = toint(ResponseBody.usage.completion_tokens)
| summarize 
    TotalPromptTokens = sum(promptTokens),
    TotalCompletionTokens = sum(completionTokens),
    RequestCount = count()
    by bin(TimeGenerated, 1d), SubscriptionName
| extend EstimatedCostEUR = 
    (TotalPromptTokens / 1000000.0 * 2.50) +   // GPT-4o prompt rate
    (TotalCompletionTokens / 1000000.0 * 10.00)  // GPT-4o completion rate
| order by TimeGenerated desc, EstimatedCostEUR desc

Prompt Logging and Compliance

For regulated industries, log prompts and responses for audit:

XML
<log-to-eventhub logger-id="prompt-audit-logger">@{
    var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
    var responseBody = context.Response.Body.As<JObject>(preserveContent: true);
    
    return new JObject(
        new JProperty("timestamp", DateTime.UtcNow),
        new JProperty("team", context.Subscription.Name),
        new JProperty("prompt", requestBody?["messages"]?.ToString()),
        new JProperty("response", responseBody?["choices"]?[0]?["message"]?["content"]),
        new JProperty("model", context.Request.MatchedParameters["deployment-id"])
    ).ToString();
}</log-to-eventhub>

Important: Prompt logging may contain PII. Ensure:

  • Logs are stored in a compliant location (EU region for GDPR)
  • Access to prompt logs is restricted to authorised personnel
  • Retention policies align with your data governance framework
  • PII detection runs on logged prompts (Azure AI Content Safety)
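These controls are complementary, and a cheap defence-in-depth layer can live in the logging expression itself: redact obvious patterns before the payload ever leaves the gateway. A minimal sketch that masks email-like strings (the regex is illustrative and no substitute for a proper PII-detection service):

XML
<log-to-eventhub logger-id="prompt-audit-logger">@{
    var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
    var prompt = requestBody?["messages"]?.ToString() ?? "";
    // Mask email-like strings before the payload leaves the gateway
    var redacted = System.Text.RegularExpressions.Regex.Replace(
        prompt, @"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]");
    return new JObject(
        new JProperty("timestamp", DateTime.UtcNow),
        new JProperty("team", context.Subscription.Name),
        new JProperty("prompt", redacted)
    ).ToString();
}</log-to-eventhub>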

Semantic Caching

For repeated queries (FAQ-style), add a semantic cache to reduce token consumption:

XML
<inbound>
    <azure-openai-semantic-cache-lookup 
        score-threshold="0.95"
        embeddings-backend-id="embedding-backend"
        embeddings-backend-auth="system-assigned" />
</inbound>

<outbound>
    <azure-openai-semantic-cache-store duration="3600" />
</outbound>

Semantic caching compares the embedding similarity of incoming prompts against cached responses. At a 0.95 threshold, only near-identical questions return cached results — reducing false cache hits while saving tokens on repeated queries.

Implementation Recommendations

  1. Start with APIM — Do not build a custom gateway. APIM's AI-specific policies (token limiting, semantic caching, emit metrics) handle 90% of requirements.
  2. One subscription per team — Clean cost allocation from day one.
  3. Deploy multiple Azure OpenAI instances — In different regions for resilience and to aggregate rate limits.
  4. Enable circuit breakers — Protect against cascade failures when one region is throttled.
  5. Log everything, restrict access — Full prompt logging for compliance, strict RBAC on log access.
  6. Set budget alerts — Configure Azure Monitor alerts when token consumption exceeds thresholds per team.

Need to implement an AI Gateway for your enterprise? Contact us — we help organisations centralise LLM access with cost control, compliance, and resilience built in.

Topics

AI gateway pattern · Azure API Management LLM · LLM rate limiting · AI cost allocation · enterprise AI governance

Frequently Asked Questions

What is an AI Gateway?

An AI Gateway is a centralised proxy that sits between your applications and LLM endpoints (Azure OpenAI, etc.). It provides rate limiting, cost allocation, usage analytics, prompt logging, load balancing across model deployments, and compliance controls — similar to what an API gateway does for REST APIs.
