AI Gateway Pattern on Azure: Centralized LLM Access, Rate Limiting, and Cost Control
How to implement an AI Gateway using Azure API Management to centralize LLM access, enforce rate limits, allocate costs per team, and maintain compliance across enterprise AI workloads.
When one team experiments with Azure OpenAI, governance is simple. When ten teams build production AI features simultaneously, chaos follows: unpredictable costs, no usage visibility, inconsistent content safety policies, and no way to trace which application generated which tokens.
The AI Gateway pattern solves this by centralising LLM access through a managed proxy — typically Azure API Management. Every LLM call routes through the gateway, which enforces rate limits, tracks costs, logs requests, and balances load across deployments.
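From the application's perspective, adopting the gateway is mostly a configuration change: point the existing Azure OpenAI SDK at the APIM endpoint and attach the team's gateway credentials. Below is a minimal Python sketch under that assumption; the gateway URL, token scope, subscription-key header, and app name are illustrative placeholders that line up with the authentication and cost-allocation policies shown later in this article.
# Client-side sketch (assumed values throughout): the app calls the APIM gateway,
# never the Azure OpenAI resource directly.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Bearer token for the gateway's validate-jwt policy; the scope is a placeholder
# for whatever app registration exposes the AI.Consumer role.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "api://ai-gateway/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://ai-gateway.contoso.azure-api.net",  # APIM, not *.openai.azure.com
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",
    default_headers={
        "Ocp-Apim-Subscription-Key": "<team-subscription-key>",  # per-team APIM subscription
        "X-App-Name": "billing-portal",                          # cost-allocation dimension
    },
)

response = client.chat.completions.create(
    model="gpt-4o",  # maps to the deployment-id path parameter the gateway routes on
    messages=[{"role": "user", "content": "Hello from behind the gateway"}],
)
print(response.choices[0].message.content)
Because the request shape is unchanged, teams can adopt the gateway through endpoint and credential configuration alone.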
Why You Need an AI Gateway
Problem 1: Cost Visibility
Without a gateway, Azure OpenAI costs appear as a single line item. You cannot answer:
- Which team consumed how many tokens?
- Which application has the highest cost?
- Is that spike in spending a legitimate workload or a runaway loop?
Problem 2: Rate Limiting
Azure OpenAI has per-deployment rate limits (tokens per minute). Without a gateway:
- One team's batch job can exhaust the quota, blocking other teams
- No fair-share allocation between teams
- No protection against runaway agents consuming the entire quota
Problem 3: Compliance
Different workloads may require different content safety settings, different models, or different logging policies. Without a gateway, each team configures these independently — if at all.
Architecture
Request Flow Through the AI Gateway
A calling application sends its request to the APIM gateway instead of to Azure OpenAI directly. On the inbound path, the gateway authenticates the caller, enforces the per-team token limit, and emits cost-allocation metrics before forwarding the call to a load-balanced pool of Azure OpenAI backends; on the outbound path, it logs token consumption from the response.
APIM Policy: Token-Based Rate Limiting
<policies>
<inbound>
<!-- Authenticate the calling application -->
<validate-jwt header-name="Authorization"
failed-validation-httpcode="401">
<openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
<required-claims>
<claim name="roles" match="any">
<value>AI.Consumer</value>
</claim>
</required-claims>
</validate-jwt>
<!-- Rate limit by subscription: 100K tokens per minute -->
<azure-openai-token-limit
counter-key="@(context.Subscription.Id)"
tokens-per-minute="100000"
estimate-prompt-tokens="true"
remaining-tokens-variable-name="remainingTokens" />
<!-- Track token usage for cost allocation -->
<azure-openai-emit-token-metric
namespace="AIGateway">
<dimension name="Team" value="@(context.Subscription.Name)" />
<dimension name="Application" value="@(context.Request.Headers.GetValueOrDefault("X-App-Name", "unknown"))" />
<dimension name="Model" value="@(context.Request.MatchedParameters["deployment-id"])" />
</azure-openai-emit-token-metric>
<!-- Route to the appropriate backend -->
<set-backend-service backend-id="openai-load-balancer" />
</inbound>
<backend>
<forward-request timeout="120" />
</backend>
<outbound>
<!-- Log token consumption -->
<log-to-eventhub logger-id="ai-gateway-logger">@{
var body = context.Response.Body.As<JObject>(preserveContent: true);
return new JObject(
new JProperty("timestamp", DateTime.UtcNow),
new JProperty("team", context.Subscription.Name),
new JProperty("model", context.Request.MatchedParameters["deployment-id"]),
new JProperty("promptTokens", body?["usage"]?["prompt_tokens"]),
new JProperty("completionTokens", body?["usage"]?["completion_tokens"]),
new JProperty("totalTokens", body?["usage"]?["total_tokens"])
).ToString();
}</log-to-eventhub>
</outbound>
</policies>
Load Balancing Across Azure OpenAI Instances
<backend id="openai-load-balancer">
<load-balancer>
<backend-pool>
<backend id="openai-eastus" priority="1" weight="50" />
<backend id="openai-westeurope" priority="1" weight="50" />
<backend id="openai-swedencentral" priority="2" weight="100" />
</backend-pool>
</load-balancer>
</backend>
Priority-based routing with weight distribution. If primary backends (East US, West Europe) are saturated, traffic overflows to Sweden Central.
Circuit Breaker for Resilience
<backend id="openai-eastus">
<circuit-breaker>
<rule name="openai-breaker"
accept-retry-after="true"
trip-duration="PT30S"
min-throughput="10"
failure-condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
failure-threshold="0.5" />
</circuit-breaker>
</backend>
When 50% of requests to a backend fail (429 rate limit or 5xx errors), the circuit breaker opens for 30 seconds, routing traffic to other backends. The accept-retry-after parameter respects the Retry-After header from Azure OpenAI.
Cost Allocation Dashboard
With token metrics flowing to Log Analytics, build a cost dashboard:
// KQL: Daily token consumption per team
ApiManagementGatewayLogs
| where TimeGenerated > ago(30d)
| extend promptTokens = toint(ResponseBody.usage.prompt_tokens)
| extend completionTokens = toint(ResponseBody.usage.completion_tokens)
| summarize
TotalPromptTokens = sum(promptTokens),
TotalCompletionTokens = sum(completionTokens),
RequestCount = count()
by bin(TimeGenerated, 1d), SubscriptionName
| extend EstimatedCostEUR =
(TotalPromptTokens / 1000000.0 * 2.50) + // GPT-4o prompt rate
(TotalCompletionTokens / 1000000.0 * 10.00) // GPT-4o completion rate
| order by TimeGenerated desc, EstimatedCostEUR desc
Prompt Logging and Compliance
For regulated industries, log prompts and responses for audit:
<log-to-eventhub logger-id="prompt-audit-logger">@{
var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
var responseBody = context.Response.Body.As<JObject>(preserveContent: true);
return new JObject(
new JProperty("timestamp", DateTime.UtcNow),
new JProperty("team", context.Subscription.Name),
new JProperty("prompt", requestBody?["messages"]?.ToString()),
new JProperty("response", responseBody?["choices"]?[0]?["message"]?["content"]),
new JProperty("model", context.Request.MatchedParameters["deployment-id"])
).ToString();
}</log-to-eventhub>
Important: Prompt logging may contain PII. Ensure:
- Logs are stored in a compliant location (EU region for GDPR)
- Access to prompt logs is restricted to authorised personnel
- Retention policies align with your data governance framework
- PII detection runs on logged prompts (Azure AI Content Safety)
Semantic Caching
For repeated queries (FAQ-style), add a semantic cache to reduce token consumption:
<inbound>
<azure-openai-semantic-cache-lookup
score-threshold="0.95"
embeddings-backend-id="embedding-backend"
embeddings-backend-auth="system-assigned" />
</inbound>
<outbound>
<azure-openai-semantic-cache-store duration="3600" />
</outbound>
Semantic caching compares the embedding similarity of incoming prompts against cached responses. At a 0.95 threshold, only near-identical questions return cached results — reducing false cache hits while saving tokens on repeated queries.
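The lookup policy references an embeddings-backend-id that must point at an embeddings deployment. In the same simplified backend notation used above, a sketch of that backend might look as follows; the resource name and embedding deployment are placeholders.
<!-- Sketch (placeholder names): embeddings backend used by the semantic cache -->
<backend id="embedding-backend">
<url>https://contoso-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings</url>
</backend>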
Implementation Recommendations
- Start with APIM — Do not build a custom gateway. APIM's AI-specific policies (token limiting, semantic caching, emit metrics) handle 90% of requirements.
- One subscription per team — Clean cost allocation from day one.
- Deploy multiple Azure OpenAI instances — In different regions for resilience and to aggregate rate limits.
- Enable circuit breakers — Protect against cascade failures when one region is throttled.
- Log everything, restrict access — Full prompt logging for compliance, strict RBAC on log access.
- Set budget alerts — Configure Azure Monitor alerts when token consumption exceeds thresholds per team; a query sketch follows this list.
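As a starting point for such an alert, here is a log-query sketch that reuses the ApiManagementGatewayLogs columns from the dashboard query above; the 5-million-token daily budget is purely illustrative, and the query is meant to back an Azure Monitor log alert that fires whenever it returns rows.
// KQL sketch: teams exceeding an illustrative daily token budget
ApiManagementGatewayLogs
| where TimeGenerated > ago(1d)
| extend totalTokens = toint(ResponseBody.usage.total_tokens)
| summarize DailyTokens = sum(totalTokens) by SubscriptionName
| where DailyTokens > 5000000 // per-team daily budget, adjust to your allocation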
Need to implement an AI Gateway for your enterprise? Contact us — we help organisations centralise LLM access with cost control, compliance, and resilience built in.