
Rate Limits, Throttling, and Retries: The "Boring" Stuff That Kills AI Apps

by Vincent · last updated March 5, 2026


In traditional microservices, a 429 "Too Many Requests" error is a nuisance. In Generative AI, it is a catastrophic failure of architecture.

Most developers treat OpenAI or Anthropic APIs like just another REST endpoint. They wrap the call in a try/except block, add a simple exponential backoff, and call it a day.

This is why your AI app is failing in production.

LLM APIs are fundamentally different from standard CRUD APIs in three ways:

  • Variable Latency: A request can take 200ms or 45 seconds.
  • Variable Cost: One request can consume 10 tokens; the next can consume 100,000.
  • Dual-Axis Limiting: You are limited by Requests Per Minute (RPM) and Tokens Per Minute (TPM).

Here is the technical breakdown of why the "boring" resiliency layer is actually the most critical part of your stack, and how to architect it correctly.

1. The TPM Trap (Tokens Per Minute)

Standard rate limiters use a "Leaky Bucket" algorithm based on request count. If your limit is 100 RPM, you allow 100 HTTP calls. Simple.

LLM providers enforce TPM (Tokens Per Minute). This destroys standard logic.

The Scenario

You have a limit of 10,000 TPM. You send 10 requests.

The Trap

If 9 requests are "Hello World" (10 tokens each) and the 10th request is a massive RAG context (9,000 tokens), you are suddenly at 90% capacity despite low RPM.

The Fix: Token-Aware Gateways

You cannot rely on the provider's 429 error, because by the time you receive it, you have already spent a network round trip and burned part of your rate budget. You must implement Client-Side Token Estimation. Before your application sends a request, it must:

  • Tokenize the input prompt locally (using tiktoken or similar).
  • Check the estimated cost against a local Redis-backed token bucket.
  • Reject or Queue the request internally before it ever hits the wire.
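The three steps above can be sketched as a local token bucket. This is a minimal, in-memory version for illustration; a real deployment would count tokens with `tiktoken` instead of the character heuristic below, and keep the bucket state in Redis (e.g., via a Lua script) so all service instances share one TPM budget.

```python
import time

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # In production, use tiktoken for an exact, model-specific count.
    return max(1, len(text) // 4)

class TokenBucket:
    """In-memory TPM bucket; swap for Redis-backed logic in production."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.tokens = float(tpm_limit)
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        # Refill continuously at tpm_limit tokens per 60 seconds.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(float(self.tpm_limit),
                          self.tokens + elapsed * (self.tpm_limit / 60.0))
        self.last_refill = now

    def try_acquire(self, prompt: str, max_output_tokens: int) -> bool:
        # Reserve estimated input + worst-case output tokens up front.
        self._refill()
        cost = estimate_tokens(prompt) + max_output_tokens
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # caller should queue or reject before hitting the wire
```

The key design point is that the check happens before the HTTP call: a rejected request costs microseconds locally instead of a 429 after seconds of waiting.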

2. The "Thundering Herd" of Retries

In standard web apps, if a request fails, we retry immediately or after a short delay. In AI apps, a retry is expensive—both in money and time.

If your GPT-4 call times out after 30 seconds, and you blindly retry:

  • User Experience: The user waits 30s + 30s = 1 minute. They have already abandoned the tab.
  • Resource Exhaustion: You are now doubling the load on your TPM limit for a request that likely failed due to congestion.

The Fix: Circuit Breakers & Fallbacks

Do not just retry. Fallback. If the primary model (e.g., GPT-4) is timing out or throwing 429s, your resiliency layer should trip a Circuit Breaker and instantly route traffic to a cheaper/faster model (e.g., GPT-3.5-Turbo or Claude Haiku) or a backup provider (Azure OpenAI vs. OpenAI).
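A minimal sketch of that pattern, under the assumption of a simple consecutive-failure breaker (libraries like `pybreaker` offer production-grade versions): after `threshold` consecutive failures the primary is skipped entirely, and traffic goes straight to the fallback until a cooldown expires.

```python
import time

class CircuitBreaker:
    """Trips after `threshold` consecutive failures; half-opens after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one probe request through to test recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

def call_with_fallback(primary, fallback, breaker: CircuitBreaker):
    """Route to `primary` unless its breaker is open; on any failure, fall back."""
    if breaker.available():
        try:
            result = primary()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
    return fallback()
```

Here `primary` and `fallback` are any callables, e.g. closures over a GPT-4 call and a Claude Haiku call. The user gets a slightly cheaper answer instead of a timeout.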

3. Priority Queueing: Not All Users Are Equal

In a FIFO (First-In-First-Out) queue, a free-tier user running a massive batch job can block your Enterprise CEO who just wants a quick summary.

Because LLM requests hold open connections for seconds (or minutes), head-of-line blocking is fatal.

The Fix: Semantic Priority Queues

Your rate limiter must be context-aware.

  • Tier 1 (Real-time/VIP): Skip the queue. Direct access to reserved TPM capacity.
  • Tier 2 (Standard): Standard FIFO queue.
  • Tier 3 (Batch/Background): These jobs only run when the global TPM usage is below 50%.

This requires moving your rate limiting logic out of the application code and into a dedicated AI Gateway (like Kong, specialized proxies, or custom Redis logic).
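The tiering above can be sketched with a priority heap. This is an illustrative in-process version (tier names and the 50% threshold come from the list above); a real gateway would back this with Redis sorted sets or a proper queueing system so it survives restarts and spans instances.

```python
import heapq
import itertools

TIER_VIP, TIER_STANDARD, TIER_BATCH = 1, 2, 3

class PriorityScheduler:
    """Tiered scheduler: VIP skips the line, standard is FIFO,
    batch only dispatches when global TPM usage is below 50%."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a tier

    def submit(self, tier: int, request) -> None:
        heapq.heappush(self._heap, (tier, next(self._counter), request))

    def next_request(self, tpm_usage_ratio: float):
        if not self._heap:
            return None
        tier, _, request = self._heap[0]
        if tier == TIER_BATCH and tpm_usage_ratio >= 0.5:
            return None  # batch work waits until the cluster quiets down
        heapq.heappop(self._heap)
        return request
```

Because the heap orders by `(tier, arrival_order)`, a VIP request submitted last is still dispatched first, while batch jobs are held back whenever the TPM gauge reads 50% or higher.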

4. The Ultimate Rate Limit Fix: Semantic Caching

The best way to handle a rate limit is to never make the request.

Standard caching (URL-based) doesn't work for LLMs because prompts vary slightly. "Who is the CEO of Apple?" and "Who is Apple's CEO?" are different strings but the same semantic query.

The Fix: Vector-Based Caching

  • Embed the incoming user query.
  • Search your Vector DB (Redis/Pinecone) for a similar query (Similarity > 0.95).
  • If found, return the cached LLM response.

This effectively turns your expensive, slow, rate-limited LLM calls into instant, free database lookups for common queries.
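A minimal sketch of the lookup path, assuming you already have an embedding function: the `toy_embed` below is a stand-in bag-of-words vectorizer purely for illustration, and the linear scan stands in for a real vector index (Redis, Pinecone). In production you would embed with a real model and delegate the similarity search to the vector DB.

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Linear-scan semantic cache; a vector DB replaces the scan in production."""

    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query: str):
        q = self.embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # cache hit: the LLM is never called
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

# Stand-in embedding over a tiny fixed vocabulary (illustration only).
VOCAB = ["who", "is", "the", "ceo", "of", "apple", "what", "color", "database"]

def toy_embed(text: str):
    words = text.lower().strip("?").split()
    return [1.0 if w in words else 0.0 for w in VOCAB]
```

With a real embedding model, paraphrases like "Who is Apple's CEO?" land above the 0.95 threshold and hit the cache; the threshold is the knob that trades freshness against hit rate.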

Summary: The "Gateway" Pattern

If you are calling openai.chat.completions.create directly from your backend services, you are doing it wrong.

You need a centralized Egress Gateway that handles:

  • Unified Rate Limiting: Managing TPM across all your services.
  • Provider Load Balancing: Spreading load across multiple API keys or regions.
  • Automatic Retries & Fallbacks: Handling the "boring" stuff so your application developers can focus on the prompts.
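Tying the three responsibilities together, here is a deliberately tiny gateway sketch: it rotates API keys round-robin and falls through an ordered provider list on failure. The provider callables and key names are hypothetical; a real gateway would also wire in the token bucket, priority queue, and circuit breakers from the sections above.

```python
import itertools

class EgressGateway:
    """Minimal egress gateway sketch: round-robin key rotation plus
    ordered provider fallback. Production versions add TPM accounting,
    queueing, and per-provider circuit breakers."""

    def __init__(self, api_keys, providers):
        self._keys = itertools.cycle(api_keys)  # spread load across keys
        self._providers = providers             # ordered: primary first

    def complete(self, prompt: str):
        key = next(self._keys)
        last_error = None
        for provider in self._providers:
            try:
                return provider(prompt, api_key=key)
            except Exception as exc:  # 429 / timeout: try the next provider
                last_error = exc
        raise RuntimeError("all providers failed") from last_error
```

Application code only ever calls `gateway.complete(...)`; which provider, key, or region served the request is invisible to it, which is exactly the point.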

Reliability in the AI era isn't about writing better code; it's about building better plumbing.

