A weekend AI prototype is one API call. A production AI application is a system: models behind a gateway, tools and orchestration, queues for slow work, a vector store for knowledge, caching for cost, and observability so you can see what the non-deterministic thing in the middle is actually doing. This guide is an architecture map: the layers of a real AI app in 2026, what each one is for, and the decisions that matter when you move from "it works on my machine" to "it works for thousands of users."
The layers you will recognise by the end
- A model gateway layer for routing, fallback, and cost control
- Orchestration: where agents, tools, and workflows live
- Queues and workers for slow, bursty, or long-running AI work
- The data layer: vector stores, caches, and your system of record
- Observability and evaluation as first-class concerns, not afterthoughts
Info
Why a map helps
AI apps fail in the seams between layers, not inside any one box. Seeing the whole stack at once tells you where to put retries, where cost accumulates, where latency hides, and where a single bad input can cascade. Architecture is the difference between a prototype and a product.
The stack at a glance
     ┌──────────────────────────────────────────────┐
USER │ App / API layer (auth, rate limits, UX)      │
     ├──────────────────────────────────────────────┤
     │ Orchestration (agents, tools, workflows)     │
     ├───────────────┬──────────────────────────────┤
     │ Model gateway │ Queues + workers (async work)│
     │ (route/cache/ ├──────────────────────────────┤
     │  fallback)    │ Data: vector store, cache, DB│
     ├───────────────┴──────────────────────────────┤
     │ Observability + evaluation (traces, evals)   │
     └──────────────────────────────────────────────┘
Layer 1: The model gateway
Do not let your application call model providers directly from a dozen places. Put a gateway in front: a single layer that routes requests to the right model, handles authentication and retries, enforces budgets, applies caching, and, critically, fails over to an alternate model or provider when one is slow or down. This one layer buys you resilience, cost control, and the freedom to swap models without touching application code.
class ModelGateway:
    def __init__(self, primary, fallback, cache, meter):
        self.primary, self.fallback, self.cache, self.meter = primary, fallback, cache, meter

    def complete(self, req: Request) -> Response:
        if (hit := self.cache.get(req.cache_key())):  # response cache: skip the call
            return hit
        for model in (self.primary, self.fallback):  # provider fallback on failure
            try:
                resp = model.complete(req, timeout=20)
                self.cache.set(req.cache_key(), resp, ttl=3600)
                self.meter.record(req.tenant, resp.tokens, resp.cost)  # budgets + observability
                return resp
            except (ProviderTimeout, ProviderError):
                continue  # try the next provider
        raise AllModelsUnavailable()  # everything is down: fail loudly

# The whole app talks to gateway.complete(); swapping models is now a config change.
✓ Pros
- One place to enforce per-tenant cost and rate limits
- Provider/model fallback when one is degraded
- Centralised prompt caching and response caching
- Swap or A/B models behind a stable internal interface
✕ Cons
- A new component to run and make highly available
- A potential single point of failure if built carelessly
- Adds a hop of latency; keep it thin
- Tempting to over-engineer; start minimal
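The gateway code above hands usage to a `meter` object without showing it. As a minimal sketch of what per-tenant budget enforcement could look like, the class below keeps running totals in memory. The names `BudgetMeter`, `BudgetExceeded`, and `check` are assumptions made for illustration, as is the flat per-tenant dollar limit; only the `record(tenant, tokens, cost)` call shape comes from the gateway example.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a tenant has spent its whole allowance."""


class BudgetMeter:
    """Tracks per-tenant spend; exposes the record() call the gateway sketch relies on."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd        # hypothetical flat limit applied to every tenant
        self.spent = defaultdict(float)   # tenant -> dollars spent so far
        self.tokens = defaultdict(int)    # tenant -> tokens consumed so far

    def check(self, tenant: str) -> None:
        """Call before a model request; refuse once the budget is used up."""
        if self.spent[tenant] >= self.limit_usd:
            raise BudgetExceeded(f"{tenant} has spent ${self.spent[tenant]:.2f}")

    def record(self, tenant: str, tokens: int, cost: float) -> None:
        """Call after a model request to accumulate usage."""
        self.tokens[tenant] += tokens
        self.spent[tenant] += cost
```

A gateway would likely call `check()` before the provider loop and `record()` after a successful response. In a multi-process deployment the counters would need to live in shared storage such as the cache layer rather than in process memory.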
Layer 2: Orchestration
This is where the actual AI logic lives: prompt assembly, tool calling, agent loops, and multi-step workflows. Keep it explicit and testable. The key architectural call is how much autonomy you grant. A fixed, deterministic workflow (chain these steps in this order) is easier to reason about and cheaper than a free-roaming agent, and most production use cases need the former, not the latter. Reach for autonomy only when the task genuinely cannot be expressed as a fixed pipeline.
Tip
Prefer workflows to agents when you can
A deterministic chain of model calls is predictable, debuggable, and cheaper. Full agentic loops are powerful but introduce variance and cost. Start with the simplest orchestration that solves the problem and add autonomy only where it earns its keep.
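To make "deterministic chain" concrete, here is a small sketch of a fixed three-step workflow for a support ticket. The ticket scenario, the prompts, and the names `run_workflow` and `fake_model` are all invented for illustration; `call_model` stands in for whatever function wraps `gateway.complete()` in your app.

```python
from typing import Callable

# Stand-in for a call through the gateway: takes a prompt, returns text.
ModelCall = Callable[[str], str]


def run_workflow(ticket: str, call_model: ModelCall) -> dict:
    """Fixed chain: classify, then extract facts, then draft a reply.

    The order and number of model calls never change, so cost and latency
    are predictable and each step can be tested on its own.
    """
    category = call_model(f"Classify this support ticket in one word:\n{ticket}").strip().lower()
    details = call_model(f"List the key facts in this {category} ticket:\n{ticket}")
    reply = call_model(f"Draft a reply to a {category} ticket using these facts:\n{details}")
    return {"category": category, "details": details, "reply": reply}


def fake_model(prompt: str) -> str:
    """Canned responses so the sketch runs without a provider."""
    if prompt.startswith("Classify"):
        return "Billing\n"
    if prompt.startswith("List"):
        return "- charged twice in March"
    return "Sorry about the double charge; a refund is on its way."
```

Calling `run_workflow("I was charged twice", fake_model)` always makes exactly three model calls. An agent loop would decide the number of calls at run time, which is where the extra variance and cost come from.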
Layer 3: Queues and workers
AI work is slow and bursty: a generation can take many seconds, traffic spikes, and rate limits throttle you. Doing that work inside a web request is a recipe for timeouts and a terrible user experience. Push it to a queue. The request enqueues a job and returns immediately; a worker pool processes jobs, respecting provider rate limits, retrying transient failures with backoff, and streaming or notifying results when ready.
# Web request: enqueue and return immediately with 202 Accepted; never block on a model.
@app.post("/summaries")
def create_summary(doc_id: int):
    job = summarize.delay(doc_id)  # hand off to the task queue
    return {"job_id": job.id, "status": "processing"}, 202

# Worker: off the request path, rate-limited to the provider, retried with backoff.
@task(max_retries=5, retry_backoff=True, rate_limit="60/m")
def summarize(doc_id: int):
    doc = Document.get(doc_id)
    resp = gateway.complete(Request(prompt=SUMMARY_PROMPT + doc.text))  # via the gateway
    Summary.create(doc_id=doc_id, text=resp.text)
    notify(doc.owner_id, "summary_ready", doc_id)  # push the result when done
Enqueue, do not block
Accept the request, create a job, return a handle. Never tie up a web worker for 20 seconds waiting on a model.
Rate-limit at the worker
Concentrate provider rate-limit handling in the worker pool so a traffic spike queues gracefully instead of erroring out.
Retry with backoff and a dead-letter queue
Transient model/provider errors are normal. Retry sensibly, and route persistent failures somewhere visible instead of losing them.
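Most task queues provide retries and dead-lettering out of the box, so you rarely write this yourself. Still, the mechanics are worth seeing once. The sketch below is a library-free illustration under stated assumptions: `TransientError`, `run_with_retries`, and the plain list used as a dead-letter queue are hypothetical names, not any framework's API.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits, 5xx)."""


def run_with_retries(
    job: dict,
    handler: Callable[[dict], object],
    dead_letters: list,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Run handler(job), retrying transient errors with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return handler(job)
        except TransientError as exc:
            if attempt == max_retries:
                # Out of retries: park the job somewhere visible instead of dropping it.
                dead_letters.append({"job": job, "error": str(exc), "attempts": attempt + 1})
                return None
            # Base delay doubles each attempt; random jitter spreads out simultaneous retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Errors that are not `TransientError` propagate immediately, which is usually what you want for bad input that no retry will fix.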
Layer 4: The data layer
Three distinct stores serve three distinct jobs, and conflating them causes pain. Your system of record (a relational database) holds users, jobs, and results. A vector store holds embeddings for retrieval. And a cache (often Redis) holds prompt/response caches, rate-limit counters, and session state. Treat caching as a first-class cost-control mechanism: caching stable system prompts and repeated queries can cut both latency and spend substantially.
System of record (Postgres) ▸ users, conversations, jobs, audit, billing
Vector store (pgvector/Qdrant) ▸ embeddings for RAG retrieval
Cache (Redis) ▸ prompt cache, response cache, rate limits, sessions
system of record, vector store, and cache — each with a distinct job
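As one illustration of the cache's job, the sketch below derives a stable response-cache key and expires entries after a TTL. It is an in-memory stand-in, assuming a Redis-like get/set-with-expiry interface; `cache_key` and `TTLCache` are illustrative names, and the `resp:` key prefix is an arbitrary choice.

```python
import hashlib
import json
import time


def cache_key(model: str, prompt: str, **params) -> str:
    """Same model, prompt, and parameters always hash to the same key."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return "resp:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expiry timestamp)

    def set(self, key: str, value, ttl: float) -> None:
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key: str):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value
```

Including the model name and sampling parameters in the key matters: a response generated by one model or temperature should not be served for a request that asked for another.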
Layer 5: Observability and evaluation
You cannot operate a non-deterministic system you cannot see. AI observability goes beyond ordinary logging: you need traces of every model call (prompt, response, tokens, latency, cost), linked across the steps of a request, plus continuous evaluation that scores output quality over time. Without this, you discover problems from angry users instead of dashboards, and you have no way to know whether last week's prompt change made things better or worse.
✓ Must-haves
- Per-call traces: prompt, output, tokens, latency, cost, model version
- Request-level correlation across orchestration steps
- Cost and latency dashboards broken down by feature and tenant
- An evaluation pipeline that runs on every prompt/model change
✕ Non-negotiables
- No flying blind on a probabilistic system
- No prompt changes shipped without an eval to catch regressions
- No logging prompts/outputs without a PII and retention policy
- No treating cost as someone else's problem; it compounds fast
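Per-call tracing with request-level correlation can be sketched with the standard library alone. The example below is a simplified stand-in for a real tracing backend: `Span`, `traced_call`, and the module-level `SPANS` list are hypothetical names, and the sketch deliberately records metadata only, leaving prompt and output capture to whatever PII and retention policy you adopt.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str            # shared by every step of one user request
    step: str                # e.g. "retrieve", "generate"
    model: str
    latency_s: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0
    error: Optional[str] = None


SPANS: list = []  # stand-in for an exporter that ships spans to a backend


@contextmanager
def traced_call(trace_id: str, step: str, model: str):
    """Time one model call and record it, whether it succeeds or fails."""
    span = Span(trace_id, step, model)
    start = time.perf_counter()
    try:
        yield span                      # caller fills in tokens and cost
    except Exception as exc:
        span.error = repr(exc)
        raise
    finally:
        span.latency_s = time.perf_counter() - start
        SPANS.append(span)


def new_trace_id() -> str:
    """One ID per incoming request, passed down through orchestration."""
    return uuid.uuid4().hex
```

Usage would look like `with traced_call(trace_id, "generate", "chat-model") as span:` followed by setting `span.tokens` and `span.cost_usd` from the response. Grouping spans by `trace_id` then gives the cost and latency of a whole request.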
Warning
Cost is an architecture concern
In AI apps, spend is driven by design decisions: how much context you send, how many steps an agent takes, whether you cache, and which model handles which job. Bolting cost control on at the end is painful. Bake budgets, caching, and model routing into the architecture from day one.
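A little arithmetic shows how those design decisions compound. The sketch below uses made-up prices and token counts purely for illustration; the model names `small` and `large`, the `PRICES` table, and the task labels in `pick_model` are all assumptions, so substitute your provider's real rates.

```python
# Hypothetical prices in USD per million tokens; not any provider's actual rates.
PRICES = {
    "small": {"input": 0.50, "output": 2.00},
    "large": {"input": 5.00, "output": 20.00},
}

# Hypothetical set of tasks judged simple enough for the cheaper model.
EASY_TASKS = {"classify", "extract", "summarize"}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the assumed per-million-token prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def pick_model(task: str) -> str:
    """Route easy work to the cheaper model, everything else to the larger one."""
    return "small" if task in EASY_TASKS else "large"


# An 8-step agent on the large model, each step sending 6,000 tokens of context
# and producing 500 tokens, versus one call on the small model with 2,000 tokens in.
agent_run = 8 * estimate_cost("large", 6_000, 500)
single_call = estimate_cost("small", 2_000, 500)
```

Under these assumed numbers the agent run costs about $0.32 and the single call about $0.002, a gap of roughly 160x that comes entirely from step count, context size, and model choice.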
Putting it together
A request flows in through the app layer (auth, rate limit), hits orchestration (assemble context, maybe retrieve from the vector store, call tools), which talks to models through the gateway (routing, caching, fallback); slow work is offloaded to queues and workers; results land in the system of record; and every step emits traces to your observability stack, which feeds your evals. Each layer has one job, and the seams between them are where you place retries, budgets, and guardrails.
Start small, but leave the seams
You do not need every layer on day one; a prototype can collapse several into one process. But design the boundaries deliberately so you can pull them apart as you grow: a thin gateway interface, a queue for anything slow, a clean split between your three data stores, and tracing from the very first request. Retrofitting these into a tangled prototype is the expensive path, and many teams learn that the hard way.
! Common mistakes to avoid
- ✕ Calling model providers directly from all over the app.
  ✓ Route everything through one gateway for caching, budgets, retries, and provider fallback.
- ✕ Running slow model calls inside the web request.
  ✓ Enqueue them; let a worker pool do the slow work and notify or stream results.
- ✕ Reaching for a full agent when a fixed workflow would do.
  ✓ Prefer deterministic chains; they are cheaper, predictable, and easier to debug.
- ✕ Treating cost and observability as afterthoughts.
  ✓ Bake caching, budgets, and per-call tracing into the architecture from day one.
? Frequently asked questions
Do I need all these layers on day one?
No. A prototype can collapse several into one process. But design the boundaries deliberately — a thin gateway interface, a queue for slow work, separate data stores, and tracing from the first request — so you can pull them apart as you grow.
Why put a gateway in front of the models?
One place to enforce cost and rate limits, cache prompts and responses, and fail over to another model or provider when one is slow or down — plus the freedom to swap models without touching app code.
When should I use an agent versus a fixed workflow?
Default to a deterministic workflow — it is predictable, cheaper, and debuggable. Reach for an autonomous agent only when the task genuinely cannot be expressed as a fixed pipeline.
How do I control AI costs?
Cost is an architecture concern: cache stable prompts, limit context size, cap agent steps, and route easy work to cheaper models. Track per-feature and per-tenant spend so it never surprises you.
What is different about AI observability?
Beyond normal logs you need per-call traces (prompt, output, tokens, latency, cost) correlated across a request, plus continuous evaluation that scores output quality over time.
Success
The architecture is the product
With AI, the model is a commodity you can swap; your durable advantage is the system around it: reliable orchestration, smart caching, clean data, and the observability to improve continuously. Build the stack well and you can adopt every new model release as a config change instead of a rewrite.