Part 8 begins the LLM half of the series. Calling a model is a network request like any other, but it has its own vocabulary: messages, tokens, cost, and rate limits. This part builds a small, well-behaved client around the Anthropic Claude Python SDK, explains tokens and cost in concrete terms, and adds retries and timeouts so a flaky call does not take down your endpoint.
What you will learn
- The messages request and response shape
- What tokens are and how they drive cost
- A reusable client with timeout and retry handling
- Reading usage so you can track spend per request
Info
Setup
Add the SDK with uv add anthropic and set ANTHROPIC_API_KEY in your environment. The examples use the model claude-opus-4-8.
1. The request loop
A call is a list of messages in and one message out. You send a role and content; the model returns content blocks and a stop reason. This is the whole core; everything else is refinement.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize FastAPI in one sentence."}],
)
print(resp.content[0].text)
print("stop reason:", resp.stop_reason)
2. Tokens and cost
Models read and write tokens, not characters. A token is roughly three quarters of a word in English. You pay per input token and per output token, so cost scales with how much context you send and how much the model writes back. The response carries a usage object with the exact counts.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
)
u = resp.usage
print(f"input tokens: {u.input_tokens}, output tokens: {u.output_tokens}")
# Multiply by the per-token price for each direction to get the cost.
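The multiplication in that last comment can be sketched as a small helper. The two prices are placeholder values, not real rates for any model, and `estimate_cost` is a name made up for this illustration; substitute the published prices for the model you actually call.

```python
from decimal import Decimal

# Placeholder prices in USD per million tokens. These are NOT real rates;
# replace them with the published prices for the model you call.
INPUT_PRICE_PER_MTOK = Decimal("3.00")
OUTPUT_PRICE_PER_MTOK = Decimal("15.00")


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in USD for one call."""
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * INPUT_PRICE_PER_MTOK / million
    output_cost = Decimal(output_tokens) * OUTPUT_PRICE_PER_MTOK / million
    return input_cost + output_cost


# With the placeholder prices, 1,200 input and 300 output tokens come to
# 0.0036 + 0.0045 = 0.0081 USD.
print(estimate_cost(1200, 300))
```

Decimal is used instead of float so that many small per-call amounts can be summed without rounding drift.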
Pro tip
Log input and output tokens on every call. One log line per request makes it trivial to spot the prompt that quietly tripled your bill.
3. A client that survives the real world
Networks fail and APIs rate limit. The SDK retries transient errors automatically, and you can tune it. Wrap the call in your own thin function so timeouts, retries, and error handling live in one place that the rest of the app and your tests can rely on.
import anthropic
from anthropic import Anthropic

# The SDK retries 429 and 5xx responses with backoff; these settings tune it.
client = Anthropic(max_retries=4, timeout=30.0)


def ask(prompt: str) -> str:
    """Send one prompt and return the text of the first content block."""
    try:
        resp = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.RateLimitError as e:
        # Raised only after the SDK's own retries are exhausted.
        raise RuntimeError("rate limited; back off and retry later") from e
    except anthropic.APIError as e:
        raise RuntimeError(f"model call failed: {e}") from e
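To make "retries with backoff" concrete, here is a rough illustration of the general technique: exponential backoff with jitter. It is not the SDK's actual implementation, and `TransientError` and `call_with_backoff` are hypothetical names invented for this sketch. In real code you would normally rely on the SDK's `max_retries` setting rather than write this yourself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a 429 or 5xx."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            # The base delay doubles each attempt: 0.5s, 1s, 2s, ...
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so tests can pass a recorder instead of actually waiting.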
Tip
Adaptive thinking
For harder prompts, pass thinking={"type": "adaptive"} and output_config={"effort": "high"}. The model decides how much to reason. Note that sampling parameters like temperature are not used on this model family.
Checkpoint
What mostly determines the cost of a single model call?
4. System prompts and message roles
Two levers shape a response. The system prompt sets the model role and standing rules for the whole conversation, and it is a separate parameter, not a message. The messages list carries the back and forth, where each entry has a role of user or assistant. Keep instructions that always apply in the system prompt and put the actual request in a user message; mixing the two makes prompts harder to reason about and to cache.
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    system="You are a concise technical assistant. Answer in at most three sentences.",
    messages=[{"role": "user", "content": "What is an embedding?"}],
)
print(resp.content[0].text)
5. Multi-turn conversations are your job to track
The API is stateless. It does not remember the previous turn, so to hold a conversation you send the whole history every time: the prior user and assistant messages, then the new user message. This is liberating, because you control exactly what context the model sees, and it is also where cost creeps in, since a long history is re-sent on every turn. Part 12 uses this same message list as an agent memory.
messages = [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex."},
    {"role": "user", "content": "What is my name?"},  # needs the history
]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=128, messages=messages)
print(resp.content[0].text)  # answers correctly because history was sent
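One way to keep that bookkeeping in a single place is a small wrapper that appends each turn and resends the full list. This is a sketch only: `Conversation` is a hypothetical helper, and `EchoClient` is an offline stand-in that mimics just enough of the client to show how the history grows.

```python
from types import SimpleNamespace
from typing import Any


class Conversation:
    """Keeps the message history and resends all of it on every turn."""

    def __init__(self, client: Any, model: str, max_tokens: int = 128) -> None:
        self.client = client
        self.model = model
        self.max_tokens = max_tokens
        self.messages: list[dict[str, str]] = []

    def send(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        resp = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=list(self.messages),  # a copy of the full history goes out
        )
        reply = resp.content[0].text
        # Store the assistant turn so the next call includes it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


class EchoClient:
    """Offline stand-in for the real client; reports how many messages it saw."""

    def __init__(self) -> None:
        self.messages = self  # so client.messages.create(...) resolves here

    def create(self, *, model: str, max_tokens: int, messages: list) -> SimpleNamespace:
        text = f"saw {len(messages)} message(s)"
        return SimpleNamespace(content=[SimpleNamespace(text=text)])


chat = Conversation(EchoClient(), model="claude-opus-4-8")
print(chat.send("My name is Alex."))  # first call carries 1 message
print(chat.send("What is my name?"))  # second call carries 3 messages
```

The message count per call grows with every turn, and so does the input side of the bill.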
6. Inject the client as a dependency
Do not construct a client inside every handler. Create it once and hand it to routes through FastAPI dependency injection, exactly as you did for other shared resources in Part 5. This gives you one place to configure retries and timeouts, and, crucially, a seam your tests can override with a fake, which is what made the model mocking in Part 7 possible.
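from functools import lru_cache

from anthropic import Anthropic
from fastapi import Depends


@lru_cache
def get_model_client() -> Anthropic:
    # Built once and cached, so every request shares the same client.
    return Anthropic(max_retries=4, timeout=30.0)


# A plain def, not async def: this client is synchronous, and calling it inside
# an async function would block the event loop. FastAPI runs sync handlers and
# dependencies in a thread pool.
def summarize(text: str, client: Anthropic = Depends(get_model_client)) -> str:
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text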
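The testing seam can be illustrated without any network access. In this sketch `FakeClient` and `FakeMessages` are hypothetical test doubles, and `summarize_with` repeats the call shape of `summarize` above minus the FastAPI wiring, so the example runs on the standard library alone.

```python
from types import SimpleNamespace


class FakeMessages:
    """Records each create() call and returns a canned reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: list[dict] = []

    def create(self, **kwargs) -> SimpleNamespace:
        self.calls.append(kwargs)
        return SimpleNamespace(content=[SimpleNamespace(text=self.reply)])


class FakeClient:
    """Hypothetical test double exposing the one attribute the handler uses."""

    def __init__(self, reply: str = "a short summary") -> None:
        self.messages = FakeMessages(reply)


def summarize_with(client, text: str) -> str:
    # Same call shape as summarize, with the client passed in directly.
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text


fake = FakeClient()
assert summarize_with(fake, "FastAPI docs") == "a short summary"
# The recorded call lets a test check what was actually sent.
assert fake.messages.calls[0]["max_tokens"] == 256
```

In a FastAPI app the same fake would typically be installed with `app.dependency_overrides[get_model_client] = lambda: fake`, so routes receive it in place of the real client.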
7. Budget tokens and guard cost
Two parameters bound spend. max_tokens caps how much the model can write in a single response, so set it to the real ceiling you need rather than a huge number. On the input side, the cost is whatever you send, so trim context, truncate overly long user input, and for long conversations summarize old turns instead of resending everything. Treat the usage numbers you log as a budget you actively manage, not a statistic you glance at.
from anthropic import Anthropic


def guarded_summary(client: Anthropic, text: str, max_input_chars: int = 8000) -> str:
    text = text[:max_input_chars]  # cap input you pay for
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=300,  # cap output you pay for
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    u = resp.usage
    if u.input_tokens + u.output_tokens > 5000:
        # alert or sample-log unusually expensive calls
        ...
    return resp.content[0].text
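For conversations, the simplest form of trimming is to keep only the most recent turns that fit a budget. The sketch below does that by character count, matching the character cap used above; `trim_history` is a hypothetical helper, and dropping old turns outright is a cruder substitute for summarizing them.

```python
def trim_history(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep the most recent messages whose combined content fits max_chars."""
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        size = len(msg["content"])
        if kept and total + size > max_chars:
            break  # budget exhausted; drop this message and everything older
        kept.append(msg)  # the newest message is always kept
        total += size
    kept.reverse()  # restore chronological order
    # Drop leading assistant turns so the list opens with a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 10},
    {"role": "assistant", "content": "b" * 10},
    {"role": "user", "content": "c" * 10},
    {"role": "assistant", "content": "d" * 10},
    {"role": "user", "content": "e" * 10},
]
print(len(trim_history(history, max_chars=35)))  # keeps the last three messages
```

A character budget is only a proxy for tokens; the usage numbers on the response remain the source of truth for what a call cost.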
Tip
Pick the right model for the job
Not every call needs the most capable model. Route simple, high-volume tasks like classification to a smaller, cheaper model and reserve the strongest model for the hard reasoning. The client and message shape stay identical; only the model string changes.
8. Handle stop reasons and partial output
A response is not always complete just because the call returned. The stop reason tells you why the model stopped, and your code should branch on it. A normal finish is end_turn. If the model hit your max_tokens cap, the answer is cut off and you should raise the cap or stream. A refusal means the model declined for safety reasons, in which case the output may not be what you asked for, so do not blindly parse it as a clean result.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "Write a long essay on databases."}],
)
if resp.stop_reason == "max_tokens":
    # output is truncated; raise max_tokens or stream the response instead
    text = resp.content[0].text + " ... [truncated]"
elif resp.stop_reason == "refusal":
    text = "The request could not be completed."
else:
    text = resp.content[0].text
print(text)
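The branches above read `resp.content[0].text`, which assumes the first content block is a text block. That may not hold when the response carries other block types (for example thinking blocks when thinking is enabled) or when the content list is empty. A more defensive sketch, using a hypothetical `extract_text` helper that treats only `end_turn` as a clean finish:

```python
from types import SimpleNamespace


def extract_text(resp) -> tuple[str, bool]:
    """Return (text, complete); complete is True only for an end_turn finish."""
    # Join every text block and skip blocks of any other type.
    text = "".join(
        block.text for block in resp.content if getattr(block, "type", None) == "text"
    )
    if resp.stop_reason == "refusal":
        return "The request could not be completed.", False
    if resp.stop_reason == "max_tokens":
        return text + " ... [truncated]", False
    return text, resp.stop_reason == "end_turn"


# Offline demo with a stand-in response object instead of a real API call.
demo = SimpleNamespace(
    content=[
        SimpleNamespace(type="thinking", thinking="(reasoning omitted)"),
        SimpleNamespace(type="text", text="Databases store data."),
    ],
    stop_reason="end_turn",
)
print(extract_text(demo))
```

Returning the completeness flag alongside the text lets callers decide whether to parse the result, retry with a higher cap, or surface an error.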
9. Record what every call did
Operating an LLM feature means knowing, after the fact, what happened. For each call, log the model, the input and output token counts, the stop reason, the latency, and a request id. With that, a spike in cost is traceable to a specific prompt, a rise in truncated answers shows up as max_tokens stop reasons, and a latency regression is visible. None of this is exotic; it is the same operational hygiene as any external dependency, applied to one that happens to bill per token.
import logging
import time

log = logging.getLogger("llm_app.model")


def ask_logged(client, prompt: str) -> str:
    start = time.perf_counter()
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    ms = (time.perf_counter() - start) * 1000
    u = resp.usage
    log.info(
        f"model=claude-opus-4-8 in={u.input_tokens} out={u.output_tokens} "
        f"stop={resp.stop_reason} ms={ms:.0f}"
    )
    return resp.content[0].text
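`ask_logged` covers the model, token counts, stop reason, and latency, but not the request id mentioned above. One possible shape for a fuller record is a flat dictionary that serializes to a single JSON log line. In this sketch `call_record` is a hypothetical helper and the id is a locally generated correlation id, which is an assumption of the example rather than an identifier issued by the API.

```python
import json
import logging
import time
import uuid
from types import SimpleNamespace

log = logging.getLogger("llm_app.model")


def call_record(model: str, resp, started: float, request_id: str) -> dict:
    """Build one flat, JSON-friendly record describing a finished call."""
    return {
        "request_id": request_id,
        "model": model,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "stop_reason": resp.stop_reason,
        "latency_ms": round((time.perf_counter() - started) * 1000),
    }


# Offline demo with a stand-in response object instead of a real API call.
started = time.perf_counter()
fake_resp = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=42, output_tokens=7),
    stop_reason="end_turn",
)
record = call_record("claude-opus-4-8", fake_resp, started, request_id=uuid.uuid4().hex)
log.info(json.dumps(record))
```

Key-value records like this are easy to filter and aggregate later, for example counting `max_tokens` stop reasons per day.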
Info
This client is a dependency like any other
By the end of the series this logged, retrying, budget-aware client is injected into FastAPI, mocked in tests, and shared across endpoints. Treating the model as a normal, well-instrumented dependency is what separates a demo from something you can run in production.
The bottom line
A model call is a network request with a token meter attached. Keep it behind one small client with timeout and retry handling, log usage so cost is visible, and you have a dependency you can inject into FastAPI and override in tests. The next problem is reliability of the output itself: getting the model to return structured data you can trust.
Frequently asked questions
How do I keep keys out of code?
Load them through the settings object from Part 5 and inject the client as a FastAPI dependency. Never hardcode a key.
Why is my output cut off?
You hit max_tokens. Raise it, and for long outputs stream the response, which Part 11 covers.