Calling LLMs from Python (Series Part 8)

Part 8 begins the LLM half of the series. Calling a model is a network request like any other, but it has its own vocabulary: messages, tokens, cost, and rate limits. This part builds a small, well-behaved client around the Anthropic Claude Python SDK, explains tokens and cost in concrete terms, and adds retries and timeouts so a flaky call does not take down your endpoint.

What you will learn

The messages request and response shape
What tokens are and how they drive cost
A reusable client with timeout and retry handling
Reading usage so you can track spend per request

Info

Setup

Add the SDK with uv add anthropic and set ANTHROPIC_API_KEY in your environment. The examples use the model claude-opus-4-8.

1. The request loop

A call is a list of messages in and one message out. You send a role and content; the model returns content blocks and a stop reason. This is the whole core; everything else is refinement.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize FastAPI in one sentence."}],
)

print(resp.content[0].text)
print("stop reason:", resp.stop_reason)

2. Tokens and cost

Models read and write tokens, not characters. A token is roughly three quarters of a word in English. You pay per input token and per output token, so cost scales with how much context you send and how much the model writes back. The response carries a usage object with the exact counts.

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
)
u = resp.usage
print(f"input tokens: {u.input_tokens}, output tokens: {u.output_tokens}")
# Multiply by the per-token price for each direction to get the cost.
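
The multiplication in that last comment can be sketched as a small helper. The prices below are placeholders for illustration, not real rates, and the per-million-token unit is an assumption about how your price sheet is quoted; `estimate_cost_usd` is a hypothetical name.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the estimated cost of one call in US dollars.

    Prices are expressed per million tokens (an assumption; adjust the
    divisor if your price sheet uses another unit).
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only: 3.00 per million input
# tokens and 15.00 per million output tokens.
cost = estimate_cost_usd(
    1_200, 350, input_price_per_mtok=3.00, output_price_per_mtok=15.00
)
print(f"estimated cost: ${cost:.6f}")
```

In a real handler you would pass `resp.usage.input_tokens` and `resp.usage.output_tokens` straight into the helper.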

Pro tip

Log input and output tokens on every call. A single number per request makes it trivial to spot the prompt that quietly tripled your bill.

3. A client that survives the real world

Networks fail and APIs rate limit. The SDK retries transient errors automatically, and you can tune it. Wrap the call in your own thin function so timeouts, retries, and error handling live in one place that the rest of the app and your tests can rely on.

import anthropic
from anthropic import Anthropic

client = Anthropic(max_retries=4, timeout=30.0)  # SDK retries 429 and 5xx with backoff

def ask(prompt: str) -> str:
    try:
        resp = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
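    except anthropic.RateLimitError as e:
        # Raised only after the SDK has used up its own retries.
        raise RuntimeError("rate limited; back off and retry later") from e
    except anthropic.APIError as e:
        # "from e" keeps the original traceback attached for debugging.
        raise RuntimeError(f"model call failed: {e}") from e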
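
To make "retries with backoff" concrete, here is a generic sketch of exponential backoff with jitter. It is not the SDK's internal implementation, and `retry_with_backoff` and `TransientError` are hypothetical names; in practice the `max_retries` setting on the client does this for you.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (429, 5xx, timeout)."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries; let the caller handle it
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2**attempt)
            # Jitter spreads out clients that failed at the same moment.
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


# Usage: a fake call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays: list[float] = []
# Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, attempts["n"], len(delays))
```

Passing `sleep` as a parameter is a design choice that keeps the sketch testable without real waiting.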

Tip

Adaptive thinking

For harder prompts, pass thinking={"type": "adaptive"} and output_config={"effort": "high"}. The model decides how much to reason. Note that sampling parameters like temperature are not used on this model family.
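
One practical consequence, assuming a response with thinking enabled can carry non-text content blocks ahead of the answer: `resp.content[0].text` may no longer be the answer text. A defensive sketch that collects text blocks only follows; `extract_text` is a hypothetical helper, and the fake response merely mimics the `type` and `text` attributes of SDK content blocks.

```python
from types import SimpleNamespace


def extract_text(resp) -> str:
    """Join the text of every content block whose type is "text"."""
    return "".join(
        block.text
        for block in resp.content
        if getattr(block, "type", None) == "text"
    )


# Stand-in response: a thinking block followed by a text block.
fake_resp = SimpleNamespace(
    content=[
        SimpleNamespace(type="thinking", thinking="(reasoning omitted)"),
        SimpleNamespace(type="text", text="An embedding is a vector."),
    ]
)
print(extract_text(fake_resp))
```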

Checkpoint

What mostly determines the cost of a single model call?

4. System prompts and message roles

Two levers shape a response. The system prompt sets the model's role and standing rules for the whole conversation, and it is a separate parameter, not a message. The messages list carries the back and forth, where each entry has a role of user or assistant. Keep instructions that always apply in the system prompt and put the actual request in a user message; mixing the two makes prompts harder to reason about and to cache.

from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    system="You are a concise technical assistant. Answer in at most three sentences.",
    messages=[{"role": "user", "content": "What is an embedding?"}],
)
print(resp.content[0].text)

5. Multi-turn conversations are your job to track

The API is stateless. It does not remember the previous turn, so to hold a conversation you send the whole history every time: the prior user and assistant messages, then the new user message. This is liberating, because you control exactly what context the model sees, and it is also where cost creeps in, since a long history is re-sent on every turn. Part 12 uses this same message list as an agent memory.

messages = [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex."},
    {"role": "user", "content": "What is my name?"},   # needs the history
]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=128, messages=messages)
print(resp.content[0].text)   # answers correctly because history was sent
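
Since tracking history is your job, it helps to keep it in one object. The following is a minimal sketch, not a prescribed pattern: `Conversation` is a hypothetical helper, and the fake client exists only so the example runs without a network call.

```python
from types import SimpleNamespace


class Conversation:
    """Keeps the message history and re-sends it on every turn."""

    def __init__(self, client, model: str, max_tokens: int = 256):
        self.client = client
        self.model = model
        self.max_tokens = max_tokens
        self.messages: list[dict] = []

    def send(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        resp = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=self.messages,
        )
        reply = resp.content[0].text
        # Store the assistant turn so the next call includes it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


class FakeMessages:
    """Stand-in for client.messages that records what was sent."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(list(kwargs["messages"]))  # snapshot of the history
        text = f"reply {len(self.calls)}"
        return SimpleNamespace(content=[SimpleNamespace(type="text", text=text)])


fake = SimpleNamespace(messages=FakeMessages())
chat = Conversation(fake, model="claude-opus-4-8")
chat.send("My name is Alex.")
chat.send("What is my name?")
# The second request carried three messages: user, assistant, user.
print(len(fake.messages.calls[1]))
```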

6. Inject the client as a dependency

Do not construct a client inside every handler. Create it once and hand it to routes through FastAPI dependency injection, exactly as you did for other shared resources in Part 5. This gives you one place to configure retries and timeouts, and, crucially, a seam your tests can override with a fake, which is what made the model mocking in Part 7 possible.

from functools import lru_cache
from anthropic import Anthropic
from fastapi import Depends

@lru_cache
def get_model_client() -> Anthropic:
    return Anthropic(max_retries=4, timeout=30.0)

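# Plain def on purpose: the Anthropic client is synchronous, so calling it inside
# an async def would block the event loop. FastAPI runs def routes in a threadpool.
def summarize(text: str, client: Anthropic = Depends(get_model_client)) -> str: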
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text
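
To show what that test seam buys you, here is a sketch of a fake client a test could supply instead of the real one. `FakeModelClient` and `summarize_with` are hypothetical names; the fake implements only `messages.create`, which is all the handler above calls.

```python
from types import SimpleNamespace


class FakeModelClient:
    """Test double exposing the one method the handlers use: messages.create."""

    def __init__(self, reply: str):
        self.reply = reply
        self.requests: list[dict] = []
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        self.requests.append(kwargs)  # record the request for assertions
        return SimpleNamespace(
            content=[SimpleNamespace(type="text", text=self.reply)],
            stop_reason="end_turn",
            usage=SimpleNamespace(input_tokens=10, output_tokens=5),
        )


def summarize_with(client, text: str) -> str:
    """Same body as the handler above, with the client passed explicitly."""
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text


fake = FakeModelClient(reply="A short summary.")
print(summarize_with(fake, "FastAPI is a web framework."))
# In a FastAPI test the same fake would typically be wired in with
# app.dependency_overrides[get_model_client] = lambda: fake
```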

7. Budget tokens and guard cost

Two parameters bound spend. max_tokens caps how much the model can write in a single response, so set it to the real ceiling you need rather than a huge number. On the input side, the cost is whatever you send, so trim context, truncate overly long user input, and for long conversations summarize old turns instead of resending everything. Treat the usage numbers you log as a budget you actively manage, not a statistic you glance at.

def guarded_summary(client: Anthropic, text: str, max_input_chars: int = 8000) -> str:
    text = text[:max_input_chars]          # cap input you pay for
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=300,                    # cap output you pay for
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    u = resp.usage
    if u.input_tokens + u.output_tokens > 5000:
        # alert or sample-log unusually expensive calls
        ...
    return resp.content[0].text
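
For the conversation case, a simpler alternative to summarizing old turns is to drop them. The sketch below keeps the most recent messages that fit a character budget; `trim_history` is a hypothetical helper, it assumes every `content` is a plain string, and characters are only a rough proxy for tokens.

```python
def trim_history(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep the most recent messages whose combined content fits max_chars.

    The newest message is always kept, and leading assistant turns are
    dropped so the result still opens with a user message.
    """
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):
        size = len(msg["content"])
        if kept and total + size > max_chars:
            break
        kept.append(msg)
        total += size
    kept.reverse()
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 50},
    {"role": "assistant", "content": "b" * 50},
    {"role": "user", "content": "c" * 50},
    {"role": "assistant", "content": "d" * 50},
    {"role": "user", "content": "e" * 50},
]
trimmed = trim_history(history, max_chars=160)
print([m["role"] for m in trimmed])
```

Dropping turns loses information that a summary would keep, so treat this as the cheap baseline rather than the final design.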

Tip

Pick the right model for the job

Not every call needs the most capable model. Route simple, high-volume tasks like classification to a smaller, cheaper model and reserve the strongest model for the hard reasoning. The client and message shape stay identical; only the model string changes.
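
A routing rule like this can be as small as a lookup. In the sketch below the smaller model's identifier is a placeholder, not a real model name, and the task labels are hypothetical; substitute the ones your app and provider actually use.

```python
# Placeholder identifier: substitute a real smaller model from your provider's list.
SMALL_MODEL = "your-smaller-model-id"
LARGE_MODEL = "claude-opus-4-8"

# Hypothetical task labels; define the ones your app actually has.
SIMPLE_TASKS = {"classify", "extract", "route"}


def pick_model(task: str) -> str:
    """Return the cheaper model for simple tasks and the strongest otherwise."""
    return SMALL_MODEL if task in SIMPLE_TASKS else LARGE_MODEL


print(pick_model("classify"))
print(pick_model("plan_migration"))
```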

8. Handle stop reasons and partial output

A response is not always complete just because the call returned. The stop reason tells you why the model stopped, and your code should branch on it. A normal finish is end_turn. If the model hit your max_tokens cap, the answer is cut off and you should raise the cap or stream. A refusal means the model declined for safety reasons, in which case the output may not be what you asked for, so do not blindly parse it as a clean result.

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "Write a long essay on databases."}],
)

if resp.stop_reason == "max_tokens":
    # output is truncated; raise max_tokens or stream the response instead
    text = resp.content[0].text + " ... [truncated]"
elif resp.stop_reason == "refusal":
    text = "The request could not be completed."
else:
    text = resp.content[0].text
print(text)
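
If callers parse the output, it can be safer to raise on an incomplete response than to return a patched string. This is one possible sketch; `TruncatedOutput`, `ModelRefused`, and `text_or_raise` are hypothetical names, and the fake responses stand in for real SDK objects.

```python
from types import SimpleNamespace


class TruncatedOutput(Exception):
    """The model hit max_tokens before finishing."""


class ModelRefused(Exception):
    """The model declined the request."""


def text_or_raise(resp) -> str:
    """Return the response text only for a stop reason the caller expects."""
    if resp.stop_reason == "max_tokens":
        raise TruncatedOutput("output cut off at max_tokens")
    if resp.stop_reason == "refusal":
        raise ModelRefused("model declined the request")
    if resp.stop_reason not in ("end_turn", "stop_sequence"):
        # Anything unexpected (for example a tool call) is surfaced, not parsed.
        raise ValueError(f"unhandled stop reason: {resp.stop_reason}")
    return resp.content[0].text


def fake_response(stop_reason: str, text: str = "done"):
    """Build a stand-in response with one text block."""
    return SimpleNamespace(
        stop_reason=stop_reason,
        content=[SimpleNamespace(type="text", text=text)],
    )


print(text_or_raise(fake_response("end_turn")))
try:
    text_or_raise(fake_response("max_tokens"))
except TruncatedOutput as exc:
    print("caught:", exc)
```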

9. Record what every call did

Operating an LLM feature means knowing, after the fact, what happened. For each call, log the model, the input and output token counts, the stop reason, the latency, and a request id. With that, a spike in cost is traceable to a specific prompt, a rise in truncated answers shows up as max_tokens stop reasons, and a latency regression is visible. None of this is exotic; it is the same operational hygiene as any external dependency, applied to one that happens to bill per token.

import logging
import time

log = logging.getLogger("llm_app.model")

def ask_logged(client, prompt: str) -> str:
    start = time.perf_counter()
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    ms = (time.perf_counter() - start) * 1000
    u = resp.usage
    # resp.id is the message id returned by the API; it serves as the per-call identifier.
    # Lazy %-style arguments skip the string formatting when INFO is disabled.
    log.info(
        "model=%s id=%s in=%d out=%d stop=%s ms=%.0f",
        resp.model, resp.id, u.input_tokens, u.output_tokens, resp.stop_reason, ms,
    )
    return resp.content[0].text
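
Log lines answer questions after the fact; a running tally answers them at a glance. The sketch below aggregates the same fields per feature so the most expensive prompt and the count of truncated answers are one lookup away. `UsageTally` and its labels are hypothetical, and a production setup would more likely push these numbers to a metrics system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Totals:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    truncated: int = 0  # calls that ended with stop_reason == "max_tokens"


class UsageTally:
    """In-memory token totals keyed by a caller-chosen label."""

    def __init__(self) -> None:
        self.by_label: dict[str, Totals] = defaultdict(Totals)

    def record(
        self, label: str, input_tokens: int, output_tokens: int, stop_reason: str
    ) -> None:
        t = self.by_label[label]
        t.calls += 1
        t.input_tokens += input_tokens
        t.output_tokens += output_tokens
        if stop_reason == "max_tokens":
            t.truncated += 1

    def most_expensive(self) -> str:
        """Label with the highest combined token count."""
        return max(
            self.by_label,
            key=lambda k: self.by_label[k].input_tokens
            + self.by_label[k].output_tokens,
        )


tally = UsageTally()
tally.record("summarize", 900, 200, "end_turn")
tally.record("summarize", 1100, 256, "max_tokens")
tally.record("classify", 120, 8, "end_turn")
print(tally.most_expensive(), tally.by_label["summarize"].truncated)
```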

Info

This client is a dependency like any other

By the end of the series this logged, retrying, budget-aware client is injected into FastAPI, mocked in tests, and shared across endpoints. Treating the model as a normal, well-instrumented dependency is what separates a demo from something you can run in production.

The bottom line

A model call is a network request with a token meter attached. Keep it behind one small client with timeout and retry handling, log usage so cost is visible, and you have a dependency you can inject into FastAPI and override in tests. The next problem is reliability of the output itself: getting the model to return structured data you can trust.

Frequently asked questions

How do I keep keys out of code?

Load them through the settings object from Part 5 and inject the client as a FastAPI dependency. Never hardcode a key.
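
The Part 5 settings object is not reproduced here, so the following is a standard-library sketch of the same idea: read the key from the environment and fail fast when it is missing. `load_api_key` is a hypothetical helper.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the key from the environment, failing fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the app")
    return key


# Typical use at startup (left as a comment so the sketch runs without a key):
# client = Anthropic(api_key=load_api_key())
```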

Why is my output cut off?

You hit max_tokens. Raise it, and for long outputs stream the response, which Part 11 covers.

Up next: Part 9, structured outputs and function calling.

Calling LLMs from Python: The Request Loop, Tokens, Cost, and Retries

Bishrul Haq
