Published: Jun 10, 2026

Calling LLMs from Python: The Request Loop, Tokens, Cost, and Retries

Part 8 begins the LLM half of the series. Calling a model is a network request like any other, but it has its own vocabulary: messages, tokens, cost, and rate limits. This part builds a small, well-behaved client around the Anthropic Claude Python SDK, explains tokens and cost in concrete terms, and adds retries and timeouts so a flaky call does not take down your endpoint.

What you will learn

  • The messages request and response shape
  • What tokens are and how they drive cost
  • A reusable client with timeout and retry handling
  • Reading usage so you can track spend per request

Info

Setup

Add the SDK with uv add anthropic and set ANTHROPIC_API_KEY in your environment. The examples use the model claude-opus-4-8.

1. The request loop

A call is a list of messages in and one message out. You send a role and content, the model returns content blocks and a stop reason. This is the whole core; everything else is refinement.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize FastAPI in one sentence."}],
)

print(resp.content[0].text)
print("stop reason:", resp.stop_reason)
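
Indexing resp.content[0] assumes the first block is a text block. A minimal sketch of a more defensive extraction follows, assuming each block exposes a type attribute and text blocks expose text. The helper name response_text is hypothetical, and the SimpleNamespace objects only stand in for a real response.

```python
from types import SimpleNamespace


def response_text(content) -> str:
    """Join the text of every text block, skipping other block types."""
    return "".join(
        block.text for block in content if getattr(block, "type", None) == "text"
    )


# Stand-in blocks for illustration; a real call returns SDK objects instead.
fake_content = [
    SimpleNamespace(type="thinking", thinking="(reasoning omitted)"),
    SimpleNamespace(type="text", text="FastAPI is a Python web framework."),
]
print(response_text(fake_content))
```

With a real response you would call response_text(resp.content) wherever the examples use resp.content[0].text.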

2. Tokens and cost

Models read and write tokens, not characters. A token is roughly three quarters of a word in English. You pay per input token and per output token, so cost scales with how much context you send and how much the model writes back. The response carries a usage object with the exact counts.

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "List three uses for embeddings."}],
)
u = resp.usage
print(f"input tokens: {u.input_tokens}, output tokens: {u.output_tokens}")
# Multiply by the per-token price for each direction to get the cost.
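
The multiplication in that last comment can be sketched as below. The prices are placeholders chosen for the arithmetic, not real rates, and the function names are hypothetical. The word-based estimate only applies the three-quarters rule of thumb and is no substitute for the counts in usage.

```python
def rough_token_estimate(text: str) -> int:
    """Estimate tokens from the word count, using ~0.75 words per token."""
    words = len(text.split())
    return round(words / 0.75)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars, given prices quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder prices for illustration only; look up the real ones for your model.
INPUT_PRICE = 5.00    # assumed dollars per million input tokens
OUTPUT_PRICE = 25.00  # assumed dollars per million output tokens

cost = estimate_cost(1_000, 500, INPUT_PRICE, OUTPUT_PRICE)
print(f"${cost:.4f}")
```

Under these assumed prices, 1,000 input tokens cost 0.005 dollars and 500 output tokens cost 0.0125 dollars, so the call costs 0.0175 dollars. Output tokens dominate here even though there are fewer of them.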

Pro tip

Log input and output tokens on every call. Two numbers per request make it trivial to spot the prompt that quietly tripled your bill.

3. A client that survives the real world

Networks fail and APIs rate limit. The SDK retries transient errors automatically, and you can tune it. Wrap the call in your own thin function so timeouts, retries, and error handling live in one place that the rest of the app and your tests can rely on.

import anthropic
from anthropic import Anthropic

client = Anthropic(max_retries=4, timeout=30.0)  # SDK retries 429 and 5xx with backoff

def ask(prompt: str) -> str:
    try:
        resp = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
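    except anthropic.RateLimitError as e:
        # Reached only after the SDK has used up its own retries
        raise RuntimeError("rate limited; back off and retry later") from e
    except anthropic.APIError as e:
        # "from e" keeps the original exception in the traceback
        raise RuntimeError(f"model call failed: {e}") from e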
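
To make "retry with backoff" concrete, here is a generic sketch of exponential backoff with jitter. It is not how the SDK implements its retries, and the names call_with_backoff, retry_on, and sleep are hypothetical. It illustrates the idea for cases where you retry at the application level, for example around a whole multi-step job.

```python
import random
import time


def call_with_backoff(fn, *, retry_on=(Exception,), max_attempts=4,
                      base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait up to base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads out clients that failed at the same moment.
            sleep(random.uniform(0, delay))


attempts = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"


# sleep is injected so the demo does not actually wait.
result = call_with_backoff(flaky, retry_on=(TimeoutError,), sleep=lambda s: None)
print(result, attempts["n"])
```

Be careful about stacking this on top of the SDK's own retries: attempts multiply, so four outer attempts around a client with max_retries=4 can mean many more requests than you expect.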

Tip

Adaptive thinking

For harder prompts, pass thinking={"type": "adaptive"} and output_config={"effort": "high"}. The model decides how much to reason. Note that sampling parameters like temperature are not used on this model family.

Checkpoint

What mostly determines the cost of a single model call?

4. System prompts and message roles

Two levers shape a response. The system prompt sets the model's role and standing rules for the whole conversation, and it is a separate parameter, not a message. The messages list carries the back and forth, where each entry has a role of user or assistant. Keep instructions that always apply in the system prompt and put the actual request in a user message; mixing the two makes prompts harder to reason about and to cache.

from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    system="You are a concise technical assistant. Answer in at most three sentences.",
    messages=[{"role": "user", "content": "What is an embedding?"}],
)
print(resp.content[0].text)
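
One way to keep the two levers apart is to build the request in a single place, so the standing rules never leak into the messages list. This is a sketch only; build_request and STANDING_RULES are hypothetical names, and the returned dictionary is meant to be unpacked into client.messages.create.

```python
STANDING_RULES = [
    "You are a concise technical assistant.",
    "Answer in at most three sentences.",
]


def build_request(user_text: str, *, model: str = "claude-opus-4-8",
                  max_tokens: int = 512) -> dict:
    """Return keyword arguments for client.messages.create()."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": " ".join(STANDING_RULES),  # identical on every call
        "messages": [{"role": "user", "content": user_text}],  # varies per call
    }


kwargs = build_request("What is an embedding?")
# A real call would then be: client.messages.create(**kwargs)
print(kwargs["system"])
```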

5. Multi-turn conversations are your job to track

The API is stateless. It does not remember the previous turn, so to hold a conversation you send the whole history every time: the prior user and assistant messages, then the new user message. This is liberating, because you control exactly what context the model sees, and it is also where cost creeps in, since a long history is re-sent on every turn. Part 12 uses this same message list as an agent memory.

messages = [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex."},
    {"role": "user", "content": "What is my name?"},   # needs the history
]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=128, messages=messages)
print(resp.content[0].text)   # answers correctly because history was sent
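
Because the history is yours to manage, a small container helps keep it bounded. The sketch below assumes a simple sliding window is acceptable for your use case; Conversation, max_messages, and window are hypothetical names. It trims the window so it still starts with a user turn.

```python
class Conversation:
    """Holds the message history that must be re-sent on every turn."""

    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages
        self.messages: list[dict] = []

    def add(self, role: str, content: str) -> None:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        self.messages.append({"role": role, "content": content})

    def window(self) -> list[dict]:
        """Most recent messages, trimmed so the list starts with a user turn."""
        recent = self.messages[-self.max_messages:]
        while recent and recent[0]["role"] != "user":
            recent = recent[1:]
        return recent


chat = Conversation(max_messages=3)
chat.add("user", "My name is Alex.")
chat.add("assistant", "Nice to meet you, Alex.")
chat.add("user", "What is my name?")
chat.add("assistant", "Your name is Alex.")
chat.add("user", "Thanks!")
print(chat.window())  # the first two messages have fallen out of the window
```

The trade-off is visible in the demo: once "My name is Alex." leaves the window, the model can no longer see it. A window bounds cost, but anything that must survive has to be carried some other way, such as a summary of old turns.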

6. Inject the client as a dependency

Do not construct a client inside every handler. Create it once and hand it to routes through FastAPI dependency injection, exactly as you did for other shared resources in Part 5. This gives you one place to configure retries and timeouts, and, crucially, a seam your tests can override with a fake, which is what made the model mocking in Part 7 possible.

from functools import lru_cache
from anthropic import Anthropic
from fastapi import Depends

@lru_cache
def get_model_client() -> Anthropic:
    return Anthropic(max_retries=4, timeout=30.0)

# Plain def on purpose: the SDK call blocks, and FastAPI runs sync handlers in a threadpool.
def summarize(text: str, client: Anthropic = Depends(get_model_client)) -> str:
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text
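
The test seam can be illustrated without FastAPI or the network. In this sketch FakeClient and FakeMessages are hypothetical stand-ins that mimic only messages.create, and summarize is repeated as a plain function so the block runs on its own.

```python
from types import SimpleNamespace


class FakeMessages:
    """Mimics the one method the handler uses: messages.create(...)."""

    def __init__(self, reply: str):
        self.reply = reply
        self.calls = []  # keyword arguments of each call, for assertions

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return SimpleNamespace(
            content=[SimpleNamespace(type="text", text=self.reply)],
            stop_reason="end_turn",
            usage=SimpleNamespace(input_tokens=12, output_tokens=5),
        )


class FakeClient:
    def __init__(self, reply: str = "A short summary."):
        self.messages = FakeMessages(reply)


def summarize(text: str, client) -> str:
    # Same body as the handler above, minus the FastAPI wiring.
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return resp.content[0].text


fake = FakeClient()
print(summarize("a long document", fake))
print(fake.messages.calls[0]["max_tokens"])
```

In a FastAPI test the same fake would typically be registered with app.dependency_overrides[get_model_client] = lambda: fake, where app is your application object, so the route receives it instead of the real client.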

7. Budget tokens and guard cost

Two levers bound spend: the output cap and the size of the input. max_tokens caps how much the model can write in a single response, so set it to the real ceiling you need rather than a huge number. On the input side, the cost is whatever you send, so trim context, truncate overly long user input, and for long conversations summarize old turns instead of resending everything. Treat the usage numbers you log as a budget you actively manage, not a statistic you glance at.

def guarded_summary(client: Anthropic, text: str, max_input_chars: int = 8000) -> str:
    text = text[:max_input_chars]          # cap input you pay for
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=300,                    # cap output you pay for
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    u = resp.usage
    if u.input_tokens + u.output_tokens > 5000:
        # alert or sample-log unusually expensive calls
        ...
    return resp.content[0].text
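
Managing usage as a budget can be as simple as a running total checked before each call. The following is a sketch under that assumption; TokenBudget and its methods are hypothetical, and the limit is an arbitrary number for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Running total of tokens spent, checked against a fixed limit."""
    limit: int
    spent: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += input_tokens + output_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.spent)

    def allows(self, estimated_tokens: int) -> bool:
        """True if a call of roughly this size still fits in the budget."""
        return self.spent + estimated_tokens <= self.limit


budget = TokenBudget(limit=10_000)
budget.record(input_tokens=2_400, output_tokens=300)  # values from resp.usage
print(budget.remaining, budget.allows(8_000))
```

You might keep one budget per user or per request path and refuse or downgrade a call when allows returns False. How you scope and reset the budget is a product decision this sketch does not make.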

Tip

Pick the right model for the job

Not every call needs the most capable model. Route simple, high-volume tasks like classification to a smaller, cheaper model and reserve the strongest model for the hard reasoning. The client and message shape stay identical; only the model string changes.
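
Since only the model string changes, routing can be a one-line lookup. In this sketch the task labels are assumptions and SMALL_MODEL is a placeholder, not a real model name; substitute the identifier of whichever smaller model you use.

```python
STRONG_MODEL = "claude-opus-4-8"       # the model used throughout this part
SMALL_MODEL = "your-smaller-model-id"  # placeholder, not a real model name

SIMPLE_TASKS = {"classify", "extract", "route"}  # assumed task labels


def pick_model(task: str) -> str:
    """Send simple, high-volume tasks to the cheaper model."""
    return SMALL_MODEL if task in SIMPLE_TASKS else STRONG_MODEL


print(pick_model("classify"))
print(pick_model("plan_migration"))
```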

8. Handle stop reasons and partial output

A response is not always complete just because the call returned. The stop reason tells you why the model stopped, and your code should branch on it. A normal finish is end_turn. If the model hit your max_tokens cap, the answer is cut off and you should raise the cap or stream. A refusal means the model declined for safety reasons, in which case the output may not be what you asked for, so do not blindly parse it as a clean result.

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    messages=[{"role": "user", "content": "Write a long essay on databases."}],
)

if resp.stop_reason == "max_tokens":
    # output is truncated; raise max_tokens or stream the response instead
    text = resp.content[0].text + " ... [truncated]"
elif resp.stop_reason == "refusal":
    text = "The request could not be completed."
else:
    text = resp.content[0].text
print(text)
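
The same branching can be wrapped in a function that returns an explicit completeness flag, so callers cannot parse a truncated answer by accident. This sketch treats only end_turn as complete, which you would widen if you use stop sequences or tools. Outcome and interpret are hypothetical names, and the SimpleNamespace object stands in for a real response.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Outcome:
    text: str
    complete: bool  # False when the text must not be treated as a full answer
    stop_reason: str


def interpret(resp) -> Outcome:
    """Turn a response into text plus a flag saying whether it finished normally."""
    text = "".join(
        b.text for b in resp.content if getattr(b, "type", None) == "text"
    )
    if resp.stop_reason == "refusal":
        return Outcome("", False, resp.stop_reason)
    # Anything other than a normal finish is reported as incomplete.
    return Outcome(text, resp.stop_reason == "end_turn", resp.stop_reason)


truncated = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(type="text", text="Databases are")],
)
print(interpret(truncated))
```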

9. Record what every call did

Operating an LLM feature means knowing, after the fact, what happened. For each call, log the model, the input and output token counts, the stop reason, the latency, and a request id. With that, a spike in cost is traceable to a specific prompt, a rise in truncated answers shows up as max_tokens stop reasons, and a latency regression is visible. None of this is exotic; it is the same operational hygiene as any external dependency, applied to one that happens to bill per token.

import logging
import time

log = logging.getLogger("llm_app.model")

def ask_logged(client, prompt: str) -> str:
    start = time.perf_counter()
    resp = client.messages.create(
        model="claude-opus-4-8", max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    ms = (time.perf_counter() - start) * 1000
    u = resp.usage
    # Lazy %-style arguments: the string is only built if the record is emitted.
    # resp.id is the message id; resp.model is the model that actually answered.
    log.info("id=%s model=%s in=%d out=%d stop=%s ms=%.0f",
             resp.id, resp.model, u.input_tokens, u.output_tokens,
             resp.stop_reason, ms)
    return resp.content[0].text
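
Once each call leaves a record, the signals described above fall out of a simple aggregation. This is a sketch assuming you collect the logged fields into records yourself; CallRecord and summarize_calls are hypothetical names and the sample numbers are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    stop_reason: str
    latency_ms: float


def summarize_calls(records: list[CallRecord]) -> dict:
    """Aggregate per-call records into cost, truncation, and latency signals."""
    if not records:
        return {"calls": 0}
    stops = Counter(r.stop_reason for r in records)
    return {
        "calls": len(records),
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in records),
        "truncated_share": stops["max_tokens"] / len(records),
        "max_latency_ms": max(r.latency_ms for r in records),
    }


# Made-up sample data for illustration.
sample = [
    CallRecord("claude-opus-4-8", 100, 50, "end_turn", 800.0),
    CallRecord("claude-opus-4-8", 4000, 512, "max_tokens", 2500.0),
]
print(summarize_calls(sample))
```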

Info

This client is a dependency like any other

By the end of the series this logged, retrying, budget-aware client is injected into FastAPI, mocked in tests, and shared across endpoints. Treating the model as a normal, well-instrumented dependency is what separates a demo from something you can run in production.

The bottom line

A model call is a network request with a token meter attached. Keep it behind one small client with timeout and retry handling, log usage so cost is visible, and you have a dependency you can inject into FastAPI and override in tests. The next problem is reliability of the output itself: getting the model to return structured data you can trust.

Frequently asked questions

How do I keep keys out of code?

Load them through the settings object from Part 5 and inject the client as a FastAPI dependency. Never hardcode a key.

Why is my output cut off?

You hit max_tokens. Raise it, and for long outputs stream the response, which Part 11 covers.

Up next: Part 9, structured outputs and function calling.
