Part 3 is where speed comes from. LLM and API calls spend almost all their time waiting on the network, which is exactly the case async Python is built for. This part explains the event loop without hand-waving, shows how to fan out many calls at once with httpx and asyncio, and covers the mistakes that quietly make async code run serially.
What you will learn
- What async and await actually do, in plain terms
- Running many network calls concurrently with asyncio.gather
- A reusable async httpx client and why you keep it open
- Bounding concurrency so you do not overwhelm an API
1. Async in one idea
A coroutine is a function that can pause at an await point and let the event loop run something else while it waits. For work that is bound by waiting on I/O, that means one thread can keep dozens of calls in flight at once. It does not make CPU-bound code faster.
Three calls that each wait 0.3 seconds, launched together with asyncio.gather, finish in about 0.3 seconds total, not 0.9, because they wait at the same time. Replace gather with a plain loop that awaits each call in turn and they run one after another.
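The timing claim above can be checked with a short sketch. It assumes asyncio.sleep as a stand-in for a network wait, and fake_call, run_concurrently, and run_serially are hypothetical names made up for the illustration.

```python
import asyncio
import time


async def fake_call(delay: float) -> float:
    # stand-in for a network call: the coroutine pauses here and the loop is free
    await asyncio.sleep(delay)
    return delay


async def run_concurrently(delays: list[float]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(d) for d in delays))  # all waits overlap
    return time.perf_counter() - start


async def run_serially(delays: list[float]) -> float:
    start = time.perf_counter()
    for d in delays:
        await fake_call(d)  # each call finishes before the next one starts
    return time.perf_counter() - start


async def main() -> None:
    delays = [0.3, 0.3, 0.3]
    print(f"gather: {await run_concurrently(delays):.2f}s")
    print(f"loop:   {await run_serially(delays):.2f}s")


asyncio.run(main())
```

The exact numbers vary by machine, but the gather version should land near the longest single delay and the loop version near the sum of all delays.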
2. One client, many requests
httpx is an HTTP client with a first-class async API. Create one AsyncClient and reuse it, because it pools connections. Creating a new client per request throws that pool away and is a common performance bug.
import asyncio
import httpx


async def get_json(client: httpx.AsyncClient, url: str) -> dict:
    resp = await client.get(url, timeout=10.0)
    resp.raise_for_status()
    return resp.json()


async def main() -> None:
    urls = [f"https://httpbin.org/anything/{i}" for i in range(5)]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(get_json(client, u) for u in urls))
    print(f"fetched {len(results)} responses")


asyncio.run(main())
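One thing the fan-out above leaves open is what happens when a single call fails: by default gather raises the first exception and the other results are lost to the caller. A minimal sketch of one option, gather's return_exceptions=True flag, is below. It uses a hypothetical fake_fetch in place of get_json so it runs without a network.

```python
import asyncio


async def fake_fetch(i: int) -> dict:
    # hypothetical stand-in for get_json: request 2 fails, the rest succeed
    await asyncio.sleep(0.01)
    if i == 2:
        raise ValueError(f"request {i} failed")
    return {"id": i}


async def fetch_all(n: int) -> tuple[list[dict], list[BaseException]]:
    # return_exceptions=True puts exceptions into the result list instead of raising
    results = await asyncio.gather(
        *(fake_fetch(i) for i in range(n)), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed


async def main() -> None:
    ok, failed = await fetch_all(5)
    print(f"{len(ok)} succeeded, {len(failed)} failed")


asyncio.run(main())
```

Results come back in submission order, so a failed slot can be matched to the input that produced it. Whether partial results are acceptable depends on the workload; when the calls must succeed or fail together, letting the exception propagate is the simpler choice.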
3. Bounding concurrency
Firing a thousand requests at once is a good way to get rate limited or run out of sockets. A semaphore caps how many run at the same time while still keeping the pipeline full.
import asyncio
import httpx


async def limited_get(sem: asyncio.Semaphore, client: httpx.AsyncClient, url: str) -> int:
    async with sem:  # at most N in flight at once
        resp = await client.get(url, timeout=10.0)
        return resp.status_code


async def main() -> None:
    sem = asyncio.Semaphore(8)
    async with httpx.AsyncClient() as client:
        urls = [f"https://httpbin.org/status/200?i={i}" for i in range(50)]
        codes = await asyncio.gather(*(limited_get(sem, client, u) for u in urls))
    print("all ok:", all(c == 200 for c in codes))


asyncio.run(main())
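To see the cap hold without touching a real API, the sketch below counts how many simulated calls are in flight at once. InFlight, limited_call, and peak_concurrency are hypothetical helpers, and asyncio.sleep stands in for the network wait.

```python
import asyncio


class InFlight:
    """Tracks how many simulated calls are running at the same time."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0


async def limited_call(sem: asyncio.Semaphore, stats: InFlight) -> None:
    async with sem:
        stats.current += 1
        stats.peak = max(stats.peak, stats.current)
        await asyncio.sleep(0.01)  # stand-in for the network wait
        stats.current -= 1


async def peak_concurrency(total: int, limit: int) -> int:
    sem = asyncio.Semaphore(limit)
    stats = InFlight()
    await asyncio.gather(*(limited_call(sem, stats) for _ in range(total)))
    return stats.peak


async def main() -> None:
    print("peak in flight:", await peak_concurrency(total=50, limit=8))


asyncio.run(main())
```

All 50 coroutines are created up front, but only as many as the semaphore allows get past the async with line at any moment. As one finishes, a waiting one takes its slot, which is what keeps the pipeline full.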
✓ Do
- Reuse one AsyncClient across requests
- Use asyncio.gather to run independent calls together
- Set a timeout on every network call
✕ Avoid
- Awaiting calls one by one in a loop, which runs them serially
- Calling a blocking library inside async code, which stalls the loop
- Unbounded concurrency, which triggers rate limits and socket errors
Checkpoint
Why does async speed up LLM and API calls but not heavy math?
4. Timeouts and cancellation
A network call that never returns can hang a request indefinitely. Always set a timeout, both on the client and, for whole operations, with asyncio.timeout (available from Python 3.11). When the deadline passes, the work inside the block is cancelled and a TimeoutError is raised, so you handle the error instead of leaking a stuck task. This matters even more for model calls, which can be slow, because a hung call ties up a worker that could be serving someone else.
import asyncio
import httpx


async def fetch_with_deadline(client: httpx.AsyncClient, url: str) -> int:
    try:
        async with asyncio.timeout(5.0):  # whole-operation deadline
            resp = await client.get(url)
            return resp.status_code
    except TimeoutError:
        return 504  # treat a slow upstream as a gateway timeout


async def main() -> None:
    async with httpx.AsyncClient() as client:
        print(await fetch_with_deadline(client, "https://httpbin.org/delay/1"))


asyncio.run(main())
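On interpreters older than Python 3.11, asyncio.wait_for gives a similar whole-operation deadline. The sketch below uses it to show the cancellation side of the story: the finally block in the cancelled coroutine still runs, which is where connections or locks would be released. slow_call and call_with_deadline are hypothetical names, and asyncio.sleep stands in for a slow upstream.

```python
import asyncio

cleanup_log: list[str] = []


async def slow_call(name: str) -> str:
    try:
        await asyncio.sleep(1.0)  # stand-in for a slow upstream
        return name
    finally:
        # runs on normal completion and on cancellation alike
        cleanup_log.append(f"{name} cleaned up")


async def call_with_deadline(name: str, seconds: float) -> str:
    try:
        # wait_for cancels slow_call once the deadline passes
        return await asyncio.wait_for(slow_call(name), timeout=seconds)
    except asyncio.TimeoutError:
        return "timed out"


async def main() -> None:
    print(await call_with_deadline("report", 0.05))
    print(cleanup_log)


asyncio.run(main())
```

Catching asyncio.TimeoutError works on old and new versions, since from Python 3.11 it is an alias of the built-in TimeoutError.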
5. Reacting as results arrive
asyncio.gather waits for everything before giving you any result. Sometimes you want to handle each result the moment it is ready, for example to stream progress to a user or to stop early once you have enough. asyncio.as_completed yields awaitables that resolve in completion order, not submission order.
import asyncio


async def work(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return name


async def main() -> None:
    tasks = [work("slow", 0.3), work("fast", 0.1), work("mid", 0.2)]
    for coro in asyncio.as_completed(tasks):
        result = await coro
        print("ready:", result)  # prints fast, mid, slow as they finish


asyncio.run(main())
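The "stop early once you have enough" case mentioned above needs one extra step: cancelling whatever is still running. A minimal sketch follows, assuming a hypothetical first_n helper. It wraps each coroutine in a task up front so the leftovers can be cancelled by handle.

```python
import asyncio


async def work(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return name


async def first_n(jobs: dict[str, float], n: int) -> list[str]:
    # create real tasks so the leftovers can be cancelled afterwards
    tasks = [asyncio.create_task(work(name, delay)) for name, delay in jobs.items()]
    done: list[str] = []
    for next_result in asyncio.as_completed(tasks):
        done.append(await next_result)
        if len(done) == n:
            break
    for t in tasks:
        t.cancel()  # no effect on tasks that already finished
    await asyncio.gather(*tasks, return_exceptions=True)  # let cancellations settle
    return done


async def main() -> None:
    print(await first_n({"slow": 0.3, "fast": 0.05, "mid": 0.1}, n=2))


asyncio.run(main())
```

With these delays the function returns fast and mid, and the slow task is cancelled rather than left running in the background.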
6. Do not block the event loop
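This is the mistake that quietly ruins async performance. The event loop runs on one thread, so any blocking call inside a coroutine (a synchronous library, a heavy computation, time.sleep) freezes every other task while it runs. The fix for a blocking library is asyncio.to_thread, which pushes the work onto a worker thread so the loop stays free to handle other requests. Threads help most when the blocking call is itself waiting on I/O; pure-Python number crunching still competes for the interpreter lock, so heavy computation is usually better handed to a separate process.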
import asyncio


def parse_large_file(path: str) -> int:
    # a blocking, mostly disk-bound function from a sync library
    with open(path) as f:
        return sum(1 for _ in f)


async def handler(path: str) -> int:
    # run the blocking call off the event loop so the server stays responsive
    return await asyncio.to_thread(parse_large_file, path)
Warning
One blocking call stalls everything
time.sleep, requests.get, and synchronous database drivers all block the loop. Inside async code use asyncio.sleep, an async client like httpx, and an async driver, or wrap the blocking call in asyncio.to_thread.
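The stall is easy to observe. The sketch below runs a heartbeat task alongside a blocking call and counts how many times the heartbeat gets to tick, once with the call made directly and once through asyncio.to_thread. blocking_job, heartbeat, and count_ticks are hypothetical names, and time.sleep stands in for any synchronous library call.

```python
import asyncio
import time


def blocking_job(seconds: float) -> str:
    time.sleep(seconds)  # stand-in for a synchronous library call
    return "done"


async def heartbeat(ticks: list[float], interval: float) -> None:
    # records a timestamp every interval; missing ticks reveal a stalled loop
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


async def count_ticks(offload: bool) -> int:
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks, 0.02))
    await asyncio.sleep(0)  # give the heartbeat a chance to record its first tick
    if offload:
        await asyncio.to_thread(blocking_job, 0.2)  # loop keeps running
    else:
        blocking_job(0.2)  # loop is frozen for the whole call
    beat.cancel()
    await asyncio.gather(beat, return_exceptions=True)
    return len(ticks)


async def main() -> None:
    print("ticks while blocking the loop:", await count_ticks(offload=False))
    print("ticks with asyncio.to_thread: ", await count_ticks(offload=True))


asyncio.run(main())
```

In the blocking run the heartbeat records only its initial tick, because nothing else can run until blocking_job returns. With to_thread it keeps ticking throughout; the exact count depends on timing.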
7. Structured concurrency with TaskGroup
Modern asyncio favors TaskGroup (added in Python 3.11) over loose gather calls. A task group scopes a set of tasks: if any task fails, the others are cancelled and the error propagates out of the async with block wrapped in an ExceptionGroup, so you never leave orphaned tasks running in the background. It is the safer default for launching several calls that should succeed or fail as a unit.
import asyncio
import httpx


async def main() -> None:
    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:
            a = tg.create_task(client.get("https://httpbin.org/get"))
            b = tg.create_task(client.get("https://httpbin.org/uuid"))
        # both are done here; a failure in either cancels the other
        print(a.result().status_code, b.result().status_code)


asyncio.run(main())
8. A reusable async client with retry and backoff
In a real app you do not scatter raw httpx calls everywhere. You wrap them in a small client so timeout, retry, and error handling live in one place that the rest of the code and your tests can rely on. Transient failures, a brief network blip or a rate limit, deserve a retry with exponential backoff, while a 404 should fail immediately. The class below encodes that policy once.
import asyncio
import httpx


class ApiClient:
    def __init__(self, base_url: str, retries: int = 3) -> None:
        # one shared client, so every call reuses the same connection pool
        self._client = httpx.AsyncClient(base_url=base_url, timeout=10.0)
        self._retries = retries

    async def get_json(self, path: str) -> dict:
        last_error: Exception | None = None
        for attempt in range(self._retries):
            try:
                resp = await self._client.get(path)
                resp.raise_for_status()
                return resp.json()
            except httpx.HTTPStatusError as e:
                status = e.response.status_code
                if status < 500 and status != 429:
                    raise  # client error: do not retry
                last_error = e  # 5xx or 429 rate limit: retry
            except httpx.TransportError as e:
                last_error = e  # network failure or timeout: retry
            if attempt < self._retries - 1:
                await asyncio.sleep(2 ** attempt)  # 1s, then 2s, between attempts
        raise RuntimeError(f"giving up on {path} after {self._retries} tries") from last_error

    async def aclose(self) -> None:
        await self._client.aclose()

Two details make this production-friendly. It retries only server errors, 429 rate-limit responses, and transport failures, never another 4xx that will fail the same way every time, and it backs off exponentially so a struggling upstream is not hammered. It also skips the sleep after the final attempt, so the caller is not kept waiting once the outcome is decided. The aclose method matters because the client holds a connection pool that should be closed on shutdown.
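The backoff schedule itself can be exercised without any HTTP at all. The sketch below pulls the retry loop into a hypothetical retry helper that takes the sleep function as a parameter, so a test can record the delays instead of waiting through them. TransientError is a made-up marker exception standing in for the server and transport errors above.

```python
import asyncio
from collections.abc import Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying."""


async def retry(
    call: Callable[[], Awaitable[str]],
    retries: int = 3,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    for attempt in range(retries):
        try:
            return await call()
        except TransientError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            await sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("retries must be at least 1")


async def demo() -> tuple[str, list[float]]:
    attempts = 0
    delays: list[float] = []

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise TransientError("blip")
        return "ok"

    async def record_sleep(seconds: float) -> None:
        delays.append(seconds)  # record the delay instead of actually waiting

    result = await retry(flaky, retries=3, sleep=record_sleep)
    return result, delays


print(asyncio.run(demo()))  # ('ok', [1, 2])
```

Injecting the sleep function is one design option among several; patching asyncio.sleep in tests would also work, at the cost of touching a global.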
9. Lifecycle: open once, close cleanly
Where do you create and destroy that client in a long-running server? Not per request, which throws away connection pooling, and not as a leaked global. FastAPI gives you a lifespan hook that runs on startup and shutdown, the natural place to open shared clients and close them. You will see this exact pattern again when you wire the model client into the app.
from contextlib import asynccontextmanager
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # ApiClient is the class defined in section 8
    app.state.api = ApiClient(base_url="https://api.example.com")
    try:
        yield  # app runs here
    finally:
        await app.state.api.aclose()  # clean shutdown


app = FastAPI(lifespan=lifespan)
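The open-once, close-cleanly shape is plain contextlib and can be tried without a web server. In the sketch below, FakeClient is a hypothetical stand-in for a pooled client, a dict plays the role of app.state, and serve is a made-up function that simulates the app running and crashing. It is an illustration of the pattern, not of how FastAPI drives the hook internally.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeClient:
    """Hypothetical stand-in for a pooled client such as ApiClient."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


@asynccontextmanager
async def lifespan(state: dict):
    state["api"] = FakeClient()  # startup: open shared resources once
    try:
        yield
    finally:
        await state["api"].aclose()  # shutdown: runs even if the body raised


async def serve(state: dict, crash: bool) -> None:
    async with lifespan(state):
        if crash:
            raise RuntimeError("server crashed")


async def main() -> None:
    state: dict = {}
    try:
        await serve(state, crash=True)
    except RuntimeError:
        pass
    print("closed after crash:", state["api"].closed)


asyncio.run(main())
```

Wrapping the yield in try/finally is a general habit for context managers that own resources: the cleanup then runs on a normal exit and on an error alike.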
Pro tip
A good async client is boring on purpose: one place for timeouts, one retry policy, one connection pool, and a clean shutdown. Boring here means no mystery latency, no socket exhaustion, and tests that can swap the whole client for a fake.
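Swapping the client for a fake can look like the sketch below. It assumes the rest of the code depends only on a get_json method; JsonApi, FakeApi, and user_display_name are hypothetical names, and the /users/7 path is made up for the example.

```python
import asyncio
from typing import Protocol


class JsonApi(Protocol):
    """Anything with an async get_json method, real client or fake."""

    async def get_json(self, path: str) -> dict: ...


async def user_display_name(api: JsonApi, user_id: int) -> str:
    # hypothetical business logic that only depends on get_json
    data = await api.get_json(f"/users/{user_id}")
    return data.get("name", "unknown")


class FakeApi:
    """Test double: returns canned payloads and records requested paths."""

    def __init__(self, payloads: dict[str, dict]) -> None:
        self.payloads = payloads
        self.calls: list[str] = []

    async def get_json(self, path: str) -> dict:
        self.calls.append(path)
        return self.payloads[path]


async def main() -> None:
    fake = FakeApi({"/users/7": {"name": "Ada"}})
    print(await user_display_name(fake, 7), fake.calls)


asyncio.run(main())
```

Because the function under test receives the client as an argument, the test needs no network, no patching, and no running server.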
The bottom line
Async Python is the right tool when your program spends its time waiting on the network, which describes almost every LLM and API workload. Reuse one client, fan out with gather, and bound concurrency with a semaphore. FastAPI is async first, so the next part puts these exact patterns to work inside real endpoints.
Frequently asked questions
Should every function be async?
No. Make functions async when they perform awaitable I/O. Mixing a blocking call into an async function stalls the whole event loop.
How do I run blocking code from async?
Use asyncio.to_thread to push a blocking call onto a worker thread so the event loop stays responsive.
Up next: Part 4, FastAPI fundamentals.