An LLM that takes 12 seconds to respond feels broken. The same response, streamed token-by-token, feels instant. Streaming is no longer a nice-to-have; it is the baseline UX for any AI feature that talks to users. The good news is that Laravel 12 ships every primitive you need: HTTP streaming responses, queued jobs, generator-friendly controllers, and Laravel Reverb for first-class WebSockets.
This guide builds a production-grade streaming chat: SSE for the simple case, Reverb for multi-tab and multi-device sync. We will use Claude as the model, but the pattern works for any provider that supports server-sent token streams: OpenAI, Mistral, Gemini, even self-hosted Ollama. The core insight: streaming is not a feature you bolt on at the end. It changes how you structure the controller, the queue, and the frontend reducer all at once.
Why streaming matters
Three reasons, in order of importance:
- Perceived latency drops to near zero. Users start reading at 300ms instead of waiting 8 seconds for a full reply. Total wall-clock time barely changes; the experience changes completely.
- Abandonment goes down. Without streaming, users hit refresh, type again, or leave. With streaming, they wait because they can see progress.
- You catch problems earlier. If the model starts answering the wrong question, the user can stop the stream at token 50 instead of waiting for token 800.
Three transports: when to pick which
| Transport | Direction | Best for | Avoid when |
|---|---|---|---|
| Polling | Pull | Background jobs, status updates | You need sub-second latency |
| SSE | Server → client | Single-user AI chat, live logs | Bidirectional or multi-consumer fan-out |
| WebSockets (Reverb) | Bidirectional | Shared conversations, collaborative editing | A single tab is the only consumer |
Two streaming modes, one backend
- SSE (Server-Sent Events): perfect when one browser tab is the consumer. Simple, no auth handshake, works with vanilla fetch, automatic reconnection in EventSource (we will use fetch instead because EventSource is GET-only).
- Reverb / WebSockets: required when the same conversation needs to stream to multiple devices, dashboards, or shared inboxes. Also the right call when the agent itself runs in a queue worker (the worker cannot hold an HTTP response open).
Step 1: Streaming SSE endpoint
Laravel's stream() response is the cleanest way to push tokens. It flushes the buffer per chunk, so the browser sees text as it arrives.
// routes/web.php
Route::post('/chat/stream', StreamChatController::class);
// app/Http/Controllers/StreamChatController.php
namespace App\Http\Controllers;
use App\Services\Ai\ClaudeStreamer;
use Illuminate\Http\Request;
use Symfony\Component\HttpFoundation\StreamedResponse;
class StreamChatController extends Controller
{
public function __invoke(Request $request, ClaudeStreamer $streamer): StreamedResponse
{
$data = $request->validate([
'messages' => ['required', 'array', 'max:50'],
'messages.*.role' => ['required', 'in:user,assistant'],
'messages.*.content' => ['required', 'string', 'max:4000'],
]);
return response()->stream(function () use ($streamer, $data) {
foreach ($streamer->stream($data['messages']) as $event) {
echo "event: {$event['type']}\n";
echo 'data: ' . json_encode($event['payload']) . "\n\n";
if (ob_get_level() > 0) ob_flush();
flush();
}
}, 200, [
'Content-Type' => 'text/event-stream',
'Cache-Control' => 'no-cache',
'X-Accel-Buffering' => 'no',
]);
}
}
The X-Accel-Buffering header is critical when you sit behind Nginx — without it the proxy buffers the response and the user sees nothing until the model finishes.
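A related failure mode in the same closure: if the user closes the tab mid-reply, PHP keeps pulling tokens from the provider until the stream ends. A minimal sketch of the loop with a dropped-connection check added (not part of the controller above):
// Inside the response()->stream() closure: stop early when the reader is gone
foreach ($streamer->stream($data['messages']) as $event) {
    // PHP only notices a dropped connection after it has tried to write,
    // so this check fires on the iteration after the last flush.
    if (connection_aborted()) {
        break;
    }
    echo "event: {$event['type']}\n";
    echo 'data: ' . json_encode($event['payload']) . "\n\n";
    if (ob_get_level() > 0) ob_flush();
    flush();
}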
Step 2: The streamer service
This is a generator. It yields events as they come from the Anthropic SDK; the controller decides what to do with them.
// app/Services/Ai/ClaudeStreamer.php
namespace App\Services\Ai;
use Anthropic\Anthropic;
use Generator;
class ClaudeStreamer
{
public function __construct(private Anthropic $client) {}
public function stream(array $messages): Generator
{
$stream = $this->client->messages->createStreamed([
'model' => 'claude-sonnet-4-6',
'max_tokens' => 1024,
'system' => 'You are a concise assistant. Use short paragraphs.',
'messages' => $messages,
]);
foreach ($stream as $event) {
if ($event->type === 'content_block_delta' && $event->delta?->type === 'text_delta') {
yield ['type' => 'token', 'payload' => ['text' => $event->delta->text]];
}
if ($event->type === 'message_stop') {
yield ['type' => 'done', 'payload' => ['usage' => $event->usage ?? null]];
}
}
}
}
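Because the service is a plain generator, nothing ties it to HTTP: the same loop works in a test or an Artisan command, which is handy for poking at prompts without a frontend. A minimal sketch (the command name and class are assumptions, not part of the app above):
// app/Console/Commands/ChatCli.php (sketch)
namespace App\Console\Commands;
use App\Services\Ai\ClaudeStreamer;
use Illuminate\Console\Command;
class ChatCli extends Command
{
    protected $signature = 'chat:cli {prompt}';
    protected $description = 'Stream a single reply to the terminal';
    public function handle(ClaudeStreamer $streamer): int
    {
        $messages = [['role' => 'user', 'content' => $this->argument('prompt')]];
        foreach ($streamer->stream($messages) as $event) {
            if ($event['type'] === 'token') {
                // Print tokens as they arrive instead of waiting for the full reply.
                $this->output->write($event['payload']['text']);
            }
        }
        $this->newLine();
        return self::SUCCESS;
    }
}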
Step 3: Browser consumer
Use the Fetch streaming API. EventSource only supports GET, which is wrong for chat; POST lets you send a message body cleanly.
// resources/js/chat.js
async function streamChat(messages, onToken) {
const res = await fetch('/chat/stream', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-CSRF-TOKEN': window.csrfToken,
},
body: JSON.stringify({ messages }),
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
let sep;
while ((sep = buffer.indexOf('\n\n')) !== -1) {
const chunk = buffer.slice(0, sep);
buffer = buffer.slice(sep + 2);
const event = parseSse(chunk);
if (event.type === 'token') onToken(event.data.text);
}
}
}
function parseSse(chunk) {
const lines = chunk.split('\n');
let type = 'message', data = {};
for (const line of lines) {
if (line.startsWith('event:')) type = line.slice(6).trim();
if (line.startsWith('data:')) data = JSON.parse(line.slice(5).trim());
}
return { type, data };
}
Step 4: Reverb for shared conversations
Once you need the same stream to land on multiple clients (think: customer in browser + agent in admin panel), switch to broadcast events. A queued job runs the LLM call and emits each token over a private channel.
composer require laravel/reverb
php artisan reverb:install
php artisan reverb:start
// app/Events/AiTokenStreamed.php
namespace App\Events;
use Illuminate\Broadcasting\Channel;
use Illuminate\Broadcasting\InteractsWithSockets;
use Illuminate\Broadcasting\PrivateChannel;
use Illuminate\Contracts\Broadcasting\ShouldBroadcastNow;
use Illuminate\Queue\SerializesModels;
// ShouldBroadcastNow pushes each token to Reverb synchronously. ShouldBroadcast
// would queue every token behind the very job that is emitting it, so tokens
// would only arrive once the reply is finished.
class AiTokenStreamed implements ShouldBroadcastNow
{
use InteractsWithSockets, SerializesModels;
public function __construct(
public int $conversationId,
public string $token,
public string $eventType = 'token',
) {}
public function broadcastOn(): Channel
{
return new PrivateChannel("conversations.{$this->conversationId}");
}
public function broadcastAs(): string
{
return $this->eventType;
}
}
// app/Jobs/StreamAiReply.php
public function handle(ClaudeStreamer $streamer): void
{
foreach ($streamer->stream($this->messages) as $event) {
broadcast(new AiTokenStreamed(
conversationId: $this->conversationId,
token: $event['payload']['text'] ?? '',
eventType: $event['type'],
));
}
}
// routes/channels.php
Broadcast::channel('conversations.{id}', function (User $user, int $id) {
return $user->canAccessConversation($id);
});
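The dispatch side is not shown above. The handle() method implies a constructor that receives the conversation id and the message history; assuming exactly that, plus a hypothetical SendMessageController and route, kicking off a reply looks roughly like this:
// app/Http/Controllers/SendMessageController.php (sketch)
namespace App\Http\Controllers;
use App\Jobs\StreamAiReply;
use Illuminate\Http\Request;
class SendMessageController extends Controller
{
    public function __invoke(Request $request, int $conversationId)
    {
        $data = $request->validate([
            'content' => ['required', 'string', 'max:4000'],
        ]);
        // A real app would persist the user message and load the prior history here.
        StreamAiReply::dispatch($conversationId, [
            ['role' => 'user', 'content' => $data['content']],
        ]);
        // Return immediately; the queue worker streams the reply over the private channel.
        return response()->json(['status' => 'queued'], 202);
    }
}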
Step 5: Echo client for Reverb
// resources/js/echo.js
import Echo from 'laravel-echo';
import Pusher from 'pusher-js';
window.Pusher = Pusher;
window.Echo = new Echo({
broadcaster: 'reverb',
key: import.meta.env.VITE_REVERB_APP_KEY,
wsHost: import.meta.env.VITE_REVERB_HOST,
wsPort: import.meta.env.VITE_REVERB_PORT,
forceTLS: false,
enabledTransports: ['ws', 'wss'],
});
Echo.private(`conversations.${conversationId}`)
.listen('.token', (e) => appendToken(e.token))
.listen('.done', () => finalizeMessage());
Proxy & server configuration
Streaming gets sabotaged at the infra layer more often than at the application layer. Configure these before you debug your code:
| Layer | Setting | Why |
|---|---|---|
| PHP-FPM | `output_buffering=Off` | Without this, PHP collects the whole response before flushing |
| Nginx | `proxy_buffering off; proxy_read_timeout 180s;` | Default is 60s; LLMs routinely exceed that |
| Cloudflare | Disable "Auto Minify" on the chat path | Minification batches output and breaks SSE framing |
| CDN | Bypass cache for `/chat/stream` | Streaming responses must never be cached |
| Browser | `Cache-Control: no-cache, no-transform` | Some intermediaries gzip and buffer otherwise |
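The PHP-FPM row is the one most often missed because it fails silently. If you want to catch it at runtime rather than during an incident, a small sanity check (a sketch, not something the stack requires) can log a warning from the streaming code path:
// e.g. at the top of the stream closure, or in a deploy smoke test (sketch)
use Illuminate\Support\Facades\Log;
$buffering = ini_get('output_buffering');
if ($buffering !== false && !in_array(strtolower((string) $buffering), ['', '0', 'off'], true)) {
    // Tokens will sit in PHP's output buffer instead of reaching the client promptly.
    Log::warning("output_buffering is set to '{$buffering}' on this pool; SSE will feel laggy.");
}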
Reconnection & cancellation
Real chat sessions break. The user closes the lid, the train enters a tunnel, the worker restarts mid-stream. Build for it:
- Persist partial replies. Save the assistant message to the DB when the stream starts, then update it on `message_stop`. Reload it if the user reconnects.
- Server-side cancellation flag. Store a Redis key like `cancel:{conversationId}`; the streamer checks it between deltas and stops cleanly (a sketch follows the client snippet below).
- Client-side AbortController. When the user types a new message, abort the inflight fetch — otherwise tokens from the old reply will keep arriving.
- Deduplicate by event ID. If you reconnect and the server replays from buffer, the client should skip events it has already rendered.
// Client cancellation
let controller = null;
function send(messages) {
controller?.abort();
controller = new AbortController();
fetch('/chat/stream', { signal: controller.signal, /* ... */ });
}
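On the server side, the flag check is a few lines. It can live in the streamer or in the job; the sketch below does it in the job's loop, where the conversation id already lives (the key would be set by a hypothetical POST /chat/{id}/cancel endpoint, which is not shown):
// app/Jobs/StreamAiReply.php — handle() with the cancellation check added (sketch)
use Illuminate\Support\Facades\Cache;
public function handle(ClaudeStreamer $streamer): void
{
    foreach ($streamer->stream($this->messages) as $event) {
        // Cache::pull reads and clears the flag in one call; back it with Redis in production.
        if (Cache::pull("cancel:{$this->conversationId}")) {
            return; // stop cleanly between deltas
        }
        broadcast(new AiTokenStreamed(
            conversationId: $this->conversationId,
            token: $event['payload']['text'] ?? '',
            eventType: $event['type'],
        ));
    }
}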
Production gotchas
- Disable PHP-FPM output buffering. Set `output_buffering=Off` for the streaming worker pool.
- Use a long timeout. A 30-second LLM call dies behind default proxy timeouts. Set `proxy_read_timeout` to at least 180 seconds.
- Send keep-alive comments. If the model thinks for >15 seconds before the first token, push `: ping\n\n` every 10 seconds to keep proxies happy.
- Persist on done, not on token. Buffer the assistant message in memory and write it to the DB once when the stream ends; DB writes per token will melt your database.
- Cancel inflight requests. When the user sends a new message before the previous one finishes, the client should abort the fetch and the job should detect a cancellation flag.
- Rate limit by user. A user holding the same SSE response open for hours is one socket per session. A throttle middleware on `/chat/stream` protects you from runaway tab-loops (see the sketch after this list).
- Fan-out via Redis when scaling Reverb. Multiple Reverb processes need a shared backplane to broadcast across them.
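A minimal sketch of that per-user throttle using Laravel's named rate limiters (the limiter name and the 10-per-minute figure are arbitrary choices, not requirements):
// app/Providers/AppServiceProvider.php
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\RateLimiter;
public function boot(): void
{
    RateLimiter::for('chat-stream', function (Request $request) {
        // Key the limit by user when authenticated, by IP otherwise.
        return Limit::perMinute(10)->by($request->user()?->id ?: $request->ip());
    });
}
// routes/web.php
Route::post('/chat/stream', StreamChatController::class)
    ->middleware('throttle:chat-stream');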
Streaming is not about speed. It is about turning waiting into reading. The total time barely changes; the perceived latency drops to zero.
Start with SSE. Move to Reverb only when you need fan-out. Both pipelines share the same generator service, so the migration is a matter of swapping the delivery layer, not rewriting the model integration.