
Streaming LLM Responses: SSE vs WebSockets

If you are building a chat interface on top of GPT, Claude, or any other large language model, you will eventually face a transport decision: should you stream tokens over Server-Sent Events (SSE) or WebSockets? This guide is for backend and full-stack engineers who already have a working LLM call and now need to deliver those tokens to the browser without a multi-second wait. Streaming LLM responses is what turns a sluggish “loading…” spinner into the familiar typewriter effect users expect, so picking the right channel matters more than it first appears.

Both options work. However, they pull your architecture in different directions. SSE is a one-way HTTP stream that piggybacks on infrastructure you already run. WebSockets open a persistent two-way connection that unlocks richer interaction at the cost of more moving parts. By the end of this comparison, you will know which transport fits your app, see production-grade server code for each, and recognize the mistakes that bite teams in production.

Why Streaming LLM Responses Matters

Streaming LLM responses means sending each token to the client as the model generates it, instead of buffering the full answer and returning it in one response. A typical model emits tokens over several seconds. Without streaming, the user stares at a blank screen the entire time. With streaming, the first words appear in a few hundred milliseconds, which dramatically lowers perceived latency and keeps people engaged.

The actual generation speed does not change. What changes is the experience. Time-to-first-token becomes the metric users feel, not total completion time. For a deeper look at the client side of this pattern, see our guide on building an AI chatbot with streaming responses. The transport you choose simply decides how those tokens travel from your server to the browser.

What Is Server-Sent Events (SSE)?

Server-Sent Events is a standard browser API for receiving a one-way stream of text events over a single long-lived HTTP connection. The server sets a text/event-stream content type and writes messages in a simple data: format. The browser consumes them through the native EventSource object, which also handles automatic reconnection.
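To make the wire format concrete, here is a minimal sketch of a serializer for one event. The helper name formatSseEvent and its argument shape are assumptions for illustration, not part of any library. It shows the optional event: and id: fields, the rule that a multi-line payload becomes one data: line per line, and the blank line that ends each event.

```javascript
// Hypothetical helper: serialize one event in the text/event-stream format.
function formatSseEvent({ data, event, id }) {
  const lines = [];
  if (event) lines.push(`event: ${event}`); // optional named event type
  if (id) lines.push(`id: ${id}`); // optional id, echoed back as Last-Event-ID on reconnect
  // A payload containing newlines must be sent as one "data:" line per line.
  for (const line of String(data).split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  // The trailing blank line terminates the event.
  return lines.join("\n") + "\n\n";
}
```

Sending tokens as JSON, as the examples in this guide do, sidesteps the multi-line case because JSON.stringify escapes newlines inside the payload.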

SSE rides entirely on HTTP. There is no protocol upgrade, no special handshake, and no separate port. Because it is plain HTTP, it works through most proxies, CDNs, and load balancers without extra configuration. That simplicity is exactly why nearly every major LLM provider, including OpenAI and Anthropic, ships SSE as the default streaming format for their own APIs.

The trade-off is direction. SSE flows from server to client only. To send a new prompt, the client makes a separate ordinary HTTP request. For most chat apps, this is fine, since user messages are infrequent compared to the flood of tokens coming back.

What Are WebSockets?

WebSockets provide a full-duplex connection between client and server over a single TCP socket. After an initial HTTP handshake that upgrades the connection, both sides can send messages at any time with very low overhead. Unlike SSE, WebSockets carry binary or text frames in both directions, which makes them the natural choice for genuinely interactive, bidirectional workloads.

For LLM apps, that bidirectional power becomes valuable when the user needs to interrupt generation, stream microphone audio for a voice agent, or receive live tool-call updates while the model is still thinking. If your app already relies on WebSockets for other real-time features, reusing that channel for token streaming avoids running two systems side by side. Our Node.js WebSocket chat server walkthrough covers the connection lifecycle in detail.

The cost is operational. WebSockets are stateful connections that your infrastructure must hold open, route correctly, and clean up. That changes how you scale, as the later sections explain.

SSE vs WebSockets: Key Differences

The table below summarizes the practical differences that matter when streaming LLM responses.

| Factor | Server-Sent Events (SSE) | WebSockets |
| --- | --- | --- |
| Direction | One-way (server to client) | Two-way (full duplex) |
| Protocol | Plain HTTP | WebSocket protocol over TCP, after an HTTP upgrade |
| Browser API | Native EventSource | Native WebSocket |
| Auto-reconnect | Built in | Manual |
| Custom auth headers | Not supported by EventSource | Not supported by the browser WebSocket API (non-browser clients can set them) |
| Proxy / CDN friendliness | High | Sometimes needs config |
| Provider default | Yes (OpenAI, Anthropic) | No |
| Best for | Chat completions, one prompt at a time | Voice, interrupts, live tool updates |

The headline takeaway is that SSE optimizes for simplicity and infrastructure reuse, whereas WebSockets optimize for interactivity. Most text chat apps lean SSE. Apps with audio, interrupts, or dense back-and-forth lean WebSockets.

How SSE Streaming Works With an LLM

On the server, you open the SSE stream, iterate the model’s token stream, and write each token as a data: event. The example below uses Express and the OpenAI SDK, but the same pattern applies to FastAPI, Spring, or any framework that lets you flush a response incrementally.

import express from "express";
import OpenAI from "openai";

const app = express();
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.get("/chat", async (req, res) => {
  // SSE requires these headers and an unbuffered connection
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    // Asks Nginx-style reverse proxies not to buffer the stream
    "X-Accel-Buffering": "no",
  });

  const prompt = String(req.query.prompt ?? "");

  try {
    const stream = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    });

    for await (const chunk of stream) {
      // Stop reading from the model once the client has disconnected
      if (req.socket.destroyed) break;
      const token = chunk.choices[0]?.delta?.content;
      if (token) {
        // Each SSE message ends with a double newline
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
      }
    }
    res.write("data: [DONE]\n\n");
  } catch (err) {
    // Surface a structured error event instead of a silent hang
    res.write(`event: error\ndata: ${JSON.stringify({ message: "stream failed" })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3000);

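The \n\n after each message is not optional. SSE uses the blank line as the event delimiter, so if it is missing, the browser does not dispatch the event and keeps folding later data: lines into it. The [DONE] sentinel tells the client when to call close(). Without it, EventSource treats the end of the response as a dropped connection and reconnects, which re-sends the same request.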
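To see why the blank line matters, it helps to look at the stream from the receiving side. The sketch below is a simplified incremental parser, roughly the job EventSource does internally. The name createSseParser is hypothetical, and the code assumes LF line endings and ignores the id: and retry: fields. Network chunks can split an event anywhere, so the parser buffers text and only emits an event once it sees the blank-line delimiter.

```javascript
// Hypothetical, simplified SSE parser: buffers chunks and emits complete events.
function createSseParser(onEvent) {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    let boundary;
    // An event is complete only when a blank line ("\n\n") follows it.
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      let event = "message"; // default event type
      const dataLines = [];
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith("data:")) {
          dataLines.push(line.slice(5).replace(/^ /, "")); // drop one optional leading space
        } else if (line.startsWith("event:")) {
          event = line.slice(6).trim();
        }
        // Lines starting with ":" are comments and are skipped.
      }
      if (dataLines.length > 0) onEvent({ event, data: dataLines.join("\n") });
    }
  };
}
```

A parser like this is also what you would pair with fetch and a streamed response body if EventSource's restrictions (GET only, no custom headers) become a problem.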

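On the client, the native EventSource API does the heavy lifting. Be aware that its automatic reconnection re-issues the same GET request, which for this endpoint would re-run the prompt, so the handlers below close the connection explicitly on completion and on error.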

const source = new EventSource(
  `/chat?prompt=${encodeURIComponent(userInput)}`
);

source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  const { token } = JSON.parse(event.data);
  outputEl.textContent += token; // append each token for the typewriter effect
};

source.onerror = () => {
  // EventSource would otherwise auto-retry and re-send the prompt,
  // so close here and let the user retry explicitly
  source.close();
};

Notice one real limitation: EventSource only issues GET requests and cannot set custom headers, so you cannot attach a Bearer token directly. In practice, teams work around this with a cookie-based session or a short-lived token in the query string. If you want a higher-level abstraction that hides this plumbing, the Vercel AI SDK streaming chat setup wraps SSE for Next.js apps.
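One way to implement the short-lived token workaround is a single-use ticket: the page first calls an authenticated endpoint (where cookies or an Authorization header work normally), receives a ticket, and then opens the EventSource with that ticket in the query string. The sketch below is an assumption-laden illustration, not a hardened implementation. The names createTicketStore, issue, and redeem are hypothetical, the 30-second lifetime is arbitrary, and the in-memory Map only works on a single server instance.

```javascript
// Hypothetical single-use ticket store for authenticating EventSource requests.
function createTicketStore({
  ttlMs = 30_000, // assumed lifetime; tune for your app
  now = Date.now,
  newId = () => globalThis.crypto.randomUUID(), // needs a runtime with a global Web Crypto object
} = {}) {
  const tickets = new Map(); // ticket -> { userId, expiresAt }
  return {
    // Call from an authenticated endpoint, e.g. a hypothetical POST /chat/ticket.
    issue(userId) {
      const ticket = newId();
      tickets.set(ticket, { userId, expiresAt: now() + ttlMs });
      return ticket;
    },
    // Call at the start of the SSE handler; returns the user id or null.
    redeem(ticket) {
      const entry = tickets.get(ticket);
      tickets.delete(ticket); // single use: a leaked URL cannot be replayed
      if (!entry || entry.expiresAt < now()) return null;
      return entry.userId;
    },
  };
}
```

Because the ticket expires quickly and is deleted on first use, its appearance in access logs or browser history is far less risky than a long-lived API key. With more than one server instance, the Map would need to be replaced by a shared store.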

How WebSocket Streaming Works With an LLM

With WebSockets, the client opens a connection once and then sends prompts as messages. The server streams tokens back as discrete frames. This example uses the ws library in Node.

import { WebSocketServer } from "ws";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const wss = new WebSocketServer({ port: 3001 });

wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    let prompt;
    try {
      prompt = JSON.parse(raw.toString()).prompt;
    } catch {
      socket.send(JSON.stringify({ type: "error", message: "invalid payload" }));
      return;
    }

    try {
      const stream = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        stream: true,
      });

      for await (const chunk of stream) {
        // Stop reading from the model once the user has closed the connection
        if (socket.readyState !== socket.OPEN) break;
        const token = chunk.choices[0]?.delta?.content;
        if (token) {
          socket.send(JSON.stringify({ type: "token", token }));
        }
      }
      if (socket.readyState === socket.OPEN) {
        socket.send(JSON.stringify({ type: "done" }));
      }
    } catch {
      if (socket.readyState === socket.OPEN) {
        socket.send(JSON.stringify({ type: "error", message: "stream failed" }));
      }
    }
  });
});

The readyState check matters. Because the connection is persistent, a user can close the tab mid-generation, and sending on a closed socket either throws or silently drops the message, depending on the library and version. The same connection can also carry an interrupt message: the client sends { "type": "cancel" }, and the server aborts the OpenAI stream. The handler above does not implement that yet; it would need to route on the message type and keep a handle to the in-flight request. That interrupt capability is something SSE cannot offer over one channel.
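A minimal sketch of that cancel routing might look like the following. To keep it self-contained, the model call is abstracted behind a hypothetical generate(prompt, signal) async generator, and send is whatever writes a message to the socket. Both names, createSession, and the "cancelled" message type are assumptions for illustration. With the OpenAI SDK, you would likely pass the signal as a request option on the create call, but check that against the SDK version you use.

```javascript
// Hypothetical per-connection session: routes prompt and cancel messages.
function createSession(send, generate) {
  let controller = null; // AbortController for the in-flight generation, if any

  return async function onMessage(raw) {
    let msg;
    try {
      msg = JSON.parse(String(raw));
    } catch {
      send({ type: "error", message: "invalid payload" });
      return;
    }

    if (msg.type === "cancel") {
      controller?.abort(); // no-op when nothing is running
      return;
    }

    controller?.abort(); // a new prompt supersedes any running generation
    const current = (controller = new AbortController());
    try {
      for await (const token of generate(msg.prompt, current.signal)) {
        if (current.signal.aborted) break; // stop forwarding after a cancel
        send({ type: "token", token });
      }
      send({ type: current.signal.aborted ? "cancelled" : "done" });
    } catch {
      // An aborted upstream request usually surfaces as a thrown error.
      send(current.signal.aborted
        ? { type: "cancelled" }
        : { type: "error", message: "stream failed" });
    } finally {
      if (controller === current) controller = null;
    }
  };
}
```

In the server above, you would create one session per connection and register it with socket.on("message", session), where send wraps socket.send with the readyState guard.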

On the client, the WebSocket API is equally direct, but you own the reconnection logic yourself.

const ws = new WebSocket("wss://api.example.com/chat");

ws.onopen = () => ws.send(JSON.stringify({ prompt: userInput }));

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "token") {
    outputEl.textContent += msg.token;
  } else if (msg.type === "done") {
    // ready for the next prompt on the same connection
  }
};
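Since reconnection is your responsibility, a common approach is exponential backoff with jitter so that many clients do not reconnect in lockstep after an outage. The following is a sketch under assumptions: backoffDelay and connectWithRetry are hypothetical names, and the 500 ms base and 15 s cap are arbitrary starting points.

```javascript
// Hypothetical helper: delay before reconnect attempt N (0-based), with jitter.
function backoffDelay(attempt, { baseMs = 500, maxMs = 15_000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // "Equal jitter": half the ceiling is fixed, the other half is randomized.
  return Math.round(ceiling / 2 + random() * (ceiling / 2));
}

// Hypothetical wrapper: reopens the socket after an abnormal close.
function connectWithRetry(url, onMessage, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // a successful connection resets the backoff
  };
  ws.onmessage = onMessage;
  ws.onclose = (event) => {
    if (event.code === 1000) return; // normal closure: do not reconnect
    setTimeout(() => connectWithRetry(url, onMessage, attempt + 1), backoffDelay(attempt));
  };
  return ws;
}
```

Note that this sketch returns only the first socket, so real code would also need a way to hand the current socket back to callers, and to decide whether to re-send a prompt that was interrupted mid-stream.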

Latency, Scaling, and Infrastructure Considerations

For raw time-to-first-token, the two transports are effectively identical. Both deliver tokens as fast as your model and network allow, and the per-message overhead difference is negligible against multi-hundred-millisecond model latency. So latency is rarely the deciding factor.
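To check this on your own stack rather than take it on faith, you can wrap the token stream and record timings. The sketch below assumes the stream is any async iterable of tokens. The name withTimings and the shape of the stats object are hypothetical. The clock starts when the consumer first pulls from the wrapper, so create it immediately before you start iterating.

```javascript
// Hypothetical wrapper: passes tokens through and reports timing when the stream ends.
async function* withTimings(stream, onDone, now = () => performance.now()) {
  const start = now(); // runs on the first pull, not when the wrapper is created
  let firstTokenAt = null;
  let tokens = 0;
  try {
    for await (const token of stream) {
      if (firstTokenAt === null) firstTokenAt = now();
      tokens += 1;
      yield token;
    }
  } finally {
    // Also runs if the consumer stops early, e.g. on cancel or disconnect.
    onDone({
      ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
      totalMs: now() - start,
      tokens,
    });
  }
}
```

Running the same wrapper in front of both an SSE handler and a WebSocket handler gives a like-for-like comparison of server-side time-to-first-token.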

Scaling is where they diverge. An SSE stream is an ordinary HTTP response that lasts about as long as one answer, so it fits your existing autoscaling, load balancers, and stateless server model with little special handling. A WebSocket is a long-lived, stateful connection that pins a user to one server instance for the connection's lifetime. Consequently, you need a higher per-instance connection budget, often sticky sessions (particularly with libraries such as Socket.IO that can fall back to HTTP long-polling), and a shared layer like Redis pub/sub when messages must reach users connected to other instances. Our guide on real-time notifications with Socket.IO and Redis shows that fan-out pattern.

One more operational wrinkle deserves attention. Many reverse proxies and platforms buffer responses by default, which silently defeats SSE by holding tokens until the response ends. You usually fix this by disabling buffering, for example with the X-Accel-Buffering: no header on Nginx. WebSockets sidestep buffering but introduce their own requirement: idle connections need periodic ping frames so proxies do not cut them after a timeout.
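For the WebSocket side, the usual heartbeat is a periodic sweep: ping every connection, and terminate any that did not answer the previous ping. The sketch below follows the isAlive pattern commonly used with the ws library, but it is written against a minimal assumed interface (ping(), terminate(), and an isAlive flag), and the function name sweepConnections is hypothetical.

```javascript
// Hypothetical heartbeat sweep: pings live sockets, terminates unresponsive ones.
// Returns how many sockets were terminated in this pass.
function sweepConnections(sockets) {
  let terminated = 0;
  for (const socket of sockets) {
    if (socket.isAlive === false) {
      // No pong arrived since the last sweep: assume the peer is gone.
      socket.terminate();
      terminated += 1;
      continue;
    }
    socket.isAlive = false; // flipped back to true by the pong handler
    socket.ping();
  }
  return terminated;
}

// Assumed wiring with the ws server from the earlier example:
//   wss.on("connection", (socket) => {
//     socket.isAlive = true;
//     socket.on("pong", () => { socket.isAlive = true; });
//   });
//   setInterval(() => sweepConnections(wss.clients), 30_000);
```

The interval should sit comfortably below the shortest idle timeout of any proxy or load balancer in the path. The 30 seconds shown here is an assumption, not a recommendation for every setup.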

When to Use SSE for Streaming LLM Responses

  • You are building a standard text chat where the user sends one prompt and reads one streamed answer
  • You want to reuse existing HTTP infrastructure, autoscaling, and stateless servers without new ops work
  • You value built-in automatic reconnection and the simplest possible client code
  • Your LLM provider already returns SSE, so you can proxy or transform it with minimal translation
  • You need streaming to pass cleanly through CDNs and corporate proxies

When to Use WebSockets for Streaming LLM Responses

  • Users must interrupt or cancel generation over the same channel they receive tokens on
  • You are streaming audio for a voice agent or any genuinely bidirectional, low-latency exchange
  • Your app already runs a WebSocket layer for presence, collaboration, or live updates
  • You want to push tool-call progress, status events, and tokens through one multiplexed connection
  • You need to send frequent client-to-server messages, not just the occasional new prompt

Common Mistakes With LLM Streaming

  • Forgetting the double newline (\n\n) between SSE messages, so events are merged together or never dispatched by the browser
  • Letting a proxy or platform buffer the response, so tokens arrive all at once instead of incrementally
  • Writing tokens to a WebSocket without checking readyState, which can throw or silently drop messages when a user leaves
  • Choosing WebSockets for a plain chat app and inheriting sticky sessions and Redis fan-out you never needed
  • Skipping a [DONE] sentinel or done event, leaving clients unsure whether the answer finished and, with EventSource, triggering a reconnect that re-sends the prompt
  • Stuffing a long-lived API key into a query string because EventSource cannot set headers, instead of using a short-lived session token

Conclusion

For most apps, streaming LLM responses over SSE is the right default. It matches how the model providers already stream, reuses your HTTP stack, and ships with reconnection handled for you. Reach for WebSockets when your app is genuinely interactive, such as voice agents, mid-stream interrupts, or dense bidirectional traffic, where the extra operational cost buys real capability.

Start by shipping SSE, measure your time-to-first-token, and only migrate to WebSockets when a concrete feature demands two-way communication. To keep building, explore the broader transport trade-offs in our guide on real-time APIs with WebSockets and Server-Sent Events, then wire your stream to a fast backend with the Groq API for the fastest LLM inference.
