WebSocket overview

WS /v1/tts/ws is the recommended transport for interactive applications such as voice agents and dialog systems, where text arrives incrementally over the lifetime of a session. A single WebSocket connection multiplexes multiple contexts. Each context represents one continuous utterance with a chosen voice and format. Many contexts can be open on one connection at once; the server tags every frame with its context_id so you can route audio back to the right player.

┌────────────────────────────────┐
│            client              │
└──────────────┬─────────────────┘
               │  WS (one connection, x-api-key header)
┌──────────────▼─────────────────┐
│   ┌─ context "ctx-greet"      │
│   ├─ context "ctx-confirm"    │
│   └─ context "ctx-followup"   │
│                                │
│       Kova TTS WS endpoint     │
└────────────────────────────────┘
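The routing by context_id described above can be sketched as follows. This is a minimal illustration, not the documented frame format: the JSON field names "type", "context_id", and "audio" are assumptions made for the example, so check the frame reference for the real shapes.

```python
import base64
import json
from collections import defaultdict


def route_frames(raw_frames):
    """Group decoded audio bytes by the context that produced them.

    Assumed (hypothetical) frame shape:
    {"type": "audio_chunk", "context_id": "...", "audio": "<base64>"}
    """
    audio_by_context = defaultdict(bytearray)
    for raw in raw_frames:
        frame = json.loads(raw)
        if frame.get("type") != "audio_chunk":
            continue  # non-audio frames are ignored in this sketch
        audio_by_context[frame["context_id"]] += base64.b64decode(frame["audio"])
    return {ctx: bytes(buf) for ctx, buf in audio_by_context.items()}
```

In a real client each context would feed its own player or buffer as frames arrive; collecting everything into a dict just keeps the example small.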

When to use WebSocket vs streaming HTTP

Need	Use
One utterance, low-latency playback	Streaming HTTP (simpler, fewer moving parts)
Many utterances over a session, want connection reuse	WebSocket
Multiple parallel utterances (e.g. background music + dialog)	WebSocket with multiple contexts
Browser client	Streaming HTTP (the browser WebSocket API cannot set custom headers)

Authentication

The x-api-key header is sent on the WebSocket handshake (not as the first frame after connect):

GET /v1/tts/ws HTTP/1.1
Upgrade: websocket
Host: api.kova.ai
x-api-key: kova_sk_...

The browser WebSocket constructor accepts only a URL and optional subprotocols, so it cannot send a custom x-api-key header. Use Streaming HTTP for browser clients, or proxy the WebSocket through a backend (for example, Node) that sets the header.
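To make the placement of the key concrete, here is a sketch of the full upgrade request a non-browser client would send. The Connection, Sec-WebSocket-Key, and Sec-WebSocket-Version headers come from the WebSocket protocol (RFC 6455); the helper name build_handshake_request is hypothetical. In practice a WebSocket client library builds this request for you and only needs the extra header passed in.

```python
import base64
import os


def build_handshake_request(api_key, host="api.kova.ai", path="/v1/tts/ws"):
    """Return the HTTP upgrade request text, with x-api-key as a handshake header."""
    # Sec-WebSocket-Key is 16 random bytes, base64-encoded (RFC 6455).
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        f"x-api-key: {api_key}",  # sent before any WebSocket frame exists
    ]
    return "\r\n".join(lines) + "\r\n\r\n"
```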

Connection lifecycle

Connect

Open a WebSocket to wss://api.kova.ai/v1/tts/ws with x-api-key in the handshake.

Start a context

Send a start_context frame with the voice and model and, optionally, timestamps and response_format. The server replies with context_started.

Send text

Send one or more send_text frames. The server generates audio incrementally and emits audio_chunk frames (plus timestamps if enabled).

Flush

Send a flush frame to mark the end of the current utterance. The server finishes any in-progress generation and emits flush_completed.

Close the context

Send close_context to release server resources for that context. The server emits context_closed. The WebSocket itself stays open.

Disconnect

Close the WebSocket when you’re done, or keep it open for further contexts.
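The client side of the lifecycle above can be sketched as an ordered list of frames. The frame type names (start_context, send_text, flush, close_context) and the server events come from the steps above; the JSON field names "type", "context_id", "voice", "model", and "text" are assumptions for illustration, and utterance_frames is a hypothetical helper.

```python
import json

# Server event expected in reply to each client frame, per the steps above.
EXPECTED_REPLY = {
    "start_context": "context_started",
    "flush": "flush_completed",
    "close_context": "context_closed",
}


def utterance_frames(context_id, voice, model, text_pieces):
    """Return the JSON client frames for one utterance, in lifecycle order."""
    frames = [{"type": "start_context", "context_id": context_id,
               "voice": voice, "model": model}]
    # Text may arrive incrementally, so each piece gets its own send_text frame.
    frames += [{"type": "send_text", "context_id": context_id, "text": piece}
               for piece in text_pieces]
    frames.append({"type": "flush", "context_id": context_id})
    frames.append({"type": "close_context", "context_id": context_id})
    return [json.dumps(frame) for frame in frames]
```

For example, utterance_frames("ctx-greet", "example-voice", "example-model", ["Hello, ", "world."]) yields five frames: one start_context, two send_text, one flush, and one close_context. The voice and model values here are placeholders. The sketch does not model waiting for server replies; a real client would send each frame over the open connection and read audio_chunk frames as they arrive.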

Audio format

WebSocket audio_chunk values are base64-encoded little-endian 16-bit PCM at 32 kHz mono by default. You can override this per context by passing response_format to start_context; it has the same shape as in the HTTP request.
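As a sketch of handling the default format, the snippet below decodes base64 chunks and wraps the raw PCM in a WAV container using only the standard library. It assumes you have already extracted the base64 strings from the audio_chunk frames; the helper names are hypothetical. At 32,000 samples per second, 2 bytes per sample, and 1 channel, one second of audio is 64,000 bytes.

```python
import base64
import io
import wave

SAMPLE_RATE_HZ = 32_000   # default rate stated above
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def chunks_to_wav(b64_chunks):
    """Decode base64 PCM chunks and return the bytes of a WAV file."""
    pcm = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # WAV PCM is little-endian too, so no byte swap
    return buf.getvalue()


def duration_seconds(pcm_byte_count):
    """Playback duration of raw PCM in the default format."""
    return pcm_byte_count / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

If a context overrides response_format, the constants above no longer apply and would need to match the format you requested.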

Next

Frame reference: every client and server frame, with shapes.
Example: streaming a long document (a concrete end-to-end walkthrough).
