POST /v1/tts/stream
Streaming text to speech
curl --request POST \
  --url https://api.kova.ai/v1/tts/stream \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "text": "<string>",
  "voice": "<string>",
  "normalize_text": false,
  "response_format": {
    "bitrate": "<string>",
    "encoding": "mp3",
    "sample_rate": 123
  },
  "temperature": 123,
  "timestamps": false
}
'
{
  "detail": [
    {
      "loc": [
        "<string>"
      ],
      "msg": "<string>",
      "type": "<string>",
      "ctx": {},
      "input": "<unknown>"
    }
  ]
}

POST /v1/tts/stream returns a text/plain body of SSE-style records: one JSON object per data: line, with records separated by blank lines. The request body is identical to Text to speech. Use streaming when you want to start playback before generation finishes.
The default response encoding is mp3, not raw PCM. Each audio_chunk value is base64-encoded audio in your chosen response_format; base64-decode each chunk and concatenate the decoded bytes in arrival order to reconstruct the file.

Event stream format

The response body is a sequence of records:
data: {"type":"audio","audio_chunk":"<base64>"}

data: {"type":"audio","audio_chunk":"<base64>"}

data: {"type":"timestamps","words":["hello"],"start_seconds":[0.0],"end_seconds":[0.3]}

data: {"type":"audio","audio_chunk":"<base64>"}
Each record:
  1. Starts with "data: " (note the space after the colon).
  2. Contains exactly one JSON object.
  3. Ends with \n\n (two newlines).
There is no terminating data: [DONE] record; the stream ends when the HTTP body ends.
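The record rules above can be sketched as a small incremental parser. This is a minimal illustration, not part of the Kova SDK: iter_events and _parse_record are hypothetical names, and the sketch assumes LF line endings and that the body arrives as byte chunks split at arbitrary points.

```python
import json
from typing import Iterable, Iterator, Optional


def _parse_record(record: bytes) -> Optional[dict]:
    """Parse one `data: {...}` record; return None for empty or unrelated input."""
    line = record.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def iter_events(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one JSON object per record from raw response-body chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A record is complete once its blank-line terminator has arrived.
        while b"\n\n" in buffer:
            record, buffer = buffer.split(b"\n\n", 1)
            event = _parse_record(record)
            if event is not None:
                yield event
    # There is no [DONE] marker, so flush whatever remains when the body ends.
    event = _parse_record(buffer)
    if event is not None:
        yield event
```

Buffering until the \n\n terminator means a record that is split across two network reads is still parsed as a whole.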

Event types

audio

type AudioEvent = {
  type: "audio";
  audio_chunk: string;  // base64-encoded audio bytes in your response_format
};
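As a rough sketch of how audio events might be reassembled without the SDK, the helper below decodes each audio_chunk separately and joins the bytes in arrival order. collect_audio is a hypothetical name, and the events are assumed to be already parsed into plain dicts with the fields shown above.

```python
import base64
from typing import Iterable


def collect_audio(events: Iterable[dict]) -> bytes:
    """Decode every audio event and join the raw bytes in arrival order."""
    audio = bytearray()
    for event in events:
        if event.get("type") != "audio":
            continue  # other event types (e.g. timestamps) carry no audio
        # Decode chunk by chunk: a chunk may end in base64 padding, so joining
        # the base64 strings first and decoding once is not safe.
        audio.extend(base64.b64decode(event["audio_chunk"]))
    return bytes(audio)
```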

timestamps (only when timestamps: true)

type TimestampsEvent = {
  type: "timestamps";
  words: string[];
  start_seconds: number[];
  end_seconds: number[];
};
The three arrays are parallel — words[i] starts at start_seconds[i] and ends at end_seconds[i].
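Because the arrays are parallel, they can be zipped into per-word records. The sketch below is illustrative only: WordTiming and word_timings are hypothetical names, and the event is assumed to be a parsed dict.

```python
from typing import List, NamedTuple


class WordTiming(NamedTuple):
    word: str
    start: float  # seconds
    end: float    # seconds


def word_timings(event: dict) -> List[WordTiming]:
    """Pair words[i] with start_seconds[i] and end_seconds[i]."""
    words = event["words"]
    starts = event["start_seconds"]
    ends = event["end_seconds"]
    # zip() silently truncates to the shortest input, so check lengths first.
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("timestamps arrays differ in length")
    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]
```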

Examples

import asyncio
import os

from kova_tts import KovaTTSClient, AudioResponseFormat


async def main():
    # Read the API key from the environment rather than hard-coding it.
    client = KovaTTSClient(api_key=os.environ["KOVA_API_KEY"])
    audio_bytes = bytearray()

    # Events are yielded as they arrive; audio and timestamps events are interleaved.
    async for event in client.stream_tts(
        text="This is streaming TTS from Kova.",
        voice="cal",
        response_format=AudioResponseFormat(encoding="mp3"),
        timestamps=True,
    ):
        if event.type == "audio":
            # Append each chunk in arrival order to rebuild the full MP3.
            audio_bytes.extend(event.audio)
            print(f"audio: {len(event.audio)} bytes")
        elif event.type == "timestamps":
            print(f"words: {event.words}")

    # Write the assembled audio once the stream has ended.
    with open("stream.mp3", "wb") as f:
        f.write(audio_bytes)


asyncio.run(main())
linear16 streaming caveat: with encoding: "linear16", every chunk starts with its own WAV header, so the concatenated chunks do not form a single clean WAV stream. If you need raw streaming PCM, use encoding: "pcm" and assemble your own header at the end.
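One way to assemble the header at the end is Python's standard wave module. The sketch below is not taken from the Kova SDK: pcm_to_wav is a hypothetical helper, and it assumes the "pcm" encoding delivers 16-bit signed little-endian mono samples. The sample_rate argument must match the rate you requested in response_format.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a single WAV header.

    Assumption: 16-bit (sample_width=2) mono PCM unless told otherwise.
    """
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # The header sizes are filled in when the writer is closed.
        wav.writeframes(pcm)
    return out.getvalue()
```

Collect the decoded PCM chunks first, then call pcm_to_wav once on the joined bytes, so the file carries exactly one header.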

Authorizations

x-api-key (string, header, required)

Body (application/json)

text (string, required)
voice (string, required)
normalize_text (boolean, default: false)
response_format (AudioResponseFormat object)
temperature (number | null)
timestamps (boolean, default: false)

Response

Successful Response