This walkthrough shows how to feed sentences from a long document into a single WebSocket context, receive PCM audio chunks as they’re generated, and write the result to a single WAV file at the end.

Setup

Install the SDK and set KOVA_API_KEY in your environment.
pip install kova-tts
export KOVA_API_KEY=kova_sk_...

Example

import asyncio, os
from kova_tts import KovaTTSClient, AudioResponseFormat

DOCUMENT = """
Welcome to Kova.
This is the second sentence.
And here is one more, just to round things out.
""".strip().splitlines()


async def main():
    client = KovaTTSClient(api_key=os.environ["KOVA_API_KEY"])
    sample_rate = 32000
    pcm_chunks: list[bytes] = []

    async with client.websocket() as ws:
        await ws.start_context(
            context_id="doc",
            voice_id="cal",
            model_id="default",
            response_format=AudioResponseFormat(encoding="pcm", sample_rate=sample_rate),
        )

        for line in DOCUMENT:
            await ws.send_text(line + " ", context_id="doc")

        await ws.flush(context_id="doc", flush_id="end")

        async for frame in ws:
            if frame.type == "audio":
                pcm_chunks.append(frame.audio)
            elif frame.type == "flush_completed" and frame.flush_id == "end":
                break

        await ws.close_context(context_id="doc")

    # The SDK does not currently expose a WAV-writing helper.
    # Use stdlib `wave` to write a 16-bit mono WAV.
    import wave
    with wave.open("doc.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)         # 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(pcm_chunks))
    print(f"wrote doc.wav ({sum(len(c) for c in pcm_chunks)} pcm bytes)")


asyncio.run(main())

What’s happening

  1. Open one context for the whole document. Reusing a context means the voice and format stay consistent across the entire output, with no warm-up overhead between sentences.
  2. Send each sentence as a separate send_text frame with trailing whitespace so the model treats them as sequential prose, not concatenated tokens.
  3. Audio streams back incrementally: by the time you’ve sent the third sentence, audio for the first is already arriving.
  4. Calling flush with a sentinel flush_id tells you when the server has finished generating the last sentence: wait for the flush_completed frame whose flush_id matches.
  5. Assemble the PCM chunks and write a single WAV file at the end. Python uses the stdlib wave module; JS uses the exported pcm16ToWavBytes primitive with node:fs. The SDKs don’t currently ship a higher-level writePcm16WavFile helper. If one is added later, prefer it.
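
The example above sends every sentence and the flush before it starts reading frames, which is fine for a short document. For long inputs you may prefer to send and receive concurrently so audio is consumed as it arrives (step 3). The sketch below shows that pattern with asyncio.gather. FakeSocket and its next_frame method are hypothetical stand-ins so the snippet runs on its own; they are not part of the SDK, and whether the real socket supports concurrent sending and iteration is an assumption to verify.

```python
import asyncio


class FakeSocket:
    """Hypothetical stand-in for the SDK socket: each sent line comes back as one audio frame."""

    def __init__(self):
        self._frames = asyncio.Queue()

    async def send_text(self, text, context_id):
        # Pretend the server synthesizes audio immediately.
        await self._frames.put(("audio", text.encode()))

    async def flush(self, context_id, flush_id):
        await self._frames.put(("flush_completed", flush_id))

    async def next_frame(self):
        return await self._frames.get()


async def send_all(ws, lines):
    # Producer: push every sentence, then mark the end with a sentinel flush.
    for line in lines:
        await ws.send_text(line + " ", context_id="doc")
    await ws.flush(context_id="doc", flush_id="end")


async def receive_until_flushed(ws, flush_id):
    # Consumer: collect audio until the matching flush_completed arrives.
    chunks = []
    while True:
        kind, payload = await ws.next_frame()
        if kind == "audio":
            chunks.append(payload)
        elif kind == "flush_completed" and payload == flush_id:
            return chunks


async def stream(lines):
    ws = FakeSocket()
    # Run producer and consumer side by side on the same socket.
    _, chunks = await asyncio.gather(
        send_all(ws, lines), receive_until_flushed(ws, "end")
    )
    return chunks
```

With the real client, the consumer body would be the `async for frame in ws` loop from the example, and the producer would be the `send_text` loop followed by `flush`.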

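For step 5, it can be handy to build the WAV in memory (for example, to return it from a web handler) and to sanity-check the audio duration before writing anything. The sketch below uses only the stdlib; the function names are illustrative and not part of the SDK, and it assumes 16-bit mono PCM as in the example.

```python
import io
import wave


def pcm16_mono_to_wav_bytes(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def pcm16_mono_duration_seconds(pcm: bytes, sample_rate: int) -> float:
    """Duration = bytes / (bytes per sample * samples per second)."""
    return len(pcm) / (2 * sample_rate)
```

For instance, `pcm16_mono_duration_seconds(b"".join(pcm_chunks), sample_rate)` gives a quick check that the output length is plausible for the text you sent.
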
Variations

  • Multiple parallel contexts: open ctx-narration and ctx-sfx concurrently with different voices. Demultiplex audio frames by context_id on the client.
  • Different output format: swap encoding="pcm" for encoding="mp3" and skip the WAV-header step; write the received bytes to the output file as they are.
  • Without timestamps: omit timestamps: true from start_context to skip word-timing frames.
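
As a rough illustration of the parallel-contexts variation, the sketch below groups audio payloads by context_id. The Frame dataclass is a hypothetical stand-in: it assumes each frame exposes type, context_id, and audio fields, which should be checked against the SDK's actual frame objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Frame:
    # Hypothetical frame shape, for illustration only.
    type: str
    context_id: str
    audio: bytes = b""


def demux_audio(frames):
    """Group audio payloads by context_id, preserving arrival order within each context."""
    by_context = defaultdict(list)
    for frame in frames:
        if frame.type == "audio":
            by_context[frame.context_id].append(frame.audio)
    # Join each context's chunks into one contiguous PCM buffer.
    return {ctx: b"".join(chunks) for ctx, chunks in by_context.items()}
```

Each resulting buffer can then be written to its own WAV file, one per context.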