Text to speech (TTS)


What you will achieve

Generate audio bytes for the input 'hello' and confirm non-empty audio is returned — same generateAudio() call for OpenAI and Google (Anthropic has no TTS API).

When and why you need this

Text-to-speech converts a string into a spoken audio file for playback, accessibility, voice assistants, or audio content pipelines.

The challenge with raw provider SDKs:

OpenAI: you call client.audio.speech.create({ model, input, voice, response_format }) and get back a binary stream that you must buffer and write to disk yourself.
Google Gemini TTS has no dedicated TTS endpoint: you call generateContent with responseModalities: ['AUDIO'] and a speechConfig voice config, then extract raw PCM bytes from inlineData. The PCM data has no container (no WAV header), so you must wrap it in a WAV container yourself before it is playable.

createMediaOutput().generateAudio() handles both paths, wraps the Google PCM output in a WAV container automatically, and writes the bytes to dir.

Step by step

Step 1 — Create a media handle

import { createMediaOutput } from '@combycode/llm-sdk';

const media = createMediaOutput({
  model: 'openai/tts-1',
  apiKey: process.env.OPENAI_API_KEY,
  dir: './.media-out',
});

dir is required in Node/Bun. The SDK creates the directory if it does not exist. In the browser use store: new MemoryMediaStore() instead.

Step 2 — Generate audio

const audio = await media.generateAudio({
  input: 'Hello, world.',
  params: { voice: 'alloy', format: 'wav' },
});

console.log(`saved ${audio.meta.size} bytes, id: ${audio.id}`);
// audio.mimeType -> 'audio/wav'
// audio.meta.provider -> 'openai'

generateAudio() returns a single MediaResult (not an array). The audio bytes are written to dir.
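
The "confirm non-empty audio" check from the goal of this article can be sketched as a small guard. This is a minimal sketch: AudioResultLike and assertNonEmptyAudio are hypothetical names, and the interface lists only the fields this article documents, so the SDK's real MediaResult type may differ.

```typescript
// Hypothetical shape: only the fields documented in this article.
interface AudioResultLike {
  id: string;
  type: string;
  mimeType: string;
  meta: { size: number; durationMs?: number; sampleRate?: number };
}

// Hypothetical helper, not an SDK export: throws if the result does not
// look like a non-empty audio result.
function assertNonEmptyAudio(result: AudioResultLike): void {
  if (result.type !== 'audio') {
    throw new Error(`expected type "audio", got "${result.type}"`);
  }
  if (result.mimeType.indexOf('audio/') !== 0) {
    throw new Error(`unexpected MIME type "${result.mimeType}"`);
  }
  if (!(result.meta.size > 0)) {
    throw new Error('audio result is empty');
  }
}
```

Calling assertNonEmptyAudio(audio) right after generateAudio() would fail fast on an empty result instead of letting a zero-byte file reach playback.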

Step 3 — Use a voice alias

const audio = await media.generateAudio({
  input: 'Good morning.',
  params: { voice: 'warm' },  // alias -> 'coral' on OpenAI, 'Aoede' on Google
});

The SDK’s voice alias system maps four unified names to provider-specific voice ids:

Alias	OpenAI voice	Google voice
`'neutral'`	`'alloy'`	`'Kore'`
`'warm'`	`'coral'`	`'Aoede'`
`'bright'`	`'shimmer'`	`'Zephyr'`
`'deep'`	`'echo'`	`'Charon'`

Any string not in the alias table is passed through verbatim — raw provider voice ids always work.
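
The lookup-then-passthrough behaviour described above can be sketched as follows. The SDK's internal implementation is not shown in this article, so resolveVoice and VOICE_ALIASES are hypothetical names; only the alias values come from the table.

```typescript
type Provider = 'openai' | 'google';

// Alias values copied from the table in this article.
const VOICE_ALIASES: Record<string, Record<Provider, string>> = {
  neutral: { openai: 'alloy', google: 'Kore' },
  warm: { openai: 'coral', google: 'Aoede' },
  bright: { openai: 'shimmer', google: 'Zephyr' },
  deep: { openai: 'echo', google: 'Charon' },
};

// Hypothetical helper, not an SDK export.
function resolveVoice(voice: string, provider: Provider): string {
  // Own-property check so inherited keys such as 'constructor' are never
  // mistaken for aliases.
  const isAlias = Object.prototype.hasOwnProperty.call(VOICE_ALIASES, voice);
  // Anything that is not an alias is passed through verbatim.
  return isAlias ? VOICE_ALIASES[voice][provider] : voice;
}
```

For example, resolveVoice('warm', 'google') yields 'Aoede', while a raw id such as 'Puck' comes back unchanged.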

Step 4 — Switch to Google Gemini TTS

const googleMedia = createMediaOutput({
  model: 'google/gemini-2.5-flash-preview-tts',
  apiKey: process.env.GOOGLE_API_KEY,
  dir: './.media-out',
});

const audio = await googleMedia.generateAudio({
  input: 'Bienvenido.',
  params: { voice: 'Kore' },
});
// audio.mimeType -> 'audio/wav' (PCM wrapped in WAV by the adapter)

Google Gemini TTS returns raw 16-bit PCM at 24000 Hz. The Google adapter wraps it in a WAV container via ensurePlayableAudio() before saving, so audio.mimeType is always 'audio/wav' and the file is directly playable.
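
For readers curious what "wrapping PCM in a WAV container" involves, here is a minimal sketch of writing the standard 44-byte RIFF/WAVE header in front of the samples. It is not the SDK's ensurePlayableAudio() code, which this article does not show; pcmToWav is a hypothetical name, and mono output is an assumption (the article states only 16-bit PCM at 24000 Hz).

```typescript
// Hypothetical helper: prepend a canonical 44-byte WAV header to raw PCM.
// Defaults assume mono 16-bit audio at 24000 Hz.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16,
): Uint8Array {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  ascii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  ascii(8, 'WAVE');
  ascii(12, 'fmt ');
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  ascii(36, 'data');
  view.setUint32(40, pcm.length, true);     // size of the sample data
  out.set(pcm, 44);
  return out;
}
```

All multi-byte header fields are little-endian, which is why every DataView write passes true as the last argument.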

Step 5 — Control speed and instructions (OpenAI)

const audio = await media.generateAudio({
  input: 'This is a slow, careful reading.',
  params: {
    voice: 'alloy',
    format: 'mp3',
    speed: 0.75,
    instructions: 'Speak slowly and clearly, as if explaining to a child.',
  },
});

speed and instructions are OpenAI-specific. speed defaults to 1.0 (range 0.25-4.0). instructions is a style guide string available on newer TTS models like gpt-4o-mini-tts. Google ignores both.
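
This article does not say whether the SDK validates speed before sending the request, so a caller-side guard may be worthwhile. A minimal sketch, assuming you prefer clamping to the documented range over throwing; clampSpeed is a hypothetical helper.

```typescript
// Range and default as documented in this article.
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;
const DEFAULT_SPEED = 1.0;

// Hypothetical helper: returns a speed inside the documented range.
function clampSpeed(speed?: number): number {
  // Missing or NaN input falls back to the documented default.
  if (speed === undefined || isNaN(speed)) return DEFAULT_SPEED;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}
```

You could then pass params: { speed: clampSpeed(userInput) } so that out-of-range values from a UI slider never reach the provider.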

Your options

generateAudio() parameters (AudioGenRequest.params):

Param	Type	Providers	Description
`voice`	`string`	OpenAI, Google	Voice id or alias. OpenAI: `'alloy'`, `'ash'`, `'coral'`, `'shimmer'`, `'echo'`, `'fable'`, `'onyx'`, `'nova'`, `'sage'`, plus aliases. Google: `'Kore'`, `'Aoede'`, `'Zephyr'`, `'Charon'`, and many more Gemini voice names.
`format`	`string`	OpenAI	Output audio format. See format table below. Google always returns WAV (PCM wrapped).
`speed`	`number`	OpenAI	Playback speed multiplier. Range 0.25-4.0. Default 1.0.
`instructions`	`string`	OpenAI (newer models)	Style/tone instructions for the voice. Available on `gpt-4o-mini-tts` and similar.
`sampleRate`	`number`	Internal	PCM sample rate override for raw output.
`language`	`string`	Google	Language hint (BCP-47).

OpenAI output formats (params.format):

Format	MIME type returned	When to use
`'mp3'` (default)	`audio/mp3`	General use; good compression, wide support.
`'wav'`	`audio/wav`	Lossless; larger file, no encoding artifacts. Good for further audio processing.
`'opus'`	`audio/opus`	Best compression for streaming; requires Opus decoder.
`'flac'`	`audio/flac`	Lossless, compressed. Good for archiving.
`'pcm16'`	`audio/pcm`	Raw signed 16-bit PCM, no container. For real-time pipelines feeding into audio processing code.
`'aac'`	Falls back to `wav`	The SDK does not pass `aac` through and silently uses `wav`, although the OpenAI speech endpoint itself accepts `aac`.
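
If you need to predict audio.mimeType before making the call (for example to pick a file extension), the table above can be encoded as a lookup. This is a sketch of the documented mapping, not the SDK's own code; expectedMimeType is a hypothetical helper, and the result for formats missing from the table is left undefined because the article does not describe that case.

```typescript
// Mapping copied from the table above, including the 'aac' fallback to WAV.
const FORMAT_TO_MIME = new Map<string, string>([
  ['mp3', 'audio/mp3'],
  ['wav', 'audio/wav'],
  ['opus', 'audio/opus'],
  ['flac', 'audio/flac'],
  ['pcm16', 'audio/pcm'],
  ['aac', 'audio/wav'], // documented as silently falling back to wav
]);

// Hypothetical helper: the MIME type the table predicts for an OpenAI
// request. Omitting the format uses the documented default, 'mp3'.
function expectedMimeType(format: string = 'mp3'): string | undefined {
  return FORMAT_TO_MIME.get(format);
}
```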

OpenAI TTS models:

Model	Notes
`tts-1`	Standard quality, lowest latency, cheapest.
`tts-1-hd`	Higher audio quality, higher cost.
`gpt-4o-mini-tts`	Newest; supports `instructions` param; best quality overall.

Google TTS models:

Model	Notes
`gemini-2.5-flash-preview-tts`	Default used by the adapter. Returns PCM wrapped to WAV.

createMediaOutput() options (same as image gen — see article 16 for full table). Key ones for TTS:

Option	Description
`model`	Namespaced model id (`'openai/tts-1'`).
`dir`	Output directory (Node/Bun).
`store`	Custom `MediaStore` for browser or alternative persistence.
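
A namespaced model id is a provider prefix and a model name joined by a slash. If your own code needs both halves (the comparison example in this article only takes the prefix), a stricter split could look like this sketch; splitModelId is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper: split '<provider>/<model>' at the first slash.
function splitModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf('/');
  // Reject ids with no slash, an empty provider, or an empty model name.
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`expected "<provider>/<model>", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```

Splitting at the first slash only means a model name that itself contains a slash would stay intact in the model part.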

MediaResult fields for audio:

Field	Description
`id`	`'aud_<uuid>'` — use to load bytes from the store.
`type`	`'audio'`
`mimeType`	Format-dependent. `'audio/wav'` for WAV and Google output. `'audio/mp3'` for OpenAI MP3, etc.
`meta.size`	Byte count.
`meta.durationMs`	Duration in milliseconds, when reported by the provider.
`meta.sampleRate`	Sample rate in Hz, when reported. Google PCM is 24000 Hz.
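
When meta.durationMs is not reported, the duration of uncompressed PCM can be derived from the byte count. A minimal sketch, assuming mono 16-bit samples and, for WAV files, a canonical 44-byte header; both helper names are hypothetical.

```typescript
// Duration of raw PCM: bytes / (samples per second * bytes per sample frame).
function pcmDurationMs(
  byteLength: number,
  sampleRate: number = 24000,
  channels: number = 1,
  bytesPerSample: number = 2,
): number {
  const bytesPerSecond = sampleRate * channels * bytesPerSample;
  return Math.round((byteLength / bytesPerSecond) * 1000);
}

// Same calculation for a WAV file, assuming a canonical 44-byte header
// with no extra chunks (an assumption; real files can carry more metadata).
function wavDurationMs(fileSize: number, sampleRate: number = 24000): number {
  return pcmDurationMs(Math.max(0, fileSize - 44), sampleRate);
}
```

At 24000 Hz mono 16-bit, one second of audio is 48000 bytes, so a 48044-byte WAV file is roughly one second long. This does not apply to compressed formats such as MP3 or Opus.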

Cost note: OpenAI TTS is priced per character of input text (not per token). tts-1 is approximately $0.015/1K characters; tts-1-hd is $0.030/1K characters. Google Gemini TTS is priced per output token (the audio is internally tokenised). The SDK emits cost via onCostEntry so the cost collector tracks it alongside chat completions.
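
Because OpenAI pricing is per input character, a rough pre-flight estimate is simple arithmetic. The sketch below uses the approximate rates quoted in this note, which may be out of date; estimateOpenAiTtsCostUsd is a hypothetical helper, and counting Unicode code points is an assumption about how characters are billed.

```typescript
// Approximate USD rates per 1,000 input characters, as quoted in this article.
const USD_PER_1K_CHARS = new Map<string, number>([
  ['tts-1', 0.015],
  ['tts-1-hd', 0.03],
]);

// Hypothetical helper: returns undefined for models with no quoted rate.
function estimateOpenAiTtsCostUsd(model: string, input: string): number | undefined {
  const rate = USD_PER_1K_CHARS.get(model);
  if (rate === undefined) return undefined;
  // Array.from counts code points rather than UTF-16 code units.
  const characters = Array.from(input).length;
  return (characters / 1000) * rate;
}
```

For a 2,000-character script this gives about $0.03 on tts-1 and about $0.06 on tts-1-hd. Treat the cost reported through onCostEntry as the authoritative figure.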

Compare the SDKs

import { createMediaOutput } from '@combycode/llm-sdk';

// Unified TTS via the same media handle. (Official: openai audio.speech.create vs
// google generateContent responseModalities:['AUDIO'] + speechConfig.)
const provider = (process.env.LLM_MODEL ?? '').split('/')[0];
const media = createMediaOutput({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  dir: './.media-out',
});

const t0 = performance.now();
const audio = await media.generateAudio({
  input: 'hello',
  params: { voice: provider === 'google' ? 'Kore' : 'alloy', format: 'wav' },
});
console.log(JSON.stringify({ result: String(audio?.meta.size ?? 0), ms: Math.round(performance.now() - t0) }));

OpenAI’s SDK returns a Response object that you must stream to disk manually (response.body.pipe(fs.createWriteStream(...))). Google has no dedicated TTS method; you call client.models.generateContent(), extract candidates[0].content.parts[0].inlineData.data (raw base64 PCM), decode it, and construct a WAV file yourself. With ORXA, you call generateAudio() once and get back a MediaResult with the file already saved and playable.
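
To make the manual Google path concrete, here is a sketch of the extraction and base64 decoding step on its own. The interface covers only the fields named above and is an assumption about the response shape, not the official type; extractPcm is a hypothetical helper. It relies on the global atob, which is available in browsers and in recent Node, Bun, and Deno releases.

```typescript
// Assumed minimal shape: only the fields named in this article.
interface InlineAudioResponse {
  candidates: Array<{
    content: { parts: Array<{ inlineData?: { data: string; mimeType?: string } }> };
  }>;
}

// Hypothetical helper: pull the base64 payload out and decode it to bytes.
function extractPcm(response: InlineAudioResponse): Uint8Array {
  const candidate = response.candidates[0];
  const part = candidate ? candidate.content.parts[0] : undefined;
  const b64 = part && part.inlineData ? part.inlineData.data : undefined;
  if (!b64) throw new Error('response contains no inline audio data');

  // atob yields a binary string with one character per byte.
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}
```

The returned bytes are still headerless PCM; they need a WAV header before most players will accept them.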

Gotchas and next steps

Google returns raw PCM, not a WAV file. The Google adapter wraps the PCM in a WAV container via ensurePlayableAudio(). If you load the saved bytes from the store and inspect them, they will have a RIFF/WAVE header. This is intentional: the raw PCM returned by Gemini (audio/l16; rate=...) is not playable by most audio players.
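
A quick way to tell wrapped output from headerless PCM is to check for the RIFF and WAVE tags at the start of the bytes. A minimal sketch; isWav is a hypothetical helper, and how you obtain the bytes from your store is left out.

```typescript
// Hypothetical helper: true when the bytes start with a RIFF/WAVE header.
function isWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  const tag = (offset: number): string =>
    String.fromCharCode(bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3]);
  // Bytes 0-3 hold 'RIFF', bytes 4-7 the chunk size, bytes 8-11 'WAVE'.
  return tag(0) === 'RIFF' && tag(8) === 'WAVE';
}
```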

format is OpenAI-only. Google TTS always returns WAV regardless of what you pass in params.format. If you need a different format from Google, transcode the WAV output with a tool such as ffmpeg.

instructions requires a compatible model. gpt-4o-mini-tts supports it; tts-1 and tts-1-hd do not. The API ignores unsupported parameters silently — test with your chosen model.
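
Since unsupported parameters are ignored silently, one option is to strip instructions on the caller side so that your logged request matches what the model actually honours. A sketch based only on the model support stated in this article; the names are hypothetical and the support list will need updating as models change.

```typescript
// Models this article documents as supporting `instructions`.
const INSTRUCTION_MODELS = ['gpt-4o-mini-tts'];

interface TtsParams {
  voice?: string;
  format?: string;
  speed?: number;
  instructions?: string;
}

// Hypothetical helper: returns params without `instructions` when the model
// is not known to support it. Accepts bare or namespaced model ids.
function dropUnsupportedInstructions(model: string, params: TtsParams): TtsParams {
  // indexOf returns -1 when there is no slash, so slice(0) keeps a bare id whole.
  const bareModel = model.slice(model.indexOf('/') + 1);
  if (params.instructions === undefined || INSTRUCTION_MODELS.indexOf(bareModel) !== -1) {
    return params;
  }
  const copy: TtsParams = { ...params }; // leave the caller's object untouched
  delete copy.instructions;
  return copy;
}
```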

Voice ids are provider-specific beyond aliases. OpenAI has nine standard voices; Google has many more Gemini voice names (e.g. 'Puck', 'Orus', 'Fenrir'). The four aliases (neutral/warm/bright/deep) map to one sensible voice per provider. For full creative control use the raw provider voice id string.

TTS output is not streamed. generateAudio() waits for the complete audio before returning. For streaming TTS (word-by-word audio), use the Realtime API.

Next steps:

Speech to text (STT) — transcribe audio back to text
Audio input — send audio to a multimodal model
Realtime — live bidirectional streaming audio