Text to speech (TTS)
What you will achieve
Generate audio bytes for the input 'hello' and confirm that non-empty audio is returned, using the same `generateAudio()` call for OpenAI and Google (Anthropic has no TTS API).
When and why you need this
Text-to-speech converts a string into a spoken audio file for playback, accessibility, voice assistants, or audio content pipelines.
The challenge with raw provider SDKs:
- OpenAI: you call `client.audio.speech.create({ model, input, voice, response_format })`, which returns a binary stream that you must buffer and write to disk yourself.
- Google Gemini: there is no dedicated TTS endpoint. You call `generateContent` with `responseModalities: ['AUDIO']` and a `speechConfig` voice config, then extract raw PCM bytes from `inlineData`. The PCM data has no container (no WAV header); you must wrap it in a WAV container yourself before it is playable.
`createMediaOutput().generateAudio()` handles both paths, wraps the Google PCM output in a WAV container automatically, and writes the bytes to `dir`.
Step by step
Step 1: Create a media handle

    import { createMediaOutput } from '@combycode/llm-sdk';

    // One handle per model; audio files are written under `dir`.
    const media = createMediaOutput({
      model: 'openai/tts-1',
      apiKey: process.env.OPENAI_API_KEY,
      dir: './.media-out',
    });

`dir` is required in Node/Bun. The SDK creates the directory if it does not exist. In the browser, use `store: new MemoryMediaStore()` instead.
Step 2: Generate audio

    const audio = await media.generateAudio({
      input: 'Hello, world.',
      params: { voice: 'alloy', format: 'wav' },
    });

    console.log(`saved ${audio.meta.size} bytes, id: ${audio.id}`);
    // audio.mimeType -> 'audio/wav'
    // audio.meta.provider -> 'openai'

`generateAudio()` returns a single `MediaResult` (not an array). The audio bytes are written to `dir`.
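One way to confirm the goal stated at the top of this article (non-empty audio comes back) is a small guard like the sketch below. `AudioResultLike` and `assertNonEmptyAudio` are hypothetical names introduced here for illustration; they model only the `MediaResult` fields documented on this page and are not part of the SDK.

```typescript
// Hypothetical shape: only the MediaResult fields this page documents.
interface AudioResultLike {
  id: string;
  mimeType: string;
  meta: { size: number };
}

// Throws if the result does not look like a non-empty audio file.
function assertNonEmptyAudio(result: AudioResultLike): void {
  if (!result.mimeType.startsWith("audio/")) {
    throw new Error(`expected an audio MIME type, got ${result.mimeType}`);
  }
  if (!(result.meta.size > 0)) {
    throw new Error(`audio ${result.id} is empty`);
  }
}
```

In a smoke test you would pass the value returned by `generateAudio()` straight into this guard.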
Step 3: Use a voice alias

    const audio = await media.generateAudio({
      input: 'Good morning.',
      params: { voice: 'warm' }, // alias -> 'coral' on OpenAI, 'Aoede' on Google
    });

The SDK’s voice alias system maps four unified names to provider-specific voice ids:
| Alias | OpenAI voice | Google voice |
|---|---|---|
| `'neutral'` | `'alloy'` | `'Kore'` |
| `'warm'` | `'coral'` | `'Aoede'` |
| `'bright'` | `'shimmer'` | `'Zephyr'` |
| `'deep'` | `'echo'` | `'Charon'` |
Any string not in the alias table is passed through verbatim, so raw provider voice ids always work.
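The alias lookup with verbatim passthrough can be pictured as a plain table lookup. The sketch below is an illustration built from the table above, not the SDK's actual code; `resolveVoice`, `Provider`, and `VOICE_ALIASES` are hypothetical names.

```typescript
type Provider = "openai" | "google";

// Alias table copied from the documentation above.
const VOICE_ALIASES = new Map<string, Record<Provider, string>>([
  ["neutral", { openai: "alloy", google: "Kore" }],
  ["warm", { openai: "coral", google: "Aoede" }],
  ["bright", { openai: "shimmer", google: "Zephyr" }],
  ["deep", { openai: "echo", google: "Charon" }],
]);

// Returns the provider-specific voice id for an alias,
// or the input unchanged when it is not a known alias.
function resolveVoice(voice: string, provider: Provider): string {
  const entry = VOICE_ALIASES.get(voice);
  return entry ? entry[provider] : voice;
}
```

Under this reading, `resolveVoice('warm', 'google')` yields `'Aoede'`, while a raw id such as `'Puck'` is returned untouched.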
Step 4: Switch to Google Gemini TTS

    const googleMedia = createMediaOutput({
      model: 'google/gemini-2.5-flash-preview-tts',
      apiKey: process.env.GOOGLE_API_KEY,
      dir: './.media-out',
    });

    const audio = await googleMedia.generateAudio({
      input: 'Bienvenido.',
      params: { voice: 'Kore' },
    });
    // audio.mimeType -> 'audio/wav' (PCM wrapped in WAV by the adapter)

Google Gemini TTS returns raw 16-bit PCM at 24000 Hz. The Google adapter wraps it in a WAV container via `ensurePlayableAudio()` before saving, so `audio.mimeType` is always `'audio/wav'` and the file is directly playable.
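To make the PCM-to-WAV step concrete, here is a minimal sketch of prepending a canonical 44-byte RIFF/WAVE header to raw PCM. It is not the source of `ensurePlayableAudio()`; `wrapPcmInWav` is a hypothetical name, and the mono default is an assumption (the text above only states 16-bit samples at 24000 Hz).

```typescript
// Wraps raw little-endian PCM in a canonical 44-byte WAV header.
// Defaults assume 24000 Hz, 16-bit, mono (mono is an assumption).
function wrapPcmInWav(
  pcm: Uint8Array,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16,
): Uint8Array {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // size of the PCM payload
  out.set(pcm, 44);
  return out;
}
```

The header carries the sample rate, channel count, and bit depth that a player cannot infer from headerless PCM, which is why the wrapped file becomes directly playable.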
Step 5: Control speed and instructions (OpenAI)

    const audio = await media.generateAudio({
      input: 'This is a slow, careful reading.',
      params: {
        voice: 'alloy',
        format: 'mp3',
        speed: 0.75,
        instructions: 'Speak slowly and clearly, as if explaining to a child.',
      },
    });

`speed` and `instructions` are OpenAI-specific. `speed` defaults to 1.0 (range 0.25 to 4.0). `instructions` is a style-guide string available on newer TTS models such as `gpt-4o-mini-tts`. Google ignores both.
Your options
`generateAudio()` parameters (`AudioGenRequest.params`):
| Param | Type | Providers | Description |
|---|---|---|---|
| `voice` | string | OpenAI, Google | Voice id or alias. OpenAI: 'alloy', 'coral', 'shimmer', 'echo', 'fable', 'onyx', 'nova', 'sage', plus aliases. Google: 'Kore', 'Aoede', 'Zephyr', 'Charon', and many more Gemini voice names. |
| `format` | string | OpenAI | Output audio format. See format table below. Google always returns WAV (PCM wrapped). |
| `speed` | number | OpenAI | Playback speed multiplier. Range 0.25 to 4.0. Default 1.0. |
| `instructions` | string | OpenAI (newer models) | Style/tone instructions for the voice. Available on gpt-4o-mini-tts and similar. |
| `sampleRate` | number | Internal | PCM sample rate override for raw output. |
| `language` | string | | Language hint (BCP-47). |
OpenAI output formats (params.format):
| Format | MIME type returned | When to use |
|---|---|---|
| `'mp3'` (default) | audio/mp3 | General use; good compression, wide support. |
| `'wav'` | audio/wav | Lossless; larger file, no encoding artifacts. Good for further audio processing. |
| `'opus'` | audio/opus | Best compression for streaming; requires an Opus decoder. |
| `'flac'` | audio/flac | Lossless, compressed. Good for archiving. |
| `'pcm16'` | audio/pcm | Raw signed 16-bit PCM, no container. For real-time pipelines feeding into audio processing code. |
| `'aac'` | Falls back to wav | AAC is not supported on the SDK's OpenAI TTS path; the SDK silently uses wav. |
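The format table can be read as a lookup from `params.format` to the MIME type you should expect back. The sketch below encodes exactly the rows above, including the `aac` fallback; `expectedMimeType` is a hypothetical helper, and throwing on unknown formats is a choice made for this example rather than documented SDK behaviour.

```typescript
// Format -> MIME type, copied from the table above.
const FORMAT_MIME = new Map<string, string>([
  ["mp3", "audio/mp3"],
  ["wav", "audio/wav"],
  ["opus", "audio/opus"],
  ["flac", "audio/flac"],
  ["pcm16", "audio/pcm"],
  ["aac", "audio/wav"], // documented fallback: aac is silently replaced by wav
]);

// Returns the MIME type the table predicts for a requested format.
// An omitted format means the documented default, 'mp3'.
function expectedMimeType(format: string = "mp3"): string {
  const mime = FORMAT_MIME.get(format);
  if (mime === undefined) {
    // Assumption for this sketch: unknown formats are treated as an error.
    throw new Error(`unknown TTS format: ${format}`);
  }
  return mime;
}
```

A check like `audio.mimeType === expectedMimeType(params.format)` is a cheap way to notice the silent `aac` fallback in tests.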
OpenAI TTS models:
| Model | Notes |
|---|---|
| `tts-1` | Standard quality, lowest latency, cheapest. |
| `tts-1-hd` | Higher audio quality, higher cost. |
| `gpt-4o-mini-tts` | Newest; supports the `instructions` param; best quality overall. |
Google TTS models:
| Model | Notes |
|---|---|
| `gemini-2.5-flash-preview-tts` | Default used by the adapter. Returns PCM wrapped to WAV. |
`createMediaOutput()` options are the same as for image generation (see article 16 for the full table). The key ones for TTS:
| Option | Description |
|---|---|
| `model` | Namespaced model id (`'openai/tts-1'`). |
| `dir` | Output directory (Node/Bun). |
| `store` | Custom `MediaStore` for browser or alternative persistence. |
MediaResult fields for audio:
| Field | Description |
|---|---|
| `id` | `'aud_<uuid>'`. Use it to load bytes from the store. |
| `type` | `'audio'` |
| `mimeType` | Format-dependent. `'audio/wav'` for WAV and Google output, `'audio/mp3'` for OpenAI MP3, etc. |
| `meta.size` | Byte count. |
| `meta.durationMs` | Duration in milliseconds, when reported by the provider. |
| `meta.sampleRate` | Sample rate in Hz, when reported. Google PCM is 24000 Hz. |
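Because `meta.durationMs` is only present when the provider reports it, a duration for uncompressed audio can be estimated from the byte count and sample rate. This is a sketch under stated assumptions: `pcmDurationMs` is a hypothetical helper, the audio is uncompressed PCM, and the mono and 44-byte-header figures are assumptions rather than documented SDK guarantees.

```typescript
// Duration of uncompressed PCM in milliseconds.
// Defaults assume 24000 Hz, 16-bit, mono (mono is an assumption).
function pcmDurationMs(
  pcmByteLength: number,
  sampleRate: number = 24000,
  channels: number = 1,
  bitsPerSample: number = 16,
): number {
  const bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
  return (pcmByteLength / bytesPerSecond) * 1000;
}

// For a WAV file with a canonical 44-byte header (assumption),
// subtract the header before estimating.
function wavDurationMs(fileSize: number, sampleRate: number = 24000): number {
  return pcmDurationMs(Math.max(0, fileSize - 44), sampleRate);
}
```

For example, 48000 bytes of 16-bit mono PCM at 24000 Hz is one second of audio. The estimate does not apply to compressed formats such as MP3 or Opus.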
Cost note: OpenAI TTS is priced per character of input text (not per token). tts-1 is approximately $0.015/1K characters; tts-1-hd is $0.030/1K characters. Google Gemini TTS is priced per output token (the audio is internally tokenised). The SDK emits cost via onCostEntry so the cost collector tracks it alongside chat completions.
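As a rough worked example of per-character pricing, the sketch below multiplies input length by the approximate rates quoted in the cost note. `estimateOpenAiTtsCostUsd` is a hypothetical helper, the rates are the approximate figures from this page and may change, and `text.length` counts UTF-16 code units, which can differ slightly from how the provider counts characters.

```typescript
type PricedTtsModel = "tts-1" | "tts-1-hd";

// Approximate USD per 1,000 input characters, as quoted in the cost note.
const USD_PER_1K_CHARS: Record<PricedTtsModel, number> = {
  "tts-1": 0.015,
  "tts-1-hd": 0.03,
};

// Estimates the cost of synthesising `text` with the given model.
function estimateOpenAiTtsCostUsd(text: string, model: PricedTtsModel): number {
  return (text.length / 1000) * USD_PER_1K_CHARS[model];
}
```

A 10,000-character article would therefore cost roughly $0.15 on `tts-1` and $0.30 on `tts-1-hd`; the figures tracked via `onCostEntry` remain the authoritative ones.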
Compare the SDKs
OpenAI’s SDK returns a `Response` object that you must stream to disk manually (`response.body.pipe(fs.createWriteStream(...))`). Google has no dedicated TTS method; you call `client.models.generateContent()`, extract `candidates[0].content.parts[0].inlineData.data` (raw base64 PCM), decode it, and construct a WAV file yourself. With ORXA you call `generateAudio()` once and get back a `MediaResult` with the file already saved and playable.
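To illustrate the manual Google path, the sketch below pulls the base64 payload out of a response-shaped object and decodes it to bytes. `GeminiAudioResponseLike`, `decodeBase64`, and `extractInlinePcm` are hypothetical names; the interface models only the property path named above, and the hand-written decoder is used purely to keep the example free of runtime-specific APIs.

```typescript
// Models only the property path named in the text above.
interface GeminiAudioResponseLike {
  candidates?: Array<{
    content?: {
      parts?: Array<{ inlineData?: { mimeType?: string; data?: string } }>;
    };
  }>;
}

const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Minimal standard-alphabet base64 decoder (padding is optional).
function decodeBase64(b64: string): Uint8Array {
  const clean = b64.replace(/=+$/, "");
  const out = new Uint8Array(Math.floor((clean.length * 3) / 4));
  let buffer = 0;
  let bits = 0;
  let o = 0;
  for (let i = 0; i < clean.length; i++) {
    const value = B64_ALPHABET.indexOf(clean.charAt(i));
    if (value < 0) throw new Error(`invalid base64 character at index ${i}`);
    buffer = ((buffer << 6) | value) & 0xffff; // keep only the bits still needed
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out[o++] = (buffer >> bits) & 0xff;
    }
  }
  return out;
}

// Extracts and decodes the first inline audio part of the response.
function extractInlinePcm(response: GeminiAudioResponseLike): Uint8Array {
  const inline = response.candidates?.[0]?.content?.parts?.[0]?.inlineData;
  if (!inline || !inline.data) {
    throw new Error("response contains no inline audio data");
  }
  return decodeBase64(inline.data);
}
```

The decoded bytes are still headerless PCM at this point, so they need a WAV header before most players will open them.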
Gotchas and next steps
Google returns raw PCM, not a WAV file. The Google adapter wraps the PCM in a WAV container via `ensurePlayableAudio()`. If you load the saved bytes and inspect them, they will have a RIFF/WAVE header. This is intentional: the raw PCM returned by Gemini (`audio/l16; rate=...`) is not playable by most audio players.
`format` is OpenAI-only. Google TTS always returns WAV regardless of what you pass in `params.format`. If you need a specific format from Google, transcode the WAV output with a tool such as ffmpeg.
`instructions` requires a compatible model. `gpt-4o-mini-tts` supports it; `tts-1` and `tts-1-hd` do not. The API silently ignores unsupported parameters, so test with your chosen model.
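Since unsupported parameters are dropped without an error, a pre-flight check can surface the problem earlier. The sketch below is one possible guard based only on what this page states (the `instructions`-capable model and the 0.25 to 4.0 speed range); `ttsParamWarnings` and `TtsParamsLike` are hypothetical names, not SDK exports.

```typescript
interface TtsParamsLike {
  voice?: string;
  format?: string;
  speed?: number;
  instructions?: string;
}

// Models this page documents as supporting `instructions`.
const INSTRUCTION_CAPABLE_MODELS = new Set<string>(["gpt-4o-mini-tts"]);

// Returns human-readable warnings for params that are likely to be ignored or rejected.
function ttsParamWarnings(model: string, params: TtsParamsLike): string[] {
  // Accept namespaced ids such as 'openai/tts-1'.
  const slash = model.indexOf("/");
  const bareModel = slash >= 0 ? model.slice(slash + 1) : model;
  const warnings: string[] = [];

  if (params.instructions !== undefined && !INSTRUCTION_CAPABLE_MODELS.has(bareModel)) {
    warnings.push(`instructions is not supported by ${bareModel} and may be ignored`);
  }
  if (params.speed !== undefined && (params.speed < 0.25 || params.speed > 4.0)) {
    warnings.push(`speed ${params.speed} is outside the documented range 0.25 to 4.0`);
  }
  return warnings;
}
```

Logging these warnings before calling `generateAudio()` makes a silently ignored `instructions` string visible during development.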
Voice ids are provider-specific beyond aliases. OpenAI has nine standard voices; Google has many more Gemini voice names (e.g. 'Puck', 'Orbit', 'Fenrir'). The four aliases (neutral/warm/bright/deep) map to one sensible voice per provider. For full creative control use the raw provider voice id string.
TTS output is not streamed. generateAudio() waits for the complete audio before returning. For streaming TTS (word-by-word audio), use the Realtime API.
Next steps:
- Speech to text (STT) — transcribe audio back to text
- Audio input — send audio to a multimodal model
- Realtime — live bidirectional streaming audio