
Text to speech (TTS)


Generate audio bytes for the input 'hello' and confirm non-empty audio is returned — same generateAudio() call for OpenAI and Google (Anthropic has no TTS API).

Text-to-speech converts a string into a spoken audio file for playback, accessibility, voice assistants, or audio content pipelines.

The challenge with raw provider SDKs:

  • OpenAI calls client.audio.speech.create({ model, input, voice, response_format }) and returns a binary stream that you must buffer to disk.
  • Google Gemini TTS has no dedicated TTS endpoint — it calls generateContent with responseModalities: ['AUDIO'] and a speechConfig voice config, then extracts raw PCM bytes from inlineData. The PCM data has no container (no WAV header); you must wrap it in a WAV container yourself before it is playable.

createMediaOutput().generateAudio() handles both paths, wraps the Google PCM output in a WAV container automatically, and writes the bytes to dir.

import { createMediaOutput } from '@combycode/llm-sdk';

const media = createMediaOutput({
  model: 'openai/tts-1',
  apiKey: process.env.OPENAI_API_KEY,
  dir: './.media-out',
});

dir is required in Node/Bun. The SDK creates the directory if it does not exist. In the browser use store: new MemoryMediaStore() instead.

const audio = await media.generateAudio({
  input: 'Hello, world.',
  params: { voice: 'alloy', format: 'wav' },
});
console.log(`saved ${audio.meta.size} bytes, id: ${audio.id}`);
// audio.mimeType -> 'audio/wav'
// audio.meta.provider -> 'openai'

generateAudio() returns a single MediaResult (not an array). The audio bytes are written to dir.

const audio = await media.generateAudio({
  input: 'Good morning.',
  params: { voice: 'warm' }, // alias -> 'coral' on OpenAI, 'Aoede' on Google
});

The SDK’s voice alias system maps four unified names to provider-specific voice ids:

| Alias | OpenAI voice | Google voice |
| --- | --- | --- |
| 'neutral' | 'alloy' | 'Kore' |
| 'warm' | 'coral' | 'Aoede' |
| 'bright' | 'shimmer' | 'Zephyr' |
| 'deep' | 'echo' | 'Charon' |

Any string not in the alias table is passed through verbatim — raw provider voice ids always work.
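The lookup-with-pass-through behaviour described above can be sketched as follows. This is an illustrative re-creation of the alias table, not the SDK's internal code; `VOICE_ALIASES`, `Provider`, and `resolveVoice` are hypothetical names.

```typescript
// Hypothetical re-creation of the documented alias table (not SDK source).
type Provider = 'openai' | 'google';

const VOICE_ALIASES = new Map<string, Record<Provider, string>>([
  ['neutral', { openai: 'alloy', google: 'Kore' }],
  ['warm', { openai: 'coral', google: 'Aoede' }],
  ['bright', { openai: 'shimmer', google: 'Zephyr' }],
  ['deep', { openai: 'echo', google: 'Charon' }],
]);

// Resolve an alias to a provider voice id; any other string passes through unchanged.
function resolveVoice(voice: string, provider: Provider): string {
  const entry = VOICE_ALIASES.get(voice);
  return entry ? entry[provider] : voice;
}
```

With this sketch, `resolveVoice('warm', 'google')` yields `'Aoede'`, while a raw id such as `'Puck'` is returned as-is.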

const googleMedia = createMediaOutput({
  model: 'google/gemini-2.5-flash-preview-tts',
  apiKey: process.env.GOOGLE_API_KEY,
  dir: './.media-out',
});

const audio = await googleMedia.generateAudio({
  input: 'Bienvenido.',
  params: { voice: 'Kore' },
});
// audio.mimeType -> 'audio/wav' (PCM wrapped in WAV by the adapter)

Google Gemini TTS returns raw 16-bit PCM at 24000 Hz. The Google adapter wraps it in a WAV container via ensurePlayableAudio() before saving, so audio.mimeType is always 'audio/wav' and the file is directly playable.
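To make the wrapping step concrete, here is a minimal sketch of prepending a standard 44-byte RIFF/WAVE header to raw PCM. It is not the source of ensurePlayableAudio(); `pcmToWav` is a hypothetical helper, and mono output is an assumption (the 16-bit depth and 24000 Hz rate come from the text above).

```typescript
// Sketch: wrap raw little-endian PCM samples in a 44-byte RIFF/WAVE header.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000, // rate stated above for Gemini TTS
  channels = 1, // assumption: mono
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true); // size of the sample data
  wav.set(pcm, 44);
  return wav;
}
```

All multi-byte header fields are little-endian, which is why every `setUint16`/`setUint32` call passes `true`.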

Step 5 — Control speed and instructions (OpenAI)

const audio = await media.generateAudio({
  input: 'This is a slow, careful reading.',
  params: {
    voice: 'alloy',
    format: 'mp3',
    speed: 0.75,
    instructions: 'Speak slowly and clearly, as if explaining to a child.',
  },
});

speed and instructions are OpenAI-specific. speed defaults to 1.0 (range 0.25-4.0). instructions is a style guide string available on newer TTS models like gpt-4o-mini-tts. Google ignores both.
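If speed comes from user input, a small pre-flight check against the documented range can catch bad values before a request is made. This is a sketch only; `normalizeSpeed` is a hypothetical helper, and whether the SDK validates the range itself is not stated here.

```typescript
// Hypothetical pre-flight check for the documented speed range (0.25-4.0).
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

function normalizeSpeed(speed?: number): number {
  if (speed === undefined) return 1.0; // documented default
  if (!Number.isFinite(speed) || speed < MIN_SPEED || speed > MAX_SPEED) {
    throw new RangeError(`speed must be between ${MIN_SPEED} and ${MAX_SPEED}, got ${speed}`);
  }
  return speed;
}
```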

generateAudio() parameters (AudioGenRequest.params):

| Param | Type | Providers | Description |
| --- | --- | --- | --- |
| voice | string | OpenAI, Google | Voice id or alias. OpenAI: 'alloy', 'coral', 'shimmer', 'echo', 'fable', 'onyx', 'nova', 'sage', plus aliases. Google: 'Kore', 'Aoede', 'Zephyr', 'Charon', and many more Gemini voice names. |
| format | string | OpenAI | Output audio format. See format table below. Google always returns WAV (PCM wrapped). |
| speed | number | OpenAI | Playback speed multiplier. Range 0.25-4.0. Default 1.0. |
| instructions | string | OpenAI (newer models) | Style/tone instructions for the voice. Available on gpt-4o-mini-tts and similar. |
| sampleRate | number | Internal | PCM sample rate override for raw output. |
| language | string | Google | Language hint (BCP-47). |

OpenAI output formats (params.format):

| Format | MIME type returned | When to use |
| --- | --- | --- |
| 'mp3' (default) | audio/mp3 | General use; good compression, wide support. |
| 'wav' | audio/wav | Lossless; larger file, no encoding artifacts. Good for further audio processing. |
| 'opus' | audio/opus | Best compression for streaming; requires Opus decoder. |
| 'flac' | audio/flac | Lossless, compressed. Good for archiving. |
| 'pcm16' | audio/pcm | Raw signed 16-bit PCM, no container. For real-time pipelines feeding into audio processing code. |
| 'aac' | Falls back to wav | The SDK does not produce AAC; it silently uses wav instead. |
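When asserting on audio.mimeType in tests, it can help to encode the format table as a lookup. The sketch below copies the table's values, including the documented 'aac' fallback to wav; `FORMAT_MIME` and `expectedMimeType` are hypothetical names, not SDK exports.

```typescript
// Lookup built from the format table above (illustrative, not SDK source).
const FORMAT_MIME = new Map<string, string>([
  ['mp3', 'audio/mp3'],
  ['wav', 'audio/wav'],
  ['opus', 'audio/opus'],
  ['flac', 'audio/flac'],
  ['pcm16', 'audio/pcm'],
]);

// Returns the MIME type the table predicts for an OpenAI request,
// or undefined for a format the table does not list.
function expectedMimeType(format = 'mp3'): string | undefined {
  const effective = format === 'aac' ? 'wav' : format; // documented silent fallback
  return FORMAT_MIME.get(effective);
}
```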

OpenAI TTS models:

| Model | Notes |
| --- | --- |
| tts-1 | Standard quality, lowest latency, cheapest. |
| tts-1-hd | Higher audio quality, higher cost. |
| gpt-4o-mini-tts | Newest; supports instructions param; best quality overall. |

Google TTS models:

| Model | Notes |
| --- | --- |
| gemini-2.5-flash-preview-tts | Default used by the adapter. Returns PCM wrapped to WAV. |

createMediaOutput() options (same as image gen — see article 16 for full table). Key ones for TTS:

| Option | Description |
| --- | --- |
| model | Namespaced model id ('openai/tts-1'). |
| dir | Output directory (Node/Bun). |
| store | Custom MediaStore for browser or alternative persistence. |

MediaResult fields for audio:

| Field | Description |
| --- | --- |
| id | 'aud_<uuid>'. Use it to load bytes from the store. |
| type | 'audio' |
| mimeType | Format-dependent. 'audio/wav' for WAV and Google output. 'audio/mp3' for OpenAI MP3, etc. |
| meta.size | Byte count. |
| meta.durationMs | Duration in milliseconds, when reported by the provider. |
| meta.sampleRate | Sample rate in Hz, when reported. Google PCM is 24000 Hz. |
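Because meta.durationMs is only present when the provider reports it, the duration of an uncompressed PCM WAV can be estimated from meta.size instead. A minimal sketch, assuming a plain 44-byte header, mono audio, and 16-bit samples; `estimatePcmWavDurationMs` is a hypothetical helper and does not apply to compressed formats such as MP3 or Opus.

```typescript
// Estimate duration of an uncompressed PCM WAV from its byte size.
// Assumptions: 44-byte header, mono, 16-bit samples (2 bytes each).
function estimatePcmWavDurationMs(
  sizeBytes: number,
  sampleRate = 24000,
  channels = 1,
  bytesPerSample = 2,
  headerBytes = 44,
): number {
  const payload = Math.max(0, sizeBytes - headerBytes);
  const bytesPerSecond = sampleRate * channels * bytesPerSample;
  return Math.round((payload / bytesPerSecond) * 1000);
}
```

For example, a 48044-byte file at 24000 Hz carries 48000 bytes of samples, which is one second of mono 16-bit audio.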

Cost note: OpenAI TTS is priced per character of input text (not per token). tts-1 is approximately $0.015/1K characters; tts-1-hd is $0.030/1K characters. Google Gemini TTS is priced per output token (the audio is internally tokenised). The SDK emits cost via onCostEntry so the cost collector tracks it alongside chat completions.
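The per-character pricing lends itself to a quick back-of-the-envelope estimate before sending a long text. The sketch below uses the approximate rates quoted in the cost note; `estimateOpenAiTtsCostUsd` is a hypothetical helper, and treating one Unicode code point as one billed character is an assumption.

```typescript
// Approximate rates from the cost note above, in USD per 1,000 input characters.
type PricedModel = 'tts-1' | 'tts-1-hd';

const USD_PER_1K_CHARS: Record<PricedModel, number> = {
  'tts-1': 0.015,
  'tts-1-hd': 0.03,
};

// Assumption: one Unicode code point counts as one billed character.
function estimateOpenAiTtsCostUsd(text: string, model: PricedModel): number {
  const characters = Array.from(text).length;
  return (characters / 1000) * USD_PER_1K_CHARS[model];
}
```

Treat the result as a rough pre-flight figure; the cost reported through onCostEntry remains the value to track.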

import { createMediaOutput } from '@combycode/llm-sdk';

// Unified TTS via the same media handle. (Official: openai audio.speech.create vs
// google generateContent responseModalities:['AUDIO'] + speechConfig.)
const provider = (process.env.LLM_MODEL ?? '').split('/')[0];
const media = createMediaOutput({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  dir: './.media-out',
});

const t0 = performance.now();
const audio = await media.generateAudio({
  input: 'hello',
  params: { voice: provider === 'google' ? 'Kore' : 'alloy', format: 'wav' },
});
console.log(JSON.stringify({ result: String(audio?.meta.size ?? 0), ms: Math.round(performance.now() - t0) }));

OpenAI’s SDK returns a Response object that you must stream to disk manually (response.body.pipe(fs.createWriteStream(...))). Google has no dedicated TTS method; you call client.models.generateContent(), extract candidates[0].content.parts[0].inlineData.data (base64-encoded raw PCM), decode it, and construct a WAV file yourself. With ORXA, you call generateAudio() once and get back a MediaResult with the file already saved and playable.
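For comparison, the manual Google extraction step might look like the sketch below. It only models the response fields named above; `GeminiAudioResponse` and `extractInlineAudio` are hypothetical names, and the real response type carries many more fields. It relies on the global `atob`, which is available in browsers and recent Node, Bun, and Deno releases.

```typescript
// Minimal shape of the fields this sketch reads (not the full provider type).
interface GeminiAudioResponse {
  candidates?: Array<{
    content?: {
      parts?: Array<{ inlineData?: { mimeType?: string; data?: string } }>;
    };
  }>;
}

// Pull the base64 payload out of the first part and decode it to bytes.
function extractInlineAudio(
  response: GeminiAudioResponse,
): { mimeType: string; bytes: Uint8Array } {
  const inline = response.candidates?.[0]?.content?.parts?.[0]?.inlineData;
  if (!inline?.data) throw new Error('response contains no inline audio data');

  const binary = atob(inline.data); // base64 -> binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

  return { mimeType: inline.mimeType ?? 'application/octet-stream', bytes };
}
```

The decoded bytes are still headerless PCM at this point, so they need a WAV container before most players will accept them.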

Google returns raw PCM, not a WAV file. The Google adapter wraps the PCM in a WAV container via ensurePlayableAudio(). If you load the saved bytes from the store and inspect them, they will start with a RIFF/WAVE header. This is intentional: the raw PCM returned by Gemini (audio/l16; rate=...) is not playable by most audio players.
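Two small checks illustrate the point: one detects the RIFF/WAVE signature in a byte buffer, the other reads the sample rate out of an audio/l16 MIME string. Both `isWav` and `parseL16SampleRate` are hypothetical helpers for inspection and debugging, not SDK functions.

```typescript
// True when the buffer starts with a RIFF header whose form type is WAVE.
function isWav(bytes: Uint8Array): boolean {
  if (bytes.length < 12) return false;
  const tag = (start: number): string =>
    String.fromCharCode(...Array.from(bytes.subarray(start, start + 4)));
  return tag(0) === 'RIFF' && tag(8) === 'WAVE';
}

// Extract the rate parameter from a MIME type such as 'audio/l16; rate=24000'.
function parseL16SampleRate(mimeType: string): number | undefined {
  const match = /;\s*rate=(\d+)/i.exec(mimeType);
  return match ? Number(match[1]) : undefined;
}
```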

format is OpenAI-only. Google TTS always returns WAV regardless of what you pass in params.format. If you need a specific format from Google, transcode the WAV output with a tool such as ffmpeg.

instructions requires a compatible model. gpt-4o-mini-tts supports it; tts-1 and tts-1-hd do not. The API ignores unsupported parameters silently — test with your chosen model.
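Since an unsupported instructions value is dropped without an error, a guard in your own code can make the mismatch visible. A minimal sketch, assuming only the models named in this article; `MODELS_WITH_INSTRUCTIONS` and `supportsInstructions` are hypothetical names, and the set should be extended as other models gain support.

```typescript
// Assumption: only the model named in this article is listed; extend as needed.
const MODELS_WITH_INSTRUCTIONS = new Set<string>(['gpt-4o-mini-tts']);

// Accepts either a namespaced id ('openai/tts-1') or a bare model name ('tts-1').
function supportsInstructions(model: string): boolean {
  const bare = model.slice(model.indexOf('/') + 1); // no slash -> whole string
  return MODELS_WITH_INSTRUCTIONS.has(bare);
}
```

A caller could log a warning, or omit params.instructions entirely, when this returns false.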

Voice ids are provider-specific beyond aliases. OpenAI has nine standard voices; Google has many more Gemini voice names (e.g. 'Puck', 'Orus', 'Fenrir'). The four aliases (neutral/warm/bright/deep) map to one sensible voice per provider. For full creative control, use the raw provider voice id string.

TTS output is not streamed. generateAudio() waits for the complete audio before returning. For streaming TTS (word-by-word audio), use the Realtime API.

Next steps: