Audio input

This recipe sends a WAV file of the spoken word 'hello' with the prompt 'Transcribe this audio. Reply with only the spoken words.' and asserts that the response matches /hello/i on OpenAI (Chat Completions audio model) and Google Gemini. Anthropic does not support audio input in complete().

Use audio input to a chat model when you want the model to reason about the audio alongside other context — combining speech recognition, intent detection, sentiment analysis, and a text response in one call. Examples: classify a support call recording, answer a question spoken in an audio clip, or transcribe-and-reply in one step.

This is different from transcribe() (article 18). transcribe() calls a dedicated speech-to-text endpoint (OpenAI /v1/audio/transcriptions); it is cheaper and returns only text. Audio input to complete() runs the full multimodal model — more capable, higher cost per second of audio.

The challenge with raw provider SDKs:

  • OpenAI Chat Completions (audio models) requires modalities: ['text', 'audio'] on the request body, an audio object with voice and format, and the audio itself as { type: 'input_audio', input_audio: { data, format } } — a completely separate content block from images.
  • Google accepts audio as an inlineData part with the audio MIME type, exactly like images. No extra modality flags.
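
To make the difference concrete, here is a rough sketch of the two request bodies as plain objects. The field names follow the public OpenAI Chat Completions and Gemini generateContent request formats as described in the bullets above; the helper names openAIAudioBody and geminiAudioBody are hypothetical and are not part of any SDK.

```typescript
// Sketch only: builds the raw request bodies each provider expects for audio input.
// openAIAudioBody and geminiAudioBody are illustrative helpers, not SDK functions.

function openAIAudioBody(
  model: string,
  prompt: string,
  base64Audio: string,
  format: 'wav' | 'mp3',
) {
  return {
    model,
    // Extra flags the OpenAI audio models expect on the request body.
    modalities: ['text', 'audio'],
    audio: { voice: 'alloy', format: 'wav' },
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: prompt },
          // Audio travels in its own content block type, separate from images.
          { type: 'input_audio', input_audio: { data: base64Audio, format } },
        ],
      },
    ],
  };
}

function geminiAudioBody(prompt: string, base64Audio: string, mimeType: string) {
  return {
    contents: [
      {
        role: 'user',
        parts: [
          { text: prompt },
          // Same inlineData part used for images; only the MIME type differs.
          { inlineData: { mimeType, data: base64Audio } },
        ],
      },
    ],
  };
}
```

The same prompt and the same base64 payload end up in two structurally different bodies, which is the divergence a unified attachments call is meant to hide.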

The attachments API handles both paths with one call.

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Transcribe this audio. Reply with only the spoken words.',
  attachments: ['./hello.wav'],
  maxTokens: 256,
});
console.log(text); // "hello"

loadContent() detects audio/wav from the .wav extension (or from the RIFF+WAVE magic bytes), base64-encodes the file, and returns an AudioPart with a base64 DataSource. The provider adapter then uses the correct wire shape.

import { complete } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

// Raw bytes in memory: the SDK sniffs the MIME type from the leading bytes.
const audioBytes = new Uint8Array(readFileSync('./recording.wav'));
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What language is being spoken in this recording?',
  attachments: [audioBytes],
  maxTokens: 64,
});

MIME is detected from the leading bytes: RIFF+WAVE -> audio/wav; an ID3 tag or an MPEG frame sync (FF Ex/Fx) -> audio/mpeg. For formats the byte sniffing does not recognise (Opus, FLAC, AAC), use the file-path form so the extension can be used for MIME detection.
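
The byte checks can be sketched as follows. This is a minimal illustration of the two signatures named above, not the SDK's actual detection code; sniffAudioMime is a hypothetical helper name.

```typescript
// Sketch of magic-byte sniffing for WAV and MP3.
// sniffAudioMime is a hypothetical helper, not the SDK's implementation.
function sniffAudioMime(bytes: Uint8Array): 'audio/wav' | 'audio/mpeg' | undefined {
  // Read bytes [start, end) as an ASCII string.
  const ascii = (start: number, end: number): string => {
    let out = '';
    for (let i = start; i < end; i++) out += String.fromCharCode(bytes[i]);
    return out;
  };

  // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
  if (bytes.length >= 12 && ascii(0, 4) === 'RIFF' && ascii(8, 12) === 'WAVE') {
    return 'audio/wav';
  }
  // MP3 with an ID3v2 tag: the file starts with "ID3".
  if (bytes.length >= 3 && ascii(0, 3) === 'ID3') {
    return 'audio/mpeg';
  }
  // Bare MPEG audio frame: 0xFF followed by a byte whose top three bits are set.
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0) {
    return 'audio/mpeg';
  }
  // Anything else is left to extension-based detection.
  return undefined;
}
```

The frame-sync check is a heuristic: it only looks at the first two bytes, so an undefined result (or a surprising one) is a cue to pass a file path or an explicit mimeType instead.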

When the audio format cannot be detected from the file extension or the magic bytes, build the AudioPart directly and set its mimeType yourself:

import { complete } from '@combycode/llm-sdk';
import type { AudioPart } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';
import { Buffer } from 'buffer';

// Read the file and base64-encode it (no data-URL prefix).
const raw = new Uint8Array(readFileSync('./clip.mp3'));
const b64 = Buffer.from(raw).toString('base64');

// Explicit AudioPart with the MIME type set by hand.
const audioPart: AudioPart = {
  type: 'audio',
  source: { type: 'base64', mimeType: 'audio/mpeg', data: b64 },
};

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Summarise what is said in this clip.' },
        audioPart,
      ],
    },
  ],
  maxTokens: 256,
});

Step 4 — Control the output voice for OpenAI audio models

OpenAI’s audio-capable Chat Completions models (e.g. gpt-4o-audio-preview) can return both a text transcript and an audio reply in the same call. You control the output voice and format via audio:

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: 'openai/gpt-4o-audio-preview',
  apiKey: process.env.OPENAI_API_KEY,
  prompt: 'Listen to this and reply in English.',
  attachments: ['./question.wav'],
  audio: { voice: 'alloy', format: 'wav' },
  maxTokens: 512,
});
// text contains the transcript / text reply
// audio output (spoken reply) is in the response parts when present

audio.voice and audio.format are forwarded by the adapter into the audio object on the Chat Completions body. They have no effect on Google, whose adapter ignores the audio option unless audio output is requested.

AudioPart shape (ContentPart of type 'audio'):

| Field | Type | Description |
| --- | --- | --- |
| type | 'audio' | Discriminator, set by loadContent() when MIME starts with audio/. |
| source | DataSource | Where the audio bytes come from (see below). |

DataSource variants for audio input:

| type | Required fields | When to use |
| --- | --- | --- |
| 'base64' | mimeType: string, data: string | Audio bytes encoded as base64 (no data-URL prefix). Output of loadContent(). |
| 'buffer' | mimeType: string, data: Uint8Array | Raw bytes in memory with explicit MIME. |
| 'path' | mimeType: string, path: string | Local file path (Node/Bun). SDK reads and encodes. Prefer attachments for simplicity. |
| 'url' | url: string | Remote URL. SDK fetches, encodes, sends as base64. |
| 'file' | fileId: string | Files API reference. Google accepts fileData with a URI; not supported by the OpenAI audio input path. |
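
As a reading aid, the two tables above can be written as a TypeScript discriminated union. The types below are reconstructed from this page only and carry a Sketch suffix to make clear they are not the SDK's own declarations; describeSource is a hypothetical helper that shows how the type field narrows each variant.

```typescript
// Sketch of the AudioPart / DataSource shapes from the tables above.
// Reconstructed from this page, not copied from the SDK's type declarations.
type DataSourceSketch =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'url'; url: string }
  | { type: 'file'; fileId: string };

interface AudioPartSketch {
  type: 'audio';
  source: DataSourceSketch;
}

// Switching on `type` narrows the union, so each branch sees only its own fields.
function describeSource(source: DataSourceSketch): string {
  switch (source.type) {
    case 'base64':
      return `inline base64 (${source.mimeType}, ${source.data.length} chars)`;
    case 'buffer':
      return `in-memory bytes (${source.mimeType}, ${source.data.byteLength} bytes)`;
    case 'path':
      return `local file ${source.path} (${source.mimeType})`;
    case 'url':
      return `remote URL ${source.url}`;
    case 'file':
      return `Files API reference ${source.fileId}`;
  }
}

// Usage: an in-memory part with an explicit MIME type.
const examplePart: AudioPartSketch = {
  type: 'audio',
  source: { type: 'buffer', mimeType: 'audio/wav', data: new Uint8Array(3) },
};
console.log(describeSource(examplePart.source));
```

Note that only the 'url' and 'file' variants omit mimeType, which matches the required-fields column above.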

Audio MIME types auto-detected:

| Extension / Magic bytes | MIME type |
| --- | --- |
| .wav / RIFF+WAVE header | audio/wav |
| .mp3 / ID3 tag or FF Ex/Fx frame sync | audio/mpeg |
| .m4a | audio/mp4 |
| .ogg | audio/ogg |
| .flac | audio/flac |
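
The extension column of this table amounts to a simple lookup. The sketch below mirrors the five rows listed here; mimeFromExtension is a hypothetical helper, and the SDK's own detection may cover more cases or behave differently at the edges.

```typescript
// Extension lookup mirroring the table above.
// mimeFromExtension is a hypothetical helper, not an SDK export.
const AUDIO_MIME_BY_EXTENSION: Record<string, string> = {
  '.wav': 'audio/wav',
  '.mp3': 'audio/mpeg',
  '.m4a': 'audio/mp4',
  '.ogg': 'audio/ogg',
  '.flac': 'audio/flac',
};

function mimeFromExtension(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot === -1) return undefined;
  // Lowercasing is an assumption so that "CLIP.MP3" resolves like "clip.mp3".
  return AUDIO_MIME_BY_EXTENSION[filePath.slice(dot).toLowerCase()];
}
```

An undefined result is the case where an explicit AudioPart with a hand-set mimeType is the safer route.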

OpenAI audio option (output control when sending audio input):

| Field | Values | Notes |
| --- | --- | --- |
| voice | 'alloy', 'coral', 'shimmer', 'echo', or a voice alias ('neutral', 'warm', 'bright', 'deep') | Voice for the spoken audio reply. Aliases are resolved by the SDK. |
| format | 'wav', 'mp3', 'opus', 'flac', 'pcm16' | Format of the spoken audio reply. 'aac' falls back to 'wav'. |

Provider support for audio input:

| Provider | Models | Supported audio formats | Notes |
| --- | --- | --- | --- |
| OpenAI (Chat Completions) | gpt-4o-audio-preview, gpt-4o-mini-audio-preview | WAV, MP3 only | Requires modalities: ['text', 'audio']; the adapter sets this automatically when an AudioPart is detected. |
| Google | Gemini 1.5+, Gemini 2.0+ | WAV, MP3, OGG, FLAC, M4A, many more | inlineData path; no extra flags needed. |
| Anthropic | None | | Audio input in complete() is not supported. Use transcribe() for STT. |

Audio input vs. transcribe() — which to use:

| | Audio input to complete() | transcribe() (article 18) |
| --- | --- | --- |
| What it does | Full multimodal model call: understand + respond | Dedicated STT endpoint: speech -> text only |
| Output | Text reply (+ optional audio reply on OpenAI) | { text } only |
| Cost | Per-token pricing for the full model | Per-minute pricing (OpenAI Whisper / gpt-4o-transcribe) |
| Best for | Reason about audio content, combine with other context | High-volume cheap transcription |
| Provider | OpenAI audio models, Google Gemini | OpenAI (dedicated endpoint), Google (via completion) |

import { complete } from '@combycode/llm-sdk';

// Audio understanding via the unified attachments API — loadContent() detects the
// wav by MIME and attaches it as input audio. (Official openai needs modalities:
// ['text','audio'] + audio config and reads message.audio.transcript.)
const t0 = performance.now();
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Transcribe this audio. Reply with only the spoken words.',
  attachments: ['../../official-samples/_fixtures/hello.wav'],
  maxTokens: 256,
});
console.log(JSON.stringify({ result: text.trim() || 'empty', ms: Math.round(performance.now() - t0) }));

OpenAI Chat Completions requires three extra fields for audio input: modalities, audio.voice, and audio.format on the request body, plus a non-standard input_audio content block type. Google needs no extra flags — just inlineData. The ORXA adapter detects AudioPart in the message content and automatically sets modalities: ['text', 'audio'] on the OpenAI request; for Google it maps to inlineData. Your code remains the same for both providers.

OpenAI audio models only accept WAV and MP3. The adapter coerces mimeType to wav for anything that does not contain mpeg or mp3. Send your audio as WAV or MP3 to be safe.
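
The coercion rule can be sketched in a few lines. Only the rule itself comes from this page; the function name openAIInputFormat and the case-insensitive match are assumptions made for illustration.

```typescript
// Sketch of the coercion rule described above: anything whose MIME type does not
// mention "mpeg" or "mp3" is labelled as wav. Hypothetical helper, not SDK code.
function openAIInputFormat(mimeType: string): 'wav' | 'mp3' {
  const lower = mimeType.toLowerCase(); // case-insensitivity is an assumption
  return lower.includes('mpeg') || lower.includes('mp3') ? 'mp3' : 'wav';
}

console.log(openAIInputFormat('audio/mpeg')); // mp3
console.log(openAIInputFormat('audio/flac')); // wav
```

The second call shows why non-WAV, non-MP3 input is risky: FLAC bytes would be labelled as wav, so the label and the payload no longer agree.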

OpenAI always enables audio output when audio input is detected. When the adapter sees an AudioPart in the message content it adds modalities: ['text', 'audio'] to the body and requires a voice. The default voice is 'alloy'. If you only want transcription and not a spoken reply, use transcribe() instead.

Google Gemini audio support is broad. Linear PCM, WAV, MP3, AAC, AIFF, FLAC, OGG, and more are accepted. Duration limits vary by model (typically up to 8.4 hours of audio per request).

Cost is not per-minute here. Audio input to complete() is billed as tokens (the audio is tokenised internally). OpenAI charges roughly 1 token per 0.1s of audio. Google charges per token. Use transcribe() for bulk speech-to-text at per-minute rates.
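
As a back-of-the-envelope aid, the "roughly 1 token per 0.1s" figure translates to about 10 input tokens per second of audio on OpenAI. The helper below is a hypothetical estimator built on that approximation only; it does not reflect actual billing, and Google's rate is not covered.

```typescript
// Rough estimate derived from the "1 token per 0.1 s" figure above.
// estimateOpenAIAudioTokens is a hypothetical helper; real billing may differ.
const OPENAI_AUDIO_TOKENS_PER_SECOND = 10; // 1 token / 0.1 s, approximate

function estimateOpenAIAudioTokens(durationSeconds: number): number {
  return Math.ceil(durationSeconds * OPENAI_AUDIO_TOKENS_PER_SECOND);
}

console.log(estimateOpenAIAudioTokens(60)); // 600
```

Under this approximation a one-minute clip costs around 600 audio input tokens before any prompt text or output tokens are counted.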

Next steps: