Audio input
What you will achieve
Send a WAV file of the spoken word 'hello' with the prompt 'Transcribe this audio. Reply with only the spoken words.', then assert that the response matches /hello/i on OpenAI (Chat Completions audio model) and Google Gemini. Anthropic does not support audio input in complete().
When and why you need this
Use audio input to a chat model when you want the model to reason about the audio alongside other context, combining speech recognition, intent detection, sentiment analysis, and a text response in one call. Examples: classify a support call recording, answer a question spoken in an audio clip, or transcribe-and-reply in one step.
This is different from transcribe() (article 18). transcribe() calls a dedicated speech-to-text endpoint (OpenAI /v1/audio/transcriptions); it is cheaper and returns only text. Audio input to complete() runs the full multimodal model — more capable, higher cost per second of audio.
The challenge with raw provider SDKs:
- OpenAI Chat Completions (audio models) requires modalities: ['text', 'audio'] on the request body, an audio object with voice and format, and the audio itself as { type: 'input_audio', input_audio: { data, format } }, a completely separate content block from images.
- Google accepts audio as an inlineData part with the audio MIME type, exactly like images. No extra modality flags.
The attachments API handles both paths with one call.
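To make the difference concrete, here is a minimal sketch of the two wire shapes listed above. The helper names toOpenAIAudioBlock and toGoogleAudioPart are hypothetical and are not SDK exports. The format rule (mp3 when the MIME type contains mpeg or mp3, otherwise wav) follows the coercion described under Gotchas.

```typescript
// Hypothetical helpers (not SDK exports) illustrating the two provider wire shapes.

type OpenAIInputAudio = {
  type: 'input_audio';
  input_audio: { data: string; format: 'wav' | 'mp3' };
};

type GoogleInlineData = {
  inlineData: { mimeType: string; data: string };
};

// OpenAI: audio travels in its own content block type and needs a short format name.
function toOpenAIAudioBlock(mimeType: string, base64: string): OpenAIInputAudio {
  const lower = mimeType.toLowerCase();
  const format = lower.includes('mpeg') || lower.includes('mp3') ? 'mp3' : 'wav';
  return { type: 'input_audio', input_audio: { data: base64, format } };
}

// Google: audio is an inlineData part carrying the full MIME type, like an image.
function toGoogleAudioPart(mimeType: string, base64: string): GoogleInlineData {
  return { inlineData: { mimeType, data: base64 } };
}
```

The same base64 payload feeds both helpers; only the envelope around it changes, which is the part an adapter can hide from calling code.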
Step by step
Step 1: Send a WAV file by path

    import { complete } from '@combycode/llm-sdk';

    const { text } = await complete({
      model: process.env.LLM_MODEL!,
      apiKey: process.env.LLM_API_KEY,
      prompt: 'Transcribe this audio. Reply with only the spoken words.',
      attachments: ['./hello.wav'],
      maxTokens: 256,
    });

    console.log(text); // "hello"

loadContent() detects audio/wav from the .wav extension (or from the RIFF+WAVE magic bytes), base64-encodes the file, and returns an AudioPart with a base64 DataSource. The provider adapter then converts that part into the provider's wire shape.
Step 2: Pass raw audio bytes

    import { readFileSync } from 'fs';
    import { complete } from '@combycode/llm-sdk';

    // Read the file into memory and hand the bytes over directly.
    const audioBytes = new Uint8Array(readFileSync('./recording.wav'));

    const { text } = await complete({
      model: process.env.LLM_MODEL!,
      apiKey: process.env.LLM_API_KEY,
      prompt: 'What language is being spoken in this recording?',
      attachments: [audioBytes],
      maxTokens: 64,
    });

MIME is detected from the leading bytes: RIFF+WAVE -> audio/wav, ID3/FF Fx -> audio/mpeg. For formats that this byte sniff does not cover (Opus, FLAC, AAC), use the file-path form so the extension can be used for MIME detection.
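The byte-level detection can be pictured with a short sketch. This is an illustration under assumptions, not the SDK's detector: sniffAudioMime is a hypothetical name, and the MPEG check uses the general 11-bit frame-sync rule rather than whatever exact byte pattern the SDK tests.

```typescript
// Hypothetical sketch of magic-byte sniffing; the SDK's real detector may differ.
function sniffAudioMime(bytes: Uint8Array): string | undefined {
  // True when the ASCII tag appears at the given byte offset.
  const matches = (offset: number, tag: string): boolean => {
    if (bytes.length < offset + tag.length) return false;
    for (let i = 0; i < tag.length; i++) {
      if (bytes[offset + i] !== tag.charCodeAt(i)) return false;
    }
    return true;
  };

  // WAV: "RIFF" at offset 0, a 4-byte chunk size, then "WAVE" at offset 8.
  if (matches(0, 'RIFF') && matches(8, 'WAVE')) return 'audio/wav';

  // MP3 carrying an ID3v2 tag starts with the ASCII bytes "ID3".
  if (matches(0, 'ID3')) return 'audio/mpeg';

  // Bare MPEG audio frame: 0xFF followed by a byte whose top three bits are set.
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0) {
    return 'audio/mpeg';
  }

  // Unknown signature: the caller should fall back to the file extension.
  return undefined;
}
```

Returning undefined rather than guessing keeps the fallback explicit, which matches the advice above to use the file-path form when the bytes are not recognised.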
Step 3: Build an AudioPart manually
When you need to supply an audio format that the automatic MIME detection would not recognise, build the AudioPart directly:

    import { complete } from '@combycode/llm-sdk';
    import type { AudioPart } from '@combycode/llm-sdk';
    import { readFileSync } from 'fs';
    import { Buffer } from 'buffer';

    // Read the clip and base64-encode it (no data-URL prefix).
    const raw = new Uint8Array(readFileSync('./clip.mp3'));
    const b64 = Buffer.from(raw).toString('base64');

    // State the MIME type explicitly instead of relying on detection.
    const audioPart: AudioPart = {
      type: 'audio',
      source: { type: 'base64', mimeType: 'audio/mpeg', data: b64 },
    };

    const { text } = await complete({
      model: process.env.LLM_MODEL!,
      apiKey: process.env.LLM_API_KEY,
      messages: [
        {
          role: 'user',
          content: [
            { type: 'text', text: 'Summarise what is said in this clip.' },
            audioPart,
          ],
        },
      ],
      maxTokens: 256,
    });

Step 4: Control the output voice for OpenAI audio models
OpenAI’s audio-capable Chat Completions models (e.g. gpt-4o-audio-preview) can return both a text transcript and an audio reply in the same call. You control the output voice and format via the audio option:

    import { complete } from '@combycode/llm-sdk';

    const { text } = await complete({
      model: 'openai/gpt-4o-audio-preview',
      apiKey: process.env.OPENAI_API_KEY,
      prompt: 'Listen to this and reply in English.',
      attachments: ['./question.wav'],
      audio: { voice: 'alloy', format: 'wav' },
      maxTokens: 512,
    });
    // text contains the transcript / text reply
    // audio output (spoken reply) is in the response parts when present

audio.voice and audio.format are forwarded by the adapter into the audio object on the Chat Completions body. They have no effect on Google, which ignores the audio option when audio output is not requested.
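As a rough picture of that forwarding, the sketch below extends a Chat Completions body when audio input is present. It is an assumption-laden illustration, not the adapter's code: applyAudioOptions and the ChatBody type are hypothetical, the 'alloy' default comes from the Gotchas section, the 'aac' fallback comes from the options table, and defaulting a missing format to 'wav' is a guess made for this sketch. Voice aliases are not resolved here.

```typescript
// Hypothetical sketch of how an adapter might extend a Chat Completions body.
// The body fields (modalities, audio.voice, audio.format) are the ones named
// in this article; the function and types are not SDK exports.

type AudioOption = { voice?: string; format?: string };

type ChatBody = {
  model: string;
  messages: unknown[];
  modalities?: string[];
  audio?: { voice: string; format: string };
};

function applyAudioOptions(
  body: ChatBody,
  hasAudioInput: boolean,
  audio?: AudioOption,
): ChatBody {
  // Without an AudioPart in the messages, the body is returned unchanged.
  if (!hasAudioInput) return body;

  // 'aac' falls back to 'wav' for the spoken reply.
  // Using 'wav' when no format is given is an assumption of this sketch.
  const requested = audio?.format ?? 'wav';
  const format = requested === 'aac' ? 'wav' : requested;

  return {
    ...body,
    modalities: ['text', 'audio'],
    audio: { voice: audio?.voice ?? 'alloy', format },
  };
}
```

Returning a new object instead of mutating the input keeps the original request reusable for a provider that needs no extra fields.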
Your options
AudioPart shape (ContentPart of type 'audio'):
| Field | Type | Description |
|---|---|---|
| type | 'audio' | Discriminator; set by loadContent() when MIME starts with audio/. |
| source | DataSource | Where the audio bytes come from (see below). |
DataSource variants for audio input:
| type | Required fields | When to use |
|---|---|---|
| 'base64' | mimeType: string, data: string | Audio bytes encoded as base64 (no data-URL prefix). Output of loadContent(). |
| 'buffer' | mimeType: string, data: Uint8Array | Raw bytes in memory with explicit MIME. |
| 'path' | mimeType: string, path: string | Local file path (Node/Bun). SDK reads and encodes. Prefer attachments for simplicity. |
| 'url' | url: string | Remote URL. SDK fetches, encodes, sends as base64. |
| 'file' | fileId: string | Files API reference. Google accepts fileData with a URI; not supported by the OpenAI audio input path. |
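The AudioPart and DataSource tables can be read as a discriminated union. The types below are a sketch transcribed from those tables; the SDK's exported types may carry more fields or differ in detail, and describeSource is a hypothetical helper added only to show how the type field narrows each variant.

```typescript
// Sketch of the shapes in the tables above; the SDK's exported types may differ.
type DataSource =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'url'; url: string }
  | { type: 'file'; fileId: string };

type AudioPart = { type: 'audio'; source: DataSource };

// Hypothetical helper: describe where the audio bytes will come from.
function describeSource(part: AudioPart): string {
  const s = part.source;
  switch (s.type) {
    case 'base64':
      return `inline base64 (${s.mimeType}, ${s.data.length} chars)`;
    case 'buffer':
      return `in-memory bytes (${s.mimeType}, ${s.data.byteLength} bytes)`;
    case 'path':
      return `local file ${s.path} (${s.mimeType})`;
    case 'url':
      return `remote URL ${s.url}`;
    case 'file':
      return `Files API reference ${s.fileId}`;
  }
}
```

Because every variant carries a literal type field, the compiler checks that each branch only touches the fields that variant actually has.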
Audio MIME types auto-detected:
| Extension / Magic bytes | MIME type |
|---|---|
| .wav / RIFF+WAVE header | audio/wav |
| .mp3 / ID3 tag or FF Ex | audio/mpeg |
| .m4a | audio/mp4 |
| .ogg | audio/ogg |
| .flac | audio/flac |
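The extension half of that table amounts to a lookup. The sketch below mirrors the rows listed above; audioMimeFromPath and the constant name are hypothetical, and the SDK may recognise more extensions than are shown here.

```typescript
// Hypothetical lookup mirroring the table above; not the SDK's own code.
const AUDIO_MIME_BY_EXTENSION: Record<string, string> = {
  '.wav': 'audio/wav',
  '.mp3': 'audio/mpeg',
  '.m4a': 'audio/mp4',
  '.ogg': 'audio/ogg',
  '.flac': 'audio/flac',
};

// Returns the audio MIME type for a path, or undefined for unknown extensions.
function audioMimeFromPath(path: string): string | undefined {
  const dot = path.lastIndexOf('.');
  if (dot < 0) return undefined;
  // Compare case-insensitively so 'CLIP.MP3' and 'clip.mp3' behave the same.
  return AUDIO_MIME_BY_EXTENSION[path.slice(dot).toLowerCase()];
}
```

An undefined result is the cue to build the AudioPart manually with an explicit mimeType, as in Step 3.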
OpenAI audio option (output control when sending audio input):
| Field | Values | Notes |
|---|---|---|
| voice | 'alloy', 'coral', 'shimmer', 'echo', or a voice alias ('neutral', 'warm', 'bright', 'deep') | Voice for the spoken audio reply. Aliases are resolved by the SDK. |
| format | 'wav', 'mp3', 'opus', 'flac', 'pcm16' | Format of the spoken audio reply. 'aac' falls back to 'wav'. |
Provider support for audio input:
| Provider | Models | Supported audio formats | Notes |
|---|---|---|---|
| OpenAI (Chat Completions) | gpt-4o-audio-preview, gpt-4o-mini-audio-preview | WAV, MP3 only | Requires modalities: ['text', 'audio'] — the adapter sets this automatically when an AudioPart is detected. |
| Google | Gemini 1.5+, Gemini 2.0+ | WAV, MP3, OGG, FLAC, M4A, many more | inlineData path; no extra flags needed. |
| Anthropic | None | — | Audio input in complete() is not supported. Use transcribe() for STT. |
Audio input vs. transcribe() — which to use:
| | Audio input to complete() | transcribe() (article 18) |
|---|---|---|
| What it does | Full multimodal model call: understand + respond | Dedicated STT endpoint: speech -> text only |
| Output | Text reply (+ optional audio reply on OpenAI) | { text } only |
| Cost | Per-token pricing for the full model | Per-minute pricing (OpenAI Whisper / gpt-4o-transcribe) |
| Best for | Reason about audio content, combine with other context | High-volume cheap transcription |
| Provider | OpenAI audio models, Google Gemini | OpenAI (dedicated endpoint), Google (via completion) |
Compare the SDKs
OpenAI Chat Completions requires three extra fields for audio input: modalities, audio.voice, and audio.format on the request body, plus a non-standard input_audio content block type. Google needs no extra flags, just inlineData. The ORXA adapter detects AudioPart in the message content and automatically sets modalities: ['text', 'audio'] on the OpenAI request; for Google it maps to inlineData. Your code remains the same for both providers.
Gotchas and next steps
OpenAI audio models only accept WAV and MP3. The adapter coerces the format to wav for any mimeType that does not contain mpeg or mp3, so audio in any other format is still sent labelled as WAV. Send your audio as WAV or MP3 to be safe.
OpenAI always enables audio output when audio input is detected. When the adapter sees an AudioPart in the message content it adds modalities: ['text', 'audio'] to the body and requires a voice. The default voice is 'alloy'. If you only want transcription and not a spoken reply, use transcribe() instead.
Google Gemini audio support is broad. Linear PCM, WAV, MP3, AAC, AIFF, FLAC, OGG, and more are accepted. Duration limits vary by model (typically up to 8.4 hours of audio per request).
Cost is not per-minute here. Audio input to complete() is billed as tokens (the audio is tokenised internally). OpenAI charges roughly 1 token per 0.1s of audio. Google charges per token. Use transcribe() for bulk speech-to-text at per-minute rates.
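For budgeting, the rough OpenAI figure above (about 1 token per 0.1 s, so about 10 tokens per second) turns into simple arithmetic. The sketch below is only an estimate under that stated approximation: estimateAudioInputTokens is a hypothetical helper, and real billing depends on the provider's current pricing and tokenisation.

```typescript
// Rough estimate only: assumes ~1 token per 0.1 s of audio, as stated above.
const APPROX_TOKENS_PER_SECOND = 10;

// Hypothetical helper: approximate input tokens for a clip of the given length.
function estimateAudioInputTokens(durationSeconds: number): number {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError('durationSeconds must be a non-negative finite number');
  }
  // Round up so a partial 0.1 s slice still counts as one token.
  return Math.ceil(durationSeconds * APPROX_TOKENS_PER_SECOND);
}
```

Under this approximation a 60-second clip comes to about 600 input tokens before the prompt text and the reply are counted.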
Next steps:
- Speech to text (STT) — dedicated transcription endpoint, cheaper for bulk STT
- Text to speech (TTS) — generate audio from text
- Realtime — live bidirectional audio sessions