Audio input


What you will achieve

Send a WAV file of someone saying 'hello' with the prompt 'Transcribe this audio. Reply with only the spoken words.', then assert that the response matches /hello/i on OpenAI (Chat Completions audio model) and Google Gemini. Anthropic does not support audio input in complete().
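
The check itself can be sketched in a few lines. This is only an illustration: `saysHello` is a hypothetical helper, and `reply` stands in for the `text` value returned by complete().

```typescript
// Hypothetical helper: true when the model's reply contains "hello" in any letter case.
function saysHello(reply: string): boolean {
  return /hello/i.test(reply);
}

console.log(saysHello('Hello.')); // true
console.log(saysHello('Goodbye')); // false
```

A case-insensitive match tolerates the capitalisation and punctuation that models tend to add even when asked for only the spoken words.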

When and why you need this

Use audio input to a chat model when you want the model to reason about the audio alongside other context — combining speech recognition, intent detection, sentiment analysis, and a text response in one call. Examples: classify a support call recording, answer a question spoken in an audio clip, or transcribe-and-reply in one step.

This is different from transcribe() (article 18). transcribe() calls a dedicated speech-to-text endpoint (OpenAI /v1/audio/transcriptions); it is cheaper and returns only text. Audio input to complete() runs the full multimodal model — more capable, higher cost per second of audio.

The challenge with raw provider SDKs:

OpenAI Chat Completions (audio models) requires modalities: ['text', 'audio'] on the request body, an audio object with voice and format, and the audio itself as { type: 'input_audio', input_audio: { data, format } } — a completely separate content block from images.
Google accepts audio as an inlineData part with the audio MIME type, exactly like images. No extra modality flags.

The attachments API handles both paths with one call.
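
To make the difference concrete, here is a minimal sketch of the two content shapes described above. The type and function names (`OpenAIInputAudio`, `GeminiInlineData`, `toOpenAIAudioBlock`, `toGeminiAudioPart`) are hypothetical and are not part of the SDK; only the field names come from the text above.

```typescript
// Hypothetical names; the field layout follows the two shapes described above.
type OpenAIInputAudio = {
  type: 'input_audio';
  input_audio: { data: string; format: 'wav' | 'mp3' };
};

type GeminiInlineData = {
  inlineData: { mimeType: string; data: string };
};

// OpenAI: a dedicated content block that carries a short format name, not a MIME type.
function toOpenAIAudioBlock(base64: string, format: 'wav' | 'mp3'): OpenAIInputAudio {
  return { type: 'input_audio', input_audio: { data: base64, format } };
}

// Google: the same inlineData part used for images, keyed by MIME type.
function toGeminiAudioPart(base64: string, mimeType: string): GeminiInlineData {
  return { inlineData: { mimeType, data: base64 } };
}

const sampleBase64 = 'UklGRg=='; // base64 of the ASCII bytes "RIFF", standing in for real audio
console.log(JSON.stringify(toOpenAIAudioBlock(sampleBase64, 'wav')));
console.log(JSON.stringify(toGeminiAudioPart(sampleBase64, 'audio/wav')));
```

The same base64 payload ends up in two differently shaped objects, which is the per-provider branching that attachments is meant to hide.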

Step by step

Step 1 — Send a WAV file by path

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Transcribe this audio. Reply with only the spoken words.',
  attachments: ['./hello.wav'],
  maxTokens: 256,
});

console.log(text); // "hello"

loadContent() detects audio/wav from the .wav extension (or from the RIFF+WAVE magic bytes), base64-encodes the file, and returns an AudioPart with a base64 DataSource. The provider adapter then uses the correct wire shape.

Step 2 — Pass raw audio bytes

import { complete } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

const audioBytes = new Uint8Array(readFileSync('./recording.wav'));

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What language is being spoken in this recording?',
  attachments: [audioBytes],
  maxTokens: 64,
});

MIME is detected from the leading bytes: RIFF+WAVE -> audio/wav, ID3/FF Fx -> audio/mpeg. Formats that are not recognised from their leading bytes here (Opus, FLAC, AAC) need the file-path form, so that the extension can supply the MIME type.
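
As an illustration of leading-byte detection, the sketch below checks the signatures named above. `sniffAudioMime` is a hypothetical function, not the SDK's sniffer, and the MPEG check assumes the standard 11-bit frame sync, which accepts both FF Ex and FF Fx second bytes.

```typescript
// Hypothetical helper; mirrors the signatures named above, not the SDK's actual sniffer.
function sniffAudioMime(bytes: Uint8Array): string | undefined {
  // True when `signature` appears at `offset`; out-of-range reads simply fail the match.
  const hasSignature = (signature: number[], offset = 0): boolean =>
    signature.every((byte, i) => bytes[offset + i] === byte);

  const RIFF = [0x52, 0x49, 0x46, 0x46]; // "RIFF"
  const WAVE = [0x57, 0x41, 0x56, 0x45]; // "WAVE"
  const ID3 = [0x49, 0x44, 0x33]; // "ID3"

  // WAV: "RIFF" at offset 0 and "WAVE" at offset 8 (bytes 4-7 hold the chunk size).
  if (hasSignature(RIFF, 0) && hasSignature(WAVE, 8)) return 'audio/wav';

  // MP3 with an ID3v2 tag starts with the ASCII bytes "ID3".
  if (hasSignature(ID3, 0)) return 'audio/mpeg';

  // Bare MPEG audio frame: 0xFF followed by a byte whose top three bits are set.
  if (bytes.length >= 2 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0) {
    return 'audio/mpeg';
  }

  return undefined; // unknown: the caller would fall back to the file extension
}

const wavHeader = new Uint8Array([0x52, 0x49, 0x46, 0x46, 0, 0, 0, 0, 0x57, 0x41, 0x56, 0x45]);
console.log(sniffAudioMime(wavHeader)); // audio/wav
```

Returning undefined rather than guessing keeps the fallback decision with the caller, for example by switching to the file-path form.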

Step 3 — Build an AudioPart manually

When you need to supply an audio format that the extension-based MIME sniff would not detect, build the AudioPart directly:

import { complete } from '@combycode/llm-sdk';
import type { AudioPart } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';
import { Buffer } from 'buffer';

const raw = new Uint8Array(readFileSync('./clip.mp3'));
const b64 = Buffer.from(raw).toString('base64');

const audioPart: AudioPart = {
  type: 'audio',
  source: { type: 'base64', mimeType: 'audio/mpeg', data: b64 },
};

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Summarise what is said in this clip.' },
        audioPart,
      ],
    },
  ],
  maxTokens: 256,
});

Step 4 — Control the output voice for OpenAI audio models

OpenAI’s audio-capable Chat Completions models (e.g. gpt-4o-audio-preview) can return both a text transcript and an audio reply in the same call. You control the output voice and format via audio:

import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: 'openai/gpt-4o-audio-preview',
  apiKey: process.env.OPENAI_API_KEY,
  prompt: 'Listen to this and reply in English.',
  attachments: ['./question.wav'],
  audio: { voice: 'alloy', format: 'wav' },
  maxTokens: 512,
});
// text contains the transcript / text reply
// audio output (spoken reply) is in the response parts when present

audio.voice and audio.format are forwarded by the adapter into the audio object on the Chat Completions body. They have no effect on Google, which ignores the audio option when audio output is not requested.

Your options

AudioPart shape (ContentPart of type 'audio'):

Field	Type	Description
`type`	`'audio'`	Discriminator — set by `loadContent()` when MIME starts with `audio/`.
`source`	`DataSource`	Where the audio bytes come from (see below).

DataSource variants for audio input:

`type`	Required fields	When to use
`'base64'`	`mimeType: string`, `data: string`	Audio bytes encoded as base64 (no data-URL prefix). Output of `loadContent()`.
`'buffer'`	`mimeType: string`, `data: Uint8Array`	Raw bytes in memory with explicit MIME.
`'path'`	`mimeType: string`, `path: string`	Local file path (Node/Bun). SDK reads and encodes. Prefer `attachments` for simplicity.
`'url'`	`url: string`	Remote URL. SDK fetches, encodes, sends as `base64`.
`'file'`	`fileId: string`	Files API reference. Google accepts `fileData` with a URI; not supported by OpenAI audio input path.
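
The two tables above can be read as a discriminated union. The declarations below are reconstructed from those tables for illustration only and may differ from the SDK's real types; `describeSource` is a hypothetical helper.

```typescript
// Reconstructed from the tables above; the SDK's real declarations may differ.
type DataSource =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'url'; url: string }
  | { type: 'file'; fileId: string };

type AudioPart = { type: 'audio'; source: DataSource };

// Hypothetical helper: summarise a source without touching its payload.
function describeSource(source: DataSource): string {
  switch (source.type) {
    case 'base64':
      return `base64 ${source.mimeType}, ${source.data.length} chars`;
    case 'buffer':
      return `buffer ${source.mimeType}, ${source.data.byteLength} bytes`;
    case 'path':
      return `path ${source.path} (${source.mimeType})`;
    case 'url':
      return `url ${source.url}`;
    case 'file':
      return `file ${source.fileId}`;
  }
}

const examplePart: AudioPart = {
  type: 'audio',
  source: { type: 'path', mimeType: 'audio/wav', path: './hello.wav' },
};
console.log(describeSource(examplePart.source)); // path ./hello.wav (audio/wav)
```

Switching on `type` lets the compiler narrow each branch to the fields that variant requires, so a `'url'` source cannot be read as if it carried `mimeType`.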

Audio MIME types auto-detected:

Extension / Magic bytes	MIME type
`.wav` / RIFF+WAVE header	`audio/wav`
`.mp3` / ID3 tag or FF Ex	`audio/mpeg`
`.m4a`	`audio/mp4`
`.ogg`	`audio/ogg`
`.flac`	`audio/flac`
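
The extension column of this table amounts to a simple lookup. The sketch below is an assumption about how such a lookup could work, not the SDK's code; `AUDIO_MIME_BY_EXTENSION` and `mimeFromPath` are hypothetical names.

```typescript
// Extension -> MIME pairs copied from the table above.
const AUDIO_MIME_BY_EXTENSION = new Map<string, string>([
  ['wav', 'audio/wav'],
  ['mp3', 'audio/mpeg'],
  ['m4a', 'audio/mp4'],
  ['ogg', 'audio/ogg'],
  ['flac', 'audio/flac'],
]);

// Hypothetical helper: resolve a MIME type from a file path, ignoring letter case.
function mimeFromPath(path: string): string | undefined {
  // Keep only the file name so dots in directory names are not mistaken for extensions.
  const lastSeparator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
  const fileName = path.slice(lastSeparator + 1);
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".wav"
  return AUDIO_MIME_BY_EXTENSION.get(fileName.slice(dot + 1).toLowerCase());
}

console.log(mimeFromPath('./hello.wav')); // audio/wav
console.log(mimeFromPath('notes.txt')); // undefined
```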

OpenAI audio option (output control when sending audio input):

Field	Values	Notes
`voice`	`'alloy'`, `'coral'`, `'shimmer'`, `'echo'`, or a voice alias (`'neutral'`, `'warm'`, `'bright'`, `'deep'`)	Voice for the spoken audio reply. Aliases are resolved by the SDK.
`format`	`'wav'`, `'mp3'`, `'opus'`, `'flac'`, `'pcm16'`	Format of the spoken audio reply. `'aac'` falls back to `'wav'`.

Provider support for audio input:

Provider	Models	Supported audio formats	Notes
OpenAI (Chat Completions)	`gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview`	WAV, MP3 only	Requires `modalities: ['text', 'audio']` — the adapter sets this automatically when an `AudioPart` is detected.
Google	Gemini 1.5+, Gemini 2.0+	WAV, MP3, OGG, FLAC, M4A, many more	`inlineData` path; no extra flags needed.
Anthropic	None	—	Audio input in `complete()` is not supported. Use `transcribe()` for STT.

Audio input vs. transcribe() — which to use:

	Audio input to `complete()`	`transcribe()` (article 18)
What it does	Full multimodal model call: understand + respond	Dedicated STT endpoint: speech -> text only
Output	Text reply (+ optional audio reply on OpenAI)	`{ text }` only
Cost	Per-token pricing for the full model	Per-minute pricing (OpenAI Whisper / gpt-4o-transcribe)
Best for	Reason about audio content, combine with other context	High-volume cheap transcription
Provider	OpenAI audio models, Google Gemini	OpenAI (dedicated endpoint), Google (via completion)

Compare the SDKs

import { complete } from '@combycode/llm-sdk';

// Audio understanding via the unified attachments API — loadContent() detects the
// wav by MIME and attaches it as input audio. (Official openai needs modalities:
// ['text','audio'] + audio config and reads message.audio.transcript.)
const t0 = performance.now();
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Transcribe this audio. Reply with only the spoken words.',
  attachments: ['../../official-samples/_fixtures/hello.wav'],
  maxTokens: 256,
});
console.log(JSON.stringify({ result: text.trim() || 'empty', ms: Math.round(performance.now() - t0) }));

OpenAI Chat Completions requires three extra fields for audio input: modalities, audio.voice, and audio.format on the request body, plus a non-standard input_audio content block type. Google needs no extra flags — just inlineData. The ORXA adapter detects AudioPart in the message content and automatically sets modalities: ['text', 'audio'] on the OpenAI request; for Google it maps to inlineData. Your code remains the same for both providers.

Gotchas and next steps

OpenAI audio models only accept WAV and MP3. The adapter coerces mimeType to wav for anything that does not contain mpeg or mp3. Send your audio as WAV or MP3 to be safe.
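
The coercion rule can be pictured as follows. This is a sketch of the behaviour described above under the assumption that the match is a case-insensitive substring test; `toOpenAIAudioFormat` is a hypothetical name, not the adapter's function.

```typescript
// Hypothetical helper: pick the OpenAI format label for a given MIME type.
function toOpenAIAudioFormat(mimeType: string): 'wav' | 'mp3' {
  const normalized = mimeType.toLowerCase();
  // Anything that does not mention mpeg or mp3 is labelled wav.
  return normalized.includes('mpeg') || normalized.includes('mp3') ? 'mp3' : 'wav';
}

console.log(toOpenAIAudioFormat('audio/mpeg')); // mp3
console.log(toOpenAIAudioFormat('audio/flac')); // wav
```

Under this rule a FLAC file is only relabelled as wav; nothing in the rule converts the bytes, which is why sending real WAV or MP3 is the safe choice.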

On OpenAI, audio output is always enabled when audio input is detected. When the adapter sees an AudioPart in the message content, it adds modalities: ['text', 'audio'] to the body, which in turn requires a voice. The default voice is 'alloy'. If you only want transcription and not a spoken reply, use transcribe() instead.
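
A rough sketch of that behaviour, for illustration only: `openAIAudioFields` is a hypothetical function, and the default format of 'wav' is an assumption made here (the text above only states the default voice).

```typescript
type AudioOption = { voice?: string; format?: string };

type OpenAIAudioFields = {
  modalities?: string[];
  audio?: { voice: string; format: string };
};

// Hypothetical helper: the extra body fields added when audio input is present.
function openAIAudioFields(hasAudioInput: boolean, audio: AudioOption = {}): OpenAIAudioFields {
  if (!hasAudioInput) return {}; // text-only requests are left untouched
  return {
    modalities: ['text', 'audio'],
    audio: {
      voice: audio.voice ?? 'alloy', // default voice stated above
      format: audio.format ?? 'wav', // assumed default, not confirmed by the text
    },
  };
}

console.log(JSON.stringify(openAIAudioFields(true)));
// {"modalities":["text","audio"],"audio":{"voice":"alloy","format":"wav"}}
```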

Google Gemini audio support is broad. Linear PCM, WAV, MP3, AAC, AIFF, FLAC, OGG, and more are accepted. Duration limits vary by model (typically up to 8.4 hours of audio per request).

Cost is not per-minute here. Audio input to complete() is billed as tokens (the audio is tokenised internally). OpenAI charges roughly 1 token per 0.1s of audio. Google charges per token. Use transcribe() for bulk speech-to-text at per-minute rates.
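
For a back-of-the-envelope estimate, the rough OpenAI figure above (1 token per 0.1s, so about 10 tokens per second) can be turned into a small calculation. `estimateOpenAIAudioTokens` is a hypothetical helper, and the result is only as accurate as that approximation.

```typescript
// Approximate rate taken from the text above: 1 token per 0.1s of audio.
const APPROX_AUDIO_TOKENS_PER_SECOND = 10;

// Hypothetical helper: rough input-token count for a clip of the given length.
function estimateOpenAIAudioTokens(durationSeconds: number): number {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError('durationSeconds must be a non-negative finite number');
  }
  return Math.ceil(durationSeconds * APPROX_AUDIO_TOKENS_PER_SECOND);
}

console.log(estimateOpenAIAudioTokens(60)); // 600
console.log(estimateOpenAIAudioTokens(3600)); // 36000
```

Multiplying the estimate by the model's per-token audio price gives a figure that can be compared against the per-minute rate of transcribe().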

Next steps:

Speech to text (STT) — dedicated transcription endpoint, cheaper for bulk STT
Text to speech (TTS) — generate audio from text
Realtime — live bidirectional audio sessions