
Speech to text (STT)

Transcribe a WAV file saying 'hello' via transcribe() and assert the returned { text } matches /hello/i on OpenAI and Google.

Use transcribe() when you need speech converted to text at scale or at low cost — call recordings, meeting notes, podcast captions, voice commands. It routes to a dedicated transcription endpoint (OpenAI) or a lightweight completion call (Google) and returns only { text }.

This is different from audio input to complete() (article 15). complete() with an AudioPart runs the full multimodal model — it can reason about the audio, combine it with text context, and return a thoughtful reply. transcribe() is purely speech-to-text: cheaper per minute, no reasoning, no conversation context.

The challenge with raw provider SDKs:

  • OpenAI has a dedicated endpoint POST /v1/audio/transcriptions that accepts a multipart form upload (file, model, optional language). The response is { text: string }. You must buffer the file, build the FormData, and parse the response yourself.
  • Google has no dedicated transcription endpoint. The adapter calls complete() with the audio inlined and a 'Transcribe this audio. Reply with only the spoken words.' system prompt, then returns the text response.
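To make the OpenAI bullet concrete, here is a minimal sketch of the manual request that transcribe() replaces. It assumes a runtime with global fetch, FormData, and Blob (Node 18+ or a browser). The names buildTranscriptionForm and transcribeRaw are illustrative only and are not part of the SDK.

```typescript
// Hypothetical helpers: a hand-rolled call to OpenAI's transcription endpoint.
// Assumes global fetch, FormData, and Blob are available in the runtime.

function buildTranscriptionForm(
  bytes: Uint8Array,
  mimeType: string,
  filename: string,
  model: string,
  language?: string,
): FormData {
  // Copy into a fresh buffer so the Blob holds exactly these bytes.
  const copy = new Uint8Array(bytes.length);
  copy.set(bytes);

  const form = new FormData();
  form.append('file', new Blob([copy], { type: mimeType }), filename);
  form.append('model', model);
  if (language) form.append('language', language); // optional field
  return form;
}

async function transcribeRaw(form: FormData, apiKey: string): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    // Do not set Content-Type by hand: fetch adds the multipart boundary.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
  const json = (await res.json()) as { text: string };
  return json.text;
}
```

Splitting form construction from the network call keeps the first half easy to unit-test without an API key.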

transcribe() routes to the right path and normalises the result to { text }.

import { transcribe } from '@combycode/llm-sdk';

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './hello.wav',
});
console.log(text); // "hello"

Pass a file path as audio. The SDK reads the file with loadContent(), detects the MIME type, and sends it as a multipart form upload to /v1/audio/transcriptions. The returned text is the raw transcript.
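Because the transcript comes back raw, a test that checks for 'hello' is more robust if it compares loosely. Speech models often return capitalised, punctuated text such as "Hello." rather than the bare word. The sketch below is a hypothetical test helper, not an SDK function, and it only strips a small set of ASCII punctuation.

```typescript
// Hypothetical test helper: compare a raw transcript against an expected phrase
// while ignoring casing, common punctuation, and extra whitespace.
function normalizeTranscript(text: string): string {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"'()-]/g, '') // drop common ASCII punctuation
    .replace(/\s+/g, ' ')
    .trim();
}

function transcriptMatches(text: string, expected: string): boolean {
  return normalizeTranscript(text).includes(normalizeTranscript(expected));
}

console.log(transcriptMatches(' Hello. ', 'hello')); // true
```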

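import { transcribe } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

// Raw bytes carry no MIME information, so the SDK treats them as audio/wav
// unless told otherwise.
const audioBytes = new Uint8Array(readFileSync('./recording.mp3'));
const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: audioBytes,
});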

audio accepts string (file path), Uint8Array (raw bytes — MIME defaults to audio/wav), or an AudioInput object for explicit MIME declaration. Bytes are uploaded as a multipart form; the filename extension is inferred from the MIME type (audio.wav, audio.mp3, etc.).
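The MIME-to-filename inference described above can be sketched as a small lookup. This is an assumption about the shape of the logic, not the SDK's actual code: only audio.wav and audio.mp3 are named in the text, so the remaining rows and the fallback to wav are guesses, and uploadFilename is a hypothetical name.

```typescript
// Hypothetical lookup from MIME type to the filename sent in the multipart form.
// Only the wav and mp3 rows are confirmed by the text; the rest are assumptions.
const EXTENSION_BY_MIME = new Map<string, string>([
  ['audio/wav', 'wav'],
  ['audio/mpeg', 'mp3'],
  ['audio/mp4', 'm4a'],
  ['audio/ogg', 'ogg'],
  ['audio/flac', 'flac'],
  ['audio/webm', 'webm'],
]);

function uploadFilename(mimeType: string = 'audio/wav'): string {
  // Strip parameters such as "; codecs=opus" and normalise case.
  const base = mimeType.replace(/;.*$/, '').trim().toLowerCase();
  // Assumed fallback: mirror the documented audio/wav default.
  const ext = EXTENSION_BY_MIME.get(base) ?? 'wav';
  return `audio.${ext}`;
}

console.log(uploadFilename('audio/mpeg')); // audio.mp3
```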

Step 3 — Declare MIME explicitly for non-WAV formats

import { transcribe } from '@combycode/llm-sdk';
import type { AudioInput } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

const bytes = new Uint8Array(readFileSync('./clip.mp3'));
// Declare the MIME type so the bytes are not treated as audio/wav.
const audio: AudioInput = {
  data: bytes,
  mimeType: 'audio/mpeg',
};
const { text } = await transcribe({
  model: 'openai/gpt-4o-transcribe',
  apiKey: process.env.OPENAI_API_KEY,
  audio,
});

When passing raw bytes, the MIME type defaults to audio/wav. Pass an AudioInput object with mimeType to override. The MIME controls both the Content-Type header in the multipart form and the inferred filename extension sent to the API.
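If the format of a byte buffer is not known in advance, one option is to sniff the container signature before building the AudioInput. The sketch below checks a few well-known magic numbers. sniffAudioMime is a hypothetical helper and not something the SDK is documented to provide, and signature sniffing is a heuristic rather than full format validation.

```typescript
// Hypothetical helper: guess a MIME type from well-known container signatures.
function sniffAudioMime(bytes: Uint8Array): string | undefined {
  // Read `length` bytes at `offset` as a Latin-1 string (shorter if out of range).
  const tag = (offset: number, length: number): string =>
    Array.from(bytes.subarray(offset, offset + length), (b) => String.fromCharCode(b)).join('');

  if (tag(0, 4) === 'RIFF' && tag(8, 4) === 'WAVE') return 'audio/wav';
  if (tag(0, 4) === 'fLaC') return 'audio/flac';
  if (tag(0, 4) === 'OggS') return 'audio/ogg';
  if (tag(4, 4) === 'ftyp') return 'audio/mp4'; // ISO base media (M4A, MP4)
  if (tag(0, 4) === '\x1a\x45\xdf\xa3') return 'audio/webm'; // EBML header, shared with Matroska
  if (tag(0, 3) === 'ID3') return 'audio/mpeg'; // MP3 with an ID3v2 tag

  // Bare MPEG audio frame: 11 sync bits set and a non-reserved layer field.
  // The layer check keeps ADTS AAC (layer bits 00) from being reported as MP3.
  const b0 = bytes[0] ?? 0;
  const b1 = bytes[1] ?? 0;
  if (bytes.length >= 2 && b0 === 0xff && (b1 & 0xe0) === 0xe0 && ((b1 >> 1) & 0x03) !== 0) {
    return 'audio/mpeg';
  }
  return undefined;
}
```

A caller could then write `mimeType: sniffAudioMime(bytes) ?? 'audio/wav'` when constructing the AudioInput, keeping the documented default as the fallback.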

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './recording-fr.wav',
  language: 'fr',
});

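language is a BCP-47 language code (e.g. 'en', 'fr', 'de', 'ja'). Prefer the bare two-letter form: it is a valid BCP-47 tag and also matches the ISO-639-1 format that OpenAI’s endpoint documents for this field, whereas region-qualified tags such as 'fr-CA' may not be accepted. For OpenAI the value is sent as the language field in the multipart form, which can improve accuracy and speed for non-English audio. For Google there is no dedicated field, so the value is passed in the completion prompt.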
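When language tags come from user settings or browser locales, they often carry a region ('fr-CA', 'pt_BR'). OpenAI’s API reference describes the language field in ISO-639-1 terms, so reducing the tag to its primary subtag first is a cautious choice. primaryLanguage below is a hypothetical helper. Whether the SDK performs any such normalisation itself is not stated in this article.

```typescript
// Hypothetical helper: reduce a BCP-47 style tag to its primary language subtag.
function primaryLanguage(tag: string): string {
  // Take everything before the first '-' or '_' and lowercase it.
  const primary = tag.trim().split(/[-_]/, 1).join('').toLowerCase();
  // Primary subtags are 2 or 3 letters; ISO-639-1 codes are the 2-letter ones.
  if (!/^[a-z]{2,3}$/.test(primary)) {
    throw new Error(`Not a recognisable language tag: ${tag}`);
  }
  return primary;
}

console.log(primaryLanguage('fr-CA')); // fr
```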

Step 5 — Provide audio duration for accurate cost tracking

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './call.wav',
  audioDurationSeconds: 127,
});

OpenAI’s transcription endpoint does not return audio duration in the response, so the SDK cannot calculate cost without it. When audioDurationSeconds is omitted:

  • For WAV files, the SDK parses the RIFF header to derive duration from sampleRate, channels, and data chunk size — no extra input needed.
  • For all other formats (MP3, FLAC, OGG, etc.), the SDK emits a cost of zero with a log note explaining that duration is unknown.

Pass audioDurationSeconds for non-WAV audio to get accurate cost reporting from the onCostEntry hook.
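The WAV case works because the RIFF header carries everything needed: duration is the data chunk size divided by the byte rate (sample rate × channels × bytes per sample). The following is a minimal sketch of that calculation under the assumption of uncompressed PCM. wavDurationSeconds is a hypothetical name and the SDK’s own parser may differ in detail.

```typescript
// Hypothetical helper: derive duration in seconds from a PCM WAV file's RIFF header.
function wavDurationSeconds(bytes: Uint8Array): number | undefined {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tag = (o: number): string =>
    String.fromCharCode(view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3));

  if (bytes.byteLength < 12 || tag(0) !== 'RIFF' || tag(8) !== 'WAVE') return undefined;

  let byteRate = 0; // sampleRate * channels * bytesPerSample
  let offset = 12; // first chunk starts after "RIFF", size, "WAVE"
  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true); // chunk sizes are little-endian
    const body = offset + 8;
    if (id === 'fmt ' && body + 16 <= bytes.byteLength) {
      const channels = view.getUint16(body + 2, true);
      const sampleRate = view.getUint32(body + 4, true);
      const bitsPerSample = view.getUint16(body + 14, true);
      byteRate = sampleRate * channels * (bitsPerSample / 8);
    } else if (id === 'data') {
      // Uses the declared size; streamed WAVs may declare 0 or 0xFFFFFFFF here.
      return byteRate > 0 ? size / byteRate : undefined;
    }
    offset = body + size + (size % 2); // chunks are padded to an even length
  }
  return undefined;
}
```

Walking the chunk list instead of assuming a fixed 44-byte header lets the sketch tolerate extra chunks (such as LIST metadata) placed before the data chunk.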

const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './recording.wav',
});

For Google, transcribe() delegates to complete() internally: it attaches the audio as an AudioPart and uses the default prompt 'Transcribe this audio. Reply with only the spoken words.'. Override the prompt via prompt:

const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './meeting.wav',
  prompt: 'Transcribe this meeting audio. Include speaker labels if you can identify them.',
});
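For orientation, the completion request behind the Google path looks roughly like the JSON body below when written against Gemini’s REST generateContent endpoint. This is a sketch of the provider request, not the SDK’s internal code. buildGeminiTranscriptionBody is a hypothetical name, and sending the prompt as a plain text part (rather than as a system instruction) is an assumption.

```typescript
// Hypothetical helper: the JSON body for a Gemini generateContent call with inline audio.
// It would be POSTed to
//   https://generativelanguage.googleapis.com/v1beta/models/<model>:generateContent
type GeminiPart = { text: string } | { inline_data: { mime_type: string; data: string } };

interface GenerateContentBody {
  contents: Array<{ parts: GeminiPart[] }>;
}

const DEFAULT_PROMPT = 'Transcribe this audio. Reply with only the spoken words.';

function toBase64(bytes: Uint8Array): string {
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary); // btoa is a global in browsers and in Node 16+
}

function buildGeminiTranscriptionBody(
  bytes: Uint8Array,
  mimeType: string,
  prompt: string = DEFAULT_PROMPT,
): GenerateContentBody {
  return {
    contents: [
      {
        parts: [
          { text: prompt }, // assumption: prompt sent as an ordinary text part
          { inline_data: { mime_type: mimeType, data: toBase64(bytes) } },
        ],
      },
    ],
  };
}
```

Base64 inflates the payload by about a third, which is worth keeping in mind when audio is sent inline.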

TranscribeOptions — full parameter set:

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Namespaced ('openai/whisper-1') or bare with provider. |
| provider | ProviderName | When model is bare | 'openai' or 'google'. |
| apiKey | string | No | Falls back to engine.apiKeys[provider]. |
| audio | string \| Uint8Array \| AudioInput | Yes | File path, raw bytes, or AudioInput with explicit mimeType. |
| language | string | No | BCP-47 language code. Sent to OpenAI as language field. |
| prompt | string | No | Transcription prompt for Google (completion path). Ignored by OpenAI. Default: 'Transcribe this audio. Reply with only the spoken words.' |
| audioDurationSeconds | number | No | Caller-supplied duration for cost calculation. Auto-derived from WAV header when absent. Other formats emit honest zero. |
| engine | EngineHandle | No | Share an existing engine instance. |
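The model and provider rows interact: a namespaced model carries its own provider, while a bare model needs one supplied. A sketch of how such resolution might work is shown here. resolveModel and isProvider are hypothetical names, and the behaviour for an unknown prefix or a missing provider (throwing) is an assumption rather than documented SDK behaviour.

```typescript
// Hypothetical resolution of 'provider/model' strings, under the assumptions above.
type ProviderName = 'openai' | 'google';

function isProvider(value: string): value is ProviderName {
  return value === 'openai' || value === 'google';
}

function resolveModel(
  model: string,
  provider?: ProviderName,
): { provider: ProviderName; model: string } {
  const slash = model.indexOf('/');
  if (slash > 0) {
    // Namespaced form: the prefix before the first '/' names the provider.
    const prefix = model.slice(0, slash);
    if (!isProvider(prefix)) throw new Error(`Unknown provider: ${prefix}`);
    return { provider: prefix, model: model.slice(slash + 1) };
  }
  // Bare form: the provider option becomes mandatory.
  if (!provider) throw new Error(`Bare model "${model}" needs an explicit provider`);
  return { provider, model };
}

console.log(resolveModel('openai/whisper-1').model); // whisper-1
```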

AudioInput object:

| Field | Type | Description |
| --- | --- | --- |
| data | Uint8Array \| string | Raw audio bytes or a file path. |
| mimeType | string \| undefined | Explicit MIME type. Overrides detection. |
| sampleRate | number \| undefined | PCM sample rate hint (for raw/stream audio). |

Audio formats accepted:

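| Format | MIME | OpenAI | Google |
| --- | --- | --- | --- |
| WAV | audio/wav | Yes | Yes |
| MP3 | audio/mpeg | Yes | Yes |
| M4A / AAC | audio/mp4 | Yes | Yes |
| OGG | audio/ogg | Yes | Yes |
| FLAC | audio/flac | Yes | Yes |
| WebM | audio/webm | Yes | Yes |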

OpenAI transcription models:

| Model | Notes |
| --- | --- |
| whisper-1 | Classic Whisper model. Per-minute pricing (~$0.006/min). Supports many languages. |
| gpt-4o-transcribe | GPT-4o based. Higher quality, especially for noisy audio or rare languages. |
| gpt-4o-mini-transcribe | Cheaper than gpt-4o-transcribe, better than whisper-1 for most cases. |

Cost tracking:

OpenAI transcription is priced per minute of audio (not per token). The SDK calls calculateTranscriptionCost() with the per-minute rate from the model catalog and emits onCostEntry so the cost collector tracks it. For WAV audio, duration is parsed automatically. For other formats, pass audioDurationSeconds. When duration is unavailable, cost is recorded as zero with a note in providerEvidence explaining why.

Google transcription is priced as a normal completion (per input/output token). Cost is tracked through the standard onCompletion hook.
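The per-minute arithmetic on the OpenAI path is simple enough to sketch. The helper below is hypothetical: it is not the SDK’s calculateTranscriptionCost(), whose signature is not shown in this article. It also assumes linear proration by the second, which may not match how the provider rounds billing.

```typescript
// Hypothetical per-minute cost estimate, assuming linear proration by the second.
interface CostEstimate {
  usd: number;
  note?: string;
}

function estimateTranscriptionCost(
  durationSeconds: number | undefined,
  usdPerMinute: number,
): CostEstimate {
  if (durationSeconds === undefined) {
    // Mirrors the documented behaviour: unknown duration is recorded as zero, with a reason.
    return { usd: 0, note: 'audio duration unknown; cost recorded as zero' };
  }
  return { usd: (durationSeconds / 60) * usdPerMinute };
}

// A 127-second call at roughly $0.006 per minute.
console.log(estimateTranscriptionCost(127, 0.006).usd.toFixed(4)); // 0.0127
```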

transcribe() vs audio input to complete() — decision table:

| | transcribe() | Audio input to complete() |
| --- | --- | --- |
| Output | { text } only | Full model reply (text + optional audio) |
| Cost | Per-minute (OpenAI) / per-token (Google completion) | Per-token (full model) |
| Context | No (single audio file) | Yes (mix with text messages) |
| Language support | All Whisper languages | Depends on model |
| Best for | Bulk STT, transcription pipelines | Reasoning about audio content |

import { transcribe } from '@combycode/llm-sdk';

// Unified STT. openai routes to the /v1/audio/transcriptions endpoint; google
// (a generateContent model) transcribes via a normal completion — one call either
// way. (Official samples each hit a provider-specific transcription path.)
const t0 = performance.now();
const { text } = await transcribe({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  audio: '../../official-samples/_fixtures/hello.wav',
});

console.log(JSON.stringify({ result: text.trim() || 'empty', ms: Math.round(performance.now() - t0) }));

OpenAI’s SDK requires manually building a FormData (or using toFile()), posting to /v1/audio/transcriptions, and reading response.text. For Google there is no STT method at all — you write a prompt, call generateContent, and parse the reply. ORXA calls transcribe() in both cases and returns { text }. WAV duration parsing for cost tracking and the Google completion fallback are handled internally.

WAV duration is parsed automatically; other formats are not. The SDK reads the RIFF header (sample rate, channel count, data chunk size) and computes duration without any external library. MP3, FLAC, OGG, and AAC have variable-length headers that require a dedicated parser — the SDK does not bundle one. Pass audioDurationSeconds for those formats.

File size limits apply. OpenAI’s transcription endpoint accepts up to 25 MB per file. For longer recordings, split the audio into segments that each stay under 25 MB and transcribe them one at a time. Google’s limit for inline audio is 20 MB; use the Files API for larger files.
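For uncompressed PCM WAV, segment boundaries can be planned with plain arithmetic, as long as each cut lands on a whole frame (blockAlign bytes) and each segment is re-wrapped with its own header. The sketch below only computes the byte ranges. planWavSegments is a hypothetical helper, the 44-byte header size assumes a canonical PCM header with no extra chunks, and compressed formats (MP3, OGG, and so on) need a real audio tool instead of byte slicing.

```typescript
// Hypothetical planner: byte ranges within a PCM WAV data chunk that each fit a size limit.
interface WavSegment {
  offset: number; // start within the data chunk, in bytes
  length: number; // payload bytes in this segment
}

function planWavSegments(
  dataBytes: number,
  blockAlign: number, // bytes per frame: channels * bytesPerSample
  maxFileBytes: number,
  headerBytes: number = 44, // assumed canonical PCM header size
): WavSegment[] {
  // Largest payload that fits beside the header, rounded down to a whole frame.
  const maxPayload = Math.floor((maxFileBytes - headerBytes) / blockAlign) * blockAlign;
  if (maxPayload <= 0) throw new Error('maxFileBytes is too small for even one frame');

  const segments: WavSegment[] = [];
  for (let offset = 0; offset < dataBytes; offset += maxPayload) {
    segments.push({ offset, length: Math.min(maxPayload, dataBytes - offset) });
  }
  return segments;
}

// Tiny numbers for illustration: 100 payload bytes, 4-byte frames, 84-byte file limit.
console.log(planWavSegments(100, 4, 84).map((s) => s.length)); // [ 40, 40, 20 ]
```

In practice a limit with some margin, for example `25 * 1000 * 1000` bytes, avoids depending on whether the provider counts megabytes in powers of ten or two. Cutting at silence rather than at fixed byte offsets is likely to be kinder to accuracy, since a fixed cut can split a word.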

prompt is ignored by OpenAI in transcribe(). The OpenAI transcription endpoint does accept an optional prompt parameter for vocabulary hints, but the current transcribe() implementation does not forward it. If you need that level of control, call the adapter directly.

Google quality depends on the model. gemini-2.0-flash is fast and cheap; gemini-2.0-pro is more accurate for difficult audio (accents, background noise). Both use the completion path — transcription quality scales with model capability.

For real-time transcription, use the Realtime API. transcribe() is batch-oriented (a complete audio file in, a transcript out). For live microphone input, use Realtime.

Next steps:

  • Audio input — send audio to a multimodal model for understanding (not just transcription)
  • TTS — generate audio from text
  • Realtime — live audio streaming sessions