Speech to text (STT)

What you will achieve

Transcribe a WAV file saying 'hello' via transcribe() and assert the returned { text } matches /hello/i on OpenAI and Google.

When and why you need this

Use transcribe() when you need speech converted to text at scale or at low cost — call recordings, meeting notes, podcast captions, voice commands. It routes to a dedicated transcription endpoint (OpenAI) or a lightweight completion call (Google) and returns only { text }.

This is different from audio input to complete() (article 15). complete() with an AudioPart runs the full multimodal model — it can reason about the audio, combine it with text context, and return a thoughtful reply. transcribe() is purely speech-to-text: cheaper per minute, no reasoning, no conversation context.

The challenge with raw provider SDKs:

OpenAI has a dedicated endpoint POST /v1/audio/transcriptions that accepts a multipart form upload (file, model, optional language). The response is { text: string }. You must buffer the file, build the FormData, and parse the response yourself.
Google has no dedicated transcription endpoint. The adapter calls complete() with the audio inlined and a 'Transcribe this audio. Reply with only the spoken words.' system prompt, then returns the text response.

transcribe() routes to the right path and normalises the result to { text }.
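
For comparison, the raw OpenAI path described above can be sketched roughly as follows. This is an illustration, not the SDK's implementation: buildTranscriptionForm and postTranscription are hypothetical helper names, and the sketch assumes a runtime with global fetch, FormData, and Blob (Node 18+ or a browser).

```typescript
// Builds the multipart body for POST /v1/audio/transcriptions.
// Hypothetical helper for illustration; not an SDK export.
function buildTranscriptionForm(
  bytes: Uint8Array,
  mimeType: string,
  filename: string,
  model: string,
  language?: string,
): FormData {
  const form = new FormData();
  // Copy the bytes so the Blob owns a fresh buffer.
  form.append('file', new Blob([new Uint8Array(bytes)], { type: mimeType }), filename);
  form.append('model', model);
  if (language) form.append('language', language); // optional field
  return form;
}

// Sends the form and returns the transcript from the { text } response.
async function postTranscription(form: FormData, apiKey: string): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    // No Content-Type header here: fetch adds the multipart boundary itself.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
  const json = (await res.json()) as { text: string };
  return json.text;
}
```

Keeping the form builder separate from the network call makes the upload shape easy to inspect in a unit test without spending API credits.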

Step by step

Step 1 — Transcribe a WAV file by path

import { transcribe } from '@combycode/llm-sdk';

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './hello.wav',
});

console.log(text); // "hello"

Pass a file path as audio. The SDK reads the file with loadContent(), detects the MIME type, and sends it as a multipart form upload to /v1/audio/transcriptions. The returned text is the raw transcript.

Step 2 — Transcribe raw bytes

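import { readFileSync } from 'fs';
import { transcribe } from '@combycode/llm-sdk';

// Raw bytes carry no filename, so the MIME type defaults to audio/wav.
// Use a WAV source here; Step 3 covers other formats.
const audioBytes = new Uint8Array(readFileSync('./recording.wav'));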

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: audioBytes,
});

audio accepts string (file path), Uint8Array (raw bytes — MIME defaults to audio/wav), or an AudioInput object for explicit MIME declaration. Bytes are uploaded as a multipart form; the filename extension is inferred from the MIME type (audio.wav, audio.mp3, etc.).
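
The three accepted shapes can be pictured as resolving to a single MIME type before upload. The sketch below is one plausible way to do that, not the SDK's code: AudioInputSketch is a local stand-in for AudioInput, resolveAudioMime is a hypothetical name, and extension-based detection for file paths (with audio/wav as the fallback for unknown extensions) is an assumption.

```typescript
// Local stand-in for the SDK's AudioInput type (fields taken from this article).
interface AudioInputSketch {
  data: Uint8Array | string;
  mimeType?: string;
}

// Extension-to-MIME table. An assumption covering the formats listed in this article.
const MIME_BY_EXTENSION: Record<string, string> = {
  wav: 'audio/wav',
  mp3: 'audio/mpeg',
  m4a: 'audio/mp4',
  ogg: 'audio/ogg',
  flac: 'audio/flac',
  webm: 'audio/webm',
};

// Resolves the MIME type for any of the three accepted `audio` shapes.
function resolveAudioMime(audio: string | Uint8Array | AudioInputSketch): string {
  if (typeof audio === 'string') {
    // File path: guess from the extension (assumed detection strategy).
    const ext = audio.split('.').pop()?.toLowerCase() ?? '';
    return MIME_BY_EXTENSION[ext] ?? 'audio/wav';
  }
  if (audio instanceof Uint8Array) return 'audio/wav'; // raw bytes: documented default
  if (audio.mimeType) return audio.mimeType; // explicit declaration wins
  return resolveAudioMime(audio.data); // otherwise fall back to the payload
}
```

The practical consequence is that raw MP3 bytes passed without an AudioInput wrapper would be labelled audio/wav, which is why Step 3 declares the MIME type explicitly.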

Step 3 — Declare MIME explicitly for non-WAV formats

import { transcribe } from '@combycode/llm-sdk';
import type { AudioInput } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

const bytes = new Uint8Array(readFileSync('./clip.mp3'));

const audio: AudioInput = {
  data: bytes,
  mimeType: 'audio/mpeg',
};

const { text } = await transcribe({
  model: 'openai/gpt-4o-transcribe',
  apiKey: process.env.OPENAI_API_KEY,
  audio,
});

When passing raw bytes, the MIME type defaults to audio/wav. Pass an AudioInput object with mimeType to override. The MIME controls both the Content-Type header in the multipart form and the inferred filename extension sent to the API.
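
To make the filename inference concrete, here is a minimal sketch of a MIME-to-filename mapping. The table and the helper name uploadFilenameForMime are assumptions for illustration; the SDK's actual mapping may include more formats or different fallbacks.

```typescript
// MIME-to-extension table. An assumption based on the formats this article lists.
const EXTENSION_BY_MIME: Record<string, string> = {
  'audio/wav': 'wav',
  'audio/mpeg': 'mp3',
  'audio/mp4': 'm4a',
  'audio/ogg': 'ogg',
  'audio/flac': 'flac',
  'audio/webm': 'webm',
};

// Returns the upload filename for a MIME type, falling back to .wav when unknown.
function uploadFilenameForMime(mimeType: string): string {
  // Drop parameters such as "; codecs=opus" before the lookup.
  const base = mimeType.replace(/;.*$/, '').trim().toLowerCase();
  return `audio.${EXTENSION_BY_MIME[base] ?? 'wav'}`;
}
```

With this mapping, 'audio/mpeg' yields audio.mp3 and 'audio/wav' yields audio.wav, matching the examples in Step 2.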

Step 4 — Add a language hint

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './recording-fr.wav',
  language: 'fr',
});

language is a BCP-47 language code (e.g. 'en', 'fr', 'de', 'ja'). OpenAI documents its field as an ISO-639-1 code, so plain two-letter codes are the safest choice. For OpenAI it is sent as a language field in the multipart form, which can improve accuracy and speed for non-English audio. For Google it is passed in the completion prompt (not a dedicated field).

Step 5 — Provide audio duration for accurate cost tracking

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './call.wav',
  audioDurationSeconds: 127,
});

OpenAI’s transcription endpoint does not return audio duration in the response, so the SDK cannot calculate cost without it. When audioDurationSeconds is omitted:

For WAV files, the SDK parses the RIFF header to derive duration from sampleRate, channels, and data chunk size — no extra input needed.
For all other formats (MP3, FLAC, OGG, etc.), the SDK emits a cost of zero with a log note explaining that duration is unknown.

Pass audioDurationSeconds for non-WAV audio to get accurate cost reporting from the onCostEntry hook.
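
The WAV case can be sketched as follows. This is an illustrative parser under stated assumptions (uncompressed PCM, a fmt chunk that precedes the data chunk, a trustworthy data size), not the SDK's own code, and wavDurationSeconds is a hypothetical name.

```typescript
// Derives duration in seconds from a RIFF/WAVE buffer, or undefined if the
// buffer is not a parseable PCM WAV.
function wavDurationSeconds(bytes: Uint8Array): number | undefined {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Reads a four-character chunk tag at the given offset.
  const tag = (o: number): string =>
    String.fromCharCode(view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3));

  if (bytes.byteLength < 12 || tag(0) !== 'RIFF' || tag(8) !== 'WAVE') return undefined;

  let byteRate = 0;
  let offset = 12; // first chunk starts after "RIFF" + size + "WAVE"
  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true); // sizes are little-endian
    if (id === 'fmt ' && offset + 24 <= bytes.byteLength) {
      const channels = view.getUint16(offset + 10, true);
      const sampleRate = view.getUint32(offset + 12, true);
      const bitsPerSample = view.getUint16(offset + 22, true);
      byteRate = sampleRate * channels * (bitsPerSample / 8);
    } else if (id === 'data') {
      // Duration = PCM payload size divided by bytes consumed per second.
      return byteRate > 0 ? size / byteRate : undefined;
    }
    offset += 8 + size + (size % 2); // chunks are padded to an even length
  }
  return undefined;
}
```

For example, a 16 kHz mono 16-bit file consumes 32,000 bytes per second, so a 32,000-byte data chunk corresponds to one second of audio. Anything that is not RIFF/WAVE returns undefined, which corresponds to the zero-cost fallback described above.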

Step 6 — Use Google for transcription

const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './recording.wav',
});

For Google, transcribe() delegates to complete() internally: it attaches the audio as an AudioPart and uses the default prompt 'Transcribe this audio. Reply with only the spoken words.'. Override the prompt via prompt:

const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './meeting.wav',
  prompt: 'Transcribe this meeting audio. Include speaker labels if you can identify them.',
});
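
As a rough picture of what the completion path sends, the sketch below builds a generateContent request body with the audio inlined as base64. It is an assumption-laden illustration: buildGeminiTranscriptionBody and bytesToBase64 are hypothetical helpers, and placing the prompt in systemInstruction follows this article's description rather than a confirmed reading of the adapter.

```typescript
const DEFAULT_TRANSCRIBE_PROMPT = 'Transcribe this audio. Reply with only the spoken words.';

// Base64-encodes raw bytes using the global btoa (Node 16+ or a browser).
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary);
}

// Builds a generateContent request body with the audio as an inline part.
function buildGeminiTranscriptionBody(
  bytes: Uint8Array,
  mimeType: string,
  prompt: string = DEFAULT_TRANSCRIBE_PROMPT,
) {
  return {
    // Assumed placement: the transcription prompt travels as the system instruction.
    systemInstruction: { parts: [{ text: prompt }] },
    contents: [
      {
        role: 'user',
        parts: [{ inlineData: { mimeType, data: bytesToBase64(bytes) } }],
      },
    ],
  };
}
```

Because base64 grows the payload by about a third, inline audio reaches request size limits sooner than the raw file size suggests.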

Your options

TranscribeOptions — full parameter set:

Option	Type	Required	Description
`model`	`string`	Yes	Namespaced (`'openai/whisper-1'`) or bare with `provider`.
`provider`	`ProviderName`	When `model` is bare	`'openai'` or `'google'`.
`apiKey`	`string`	No	Falls back to `engine.apiKeys[provider]`.
`audio`	`string | Uint8Array | AudioInput`	Yes	File path, raw bytes, or `AudioInput` with explicit `mimeType`.
`language`	`string`	No	BCP-47 language code. Sent to OpenAI as `language` field.
`prompt`	`string`	No	Transcription prompt for Google (completion path). Ignored by OpenAI. Default: `'Transcribe this audio. Reply with only the spoken words.'`
`audioDurationSeconds`	`number`	No	Caller-supplied duration in seconds, used for cost calculation. Derived from the WAV header when absent; for other formats the cost is recorded as zero.
`engine`	`EngineHandle`	No	Share an existing engine instance.

AudioInput object:

Field	Type	Description
`data`	`Uint8Array | string`	Raw audio bytes or a file path.
`mimeType`	`string | undefined`	Explicit MIME type. Overrides detection.
`sampleRate`	`number | undefined`	PCM sample rate hint (for raw/stream audio).

Audio formats accepted:

Format	MIME	OpenAI	Google
WAV	`audio/wav`	Yes	Yes
MP3	`audio/mpeg`	Yes	Yes
M4A / AAC	`audio/mp4`	Yes	Yes
OGG	`audio/ogg`	Yes	Yes
FLAC	`audio/flac`	Yes	Yes
WebM	`audio/webm`	Yes	Yes

OpenAI transcription models:

Model	Notes
`whisper-1`	Classic Whisper model. Per-minute pricing (~$0.006/min). Supports many languages.
`gpt-4o-transcribe`	GPT-4o based. Higher quality, especially for noisy audio or rare languages.
`gpt-4o-mini-transcribe`	Cheaper than `gpt-4o-transcribe`, better than `whisper-1` for most cases.

Cost tracking:

OpenAI transcription is priced per minute of audio (not per token). The SDK calls calculateTranscriptionCost() with the per-minute rate from the model catalog and emits onCostEntry so the cost collector tracks it. For WAV audio, duration is parsed automatically. For other formats, pass audioDurationSeconds. When duration is unavailable, cost is recorded as zero with a note in providerEvidence explaining why.

Google transcription is priced as a normal completion (per input/output token). Cost is tracked through the standard onCompletion hook.
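
The per-minute arithmetic for the OpenAI path is simple enough to sketch. perMinuteTranscriptionCost below is a hypothetical helper, not the SDK's calculateTranscriptionCost, and the rate is passed in by the caller rather than read from a model catalog.

```typescript
// Per-minute pricing: cost = (duration in minutes) * rate per minute.
// Returns 0 when the duration is unknown, mirroring the zero-cost fallback.
function perMinuteTranscriptionCost(
  durationSeconds: number | undefined,
  usdPerMinute: number,
): number {
  if (durationSeconds === undefined || durationSeconds <= 0) return 0;
  return (durationSeconds / 60) * usdPerMinute;
}
```

Using the approximate whisper-1 rate quoted above, the 127-second call from Step 5 works out to (127 / 60) × 0.006, or roughly $0.0127.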

transcribe() vs audio input to complete() — decision table:

	`transcribe()`	Audio input to `complete()`
Output	`{ text }` only	Full model reply (text + optional audio)
Cost	Per-minute (OpenAI) / per-token (Google completion)	Per-token (full model)
Context	No — single audio file	Yes — mix with text messages
Language support	All Whisper languages (OpenAI); depends on model (Google)	Depends on model
Best for	Bulk STT, transcription pipelines	Reasoning about audio content

Compare the SDKs

import { transcribe } from '@combycode/llm-sdk';

// Unified STT. openai routes to the /v1/audio/transcriptions endpoint; google
// (a generateContent model) transcribes via a normal completion — one call either
// way. (Official samples each hit a provider-specific transcription path.)
const t0 = performance.now();
const { text } = await transcribe({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  audio: '../../official-samples/_fixtures/hello.wav',
});

console.log(JSON.stringify({ result: text.trim() || 'empty', ms: Math.round(performance.now() - t0) }));

OpenAI’s SDK requires manually building a FormData (or using toFile()), posting to /v1/audio/transcriptions, and reading response.text. For Google there is no STT method at all: you write a prompt, call generateContent, and parse the reply. With ORXA you call transcribe() in both cases and get back { text }. WAV duration parsing for cost tracking and the Google completion fallback are handled internally.
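
The "parse the reply" step on the Google side can be sketched as below. GeminiResponseSketch models only the slice of the generateContent response that carries text, and extractGeminiText is a hypothetical helper rather than anything the SDK exports.

```typescript
// Minimal slice of the generateContent response shape used here.
interface GeminiResponseSketch {
  candidates?: { content?: { parts?: { text?: string }[] } }[];
}

// Joins the text parts of the first candidate into one transcript string.
// Returns an empty string when the response carries no text.
function extractGeminiText(response: GeminiResponseSketch): string {
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  return parts
    .map((p) => p.text ?? '')
    .join('')
    .trim();
}
```

Every field is optional in the sketch because a blocked or empty response may omit candidates entirely, and a transcription wrapper should degrade to an empty string rather than throw.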

Gotchas and next steps

WAV duration is parsed automatically; other formats are not. The SDK reads the RIFF header (sample rate, channel count, data chunk size) and computes duration without any external library. MP3, FLAC, OGG, and AAC have variable-length headers that require a dedicated parser — the SDK does not bundle one. Pass audioDurationSeconds for those formats.

File size limits apply. OpenAI’s transcription endpoint accepts up to 25 MB per file. For longer recordings, split the audio into segments that each stay under 25 MB before transcribing. Google’s inline limit for audio is 20 MB; use the Files API for larger files.
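
For uncompressed WAV, the splitting advice can be sketched as a planning step that picks byte ranges over the PCM data. This is a hypothetical helper (planPcmSegments), and it reads "25 MB" conservatively as 25,000,000 bytes, which is an assumption. Compressed formats such as MP3 cannot be cut safely at arbitrary byte offsets and need an audio tool instead.

```typescript
// Conservative reading of the 25 MB upload limit (assumption).
const OPENAI_MAX_UPLOAD_BYTES = 25 * 1000 * 1000;

// Plans [start, end) byte ranges over a WAV data chunk so that each range,
// plus a header, stays under maxBytes and ends on a whole sample frame.
function planPcmSegments(
  dataBytes: number,
  blockAlign: number, // bytes per frame = channels * bytes per sample
  maxBytes: number = OPENAI_MAX_UPLOAD_BYTES,
  headerBytes: number = 44, // size of a canonical PCM WAV header
): Array<[number, number]> {
  const budget = maxBytes - headerBytes;
  const segmentBytes = Math.floor(budget / blockAlign) * blockAlign;
  if (segmentBytes <= 0) throw new Error('maxBytes is too small for one frame');

  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < dataBytes; start += segmentBytes) {
    ranges.push([start, Math.min(start + segmentBytes, dataBytes)]);
  }
  return ranges;
}
```

Each planned range still has to be wrapped in its own WAV header before upload, and cutting mid-word can hurt accuracy at the seams, so splitting at silences is preferable when the audio allows it.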

prompt is ignored by OpenAI in transcribe(). The OpenAI transcription endpoint does accept an optional prompt parameter for vocabulary hints, but the current transcribe() implementation does not forward it. If you need vocabulary hints, call the adapter directly for that level of control.

Google quality depends on the model. gemini-2.0-flash is fast and cheap; gemini-2.0-pro is more accurate for difficult audio (accents, background noise). Both use the completion path — transcription quality scales with model capability.

For real-time transcription, use the Realtime API. transcribe() is batch-oriented (a complete audio file in, a transcript out). For live microphone input, use Realtime.

Next steps:

Audio input — send audio to a multimodal model for understanding (not just transcription)
TTS — generate audio from text
Realtime — live audio streaming sessions