Speech to text (STT)
What you will achieve
Transcribe a WAV file saying 'hello' via transcribe() and assert the returned { text } matches /hello/i on OpenAI and Google.
When and why you need this
Use transcribe() when you need speech converted to text at scale or at low cost: call recordings, meeting notes, podcast captions, voice commands. It routes to a dedicated transcription endpoint (OpenAI) or a lightweight completion call (Google) and returns only { text }.
This is different from audio input to complete() (article 15). complete() with an AudioPart runs the full multimodal model — it can reason about the audio, combine it with text context, and return a thoughtful reply. transcribe() is purely speech-to-text: cheaper per minute, no reasoning, no conversation context.
The challenge with raw provider SDKs:
- OpenAI has a dedicated endpoint, POST /v1/audio/transcriptions, that accepts a multipart form upload (file, model, optionally language). The response is { text: string }. You must buffer the file, build the FormData, and parse the response yourself.
- Google has no dedicated transcription endpoint. The adapter calls complete() with the audio inlined and a 'Transcribe this audio. Reply with only the spoken words.' system prompt, then returns the text response.
transcribe() routes to the right path and normalises the result to { text }.
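To make the two provider paths concrete, here is a minimal sketch of the request shapes that sit underneath. The function names are hypothetical and are not part of the SDK. The Gemini field names (`contents`, `parts`, `inlineData`) follow the public REST reference as I understand it, and placing the prompt as a user text part rather than a system instruction is an assumption.

```typescript
// Sketch only: hypothetical helpers, not the SDK's internal code.

/** Build the multipart body for OpenAI's POST /v1/audio/transcriptions. */
function buildOpenAiForm(
  file: Blob,
  filename: string,
  model: string,
  language?: string,
): FormData {
  const form = new FormData();
  form.append('file', file, filename);
  form.append('model', model);
  if (language) form.append('language', language); // optional hint
  return form;
}

/** Send the form and read the { text } response. */
async function postOpenAiTranscription(form: FormData, apiKey: string): Promise<string> {
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` }, // fetch sets the multipart boundary itself
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription request failed with status ${res.status}`);
  const json = (await res.json()) as { text: string };
  return json.text;
}

/** Build a Gemini generateContent body with the audio inlined as base64. */
function buildGeminiBody(audioBase64: string, mimeType: string, prompt: string) {
  return {
    contents: [
      {
        role: 'user',
        // Assumption: prompt sent as a text part next to the inline audio.
        parts: [{ text: prompt }, { inlineData: { mimeType, data: audioBase64 } }],
      },
    ],
  };
}
```

The point of the comparison is that the two providers differ in transport (multipart upload versus JSON with base64 audio) and in response shape, which is the difference transcribe() normalises away.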
Step by step
Step 1 — Transcribe a WAV file by path
```typescript
import { transcribe } from '@combycode/llm-sdk';

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './hello.wav',
});

console.log(text); // "hello"
```

Pass a file path as audio. The SDK reads the file with loadContent(), detects the MIME type, and sends it as a multipart form upload to /v1/audio/transcriptions. The returned text is the raw transcript.
Step 2 — Transcribe raw bytes
```typescript
import { readFileSync } from 'fs';
import { transcribe } from '@combycode/llm-sdk';

// Read the whole file into memory as raw bytes.
const audioBytes = new Uint8Array(readFileSync('./recording.mp3'));

const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: audioBytes,
});
```

audio accepts a string (file path), a Uint8Array (raw bytes; MIME defaults to audio/wav), or an AudioInput object for explicit MIME declaration. Bytes are uploaded as a multipart form; the filename extension is inferred from the MIME type (audio.wav, audio.mp3, etc.).
Step 3 — Declare MIME explicitly for non-WAV formats
```typescript
import { readFileSync } from 'fs';
import { transcribe } from '@combycode/llm-sdk';
import type { AudioInput } from '@combycode/llm-sdk';

const bytes = new Uint8Array(readFileSync('./clip.mp3'));

// Declare the MIME type explicitly so the bytes are not treated as WAV.
const audio: AudioInput = {
  data: bytes,
  mimeType: 'audio/mpeg',
};

const { text } = await transcribe({
  model: 'openai/gpt-4o-transcribe',
  apiKey: process.env.OPENAI_API_KEY,
  audio,
});
```

When passing raw bytes, the MIME type defaults to audio/wav. Pass an AudioInput object with mimeType to override. The MIME type controls both the Content-Type header in the multipart form and the inferred filename extension sent to the API.
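The MIME rules for raw bytes can be sketched as two small helpers. This is an illustration under stated assumptions, not the SDK's code: `resolveMimeForBytes` and `uploadFilename` are hypothetical names, the extension table beyond `audio/wav` and `audio/mpeg` is my guess, and so is the fallback for unknown MIME types.

```typescript
// Local stand-in for the byte-carrying form of AudioInput described above.
interface AudioBytesInput {
  data: Uint8Array;
  mimeType?: string;
}

// Assumed mapping; only the wav and mp3 entries are confirmed by this article.
const EXTENSION_BY_MIME: Record<string, string> = {
  'audio/wav': 'wav',
  'audio/mpeg': 'mp3',
  'audio/mp4': 'm4a',
  'audio/ogg': 'ogg',
  'audio/flac': 'flac',
  'audio/webm': 'webm',
};

/** An explicit mimeType wins; bare bytes fall back to audio/wav. */
function resolveMimeForBytes(audio: Uint8Array | AudioBytesInput): string {
  if (audio instanceof Uint8Array) return 'audio/wav';
  return audio.mimeType ?? 'audio/wav';
}

/** Derive the upload filename from the MIME type (fallback is an assumption). */
function uploadFilename(mimeType: string): string {
  const extension = EXTENSION_BY_MIME[mimeType] ?? 'wav';
  return `audio.${extension}`;
}
```

With these rules, MP3 bytes passed without an `AudioInput` wrapper would be labelled `audio/wav` and uploaded as `audio.wav`, which is why Step 3 declares the MIME type explicitly.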
Step 4 — Add a language hint
```typescript
const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './recording-fr.wav',
  language: 'fr',
});
```

language is a BCP-47 language code (e.g. 'en', 'fr', 'de', 'ja'). For OpenAI it is sent as a language field in the multipart form, which can improve accuracy and speed for non-English audio. For Google it is passed in the completion prompt (not a dedicated field).
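As a rough illustration of that routing, the sketch below sends the hint to a form field for OpenAI and folds it into the prompt for Google. `applyLanguageHint` is a hypothetical helper, and the sentence appended to the Google prompt is invented for the example; the SDK's actual wording is not documented here.

```typescript
type Provider = 'openai' | 'google';

interface HintedRequest {
  formFields: Record<string, string>; // extra multipart fields (OpenAI path)
  prompt: string; // completion prompt (Google path)
}

const DEFAULT_PROMPT = 'Transcribe this audio. Reply with only the spoken words.';

/** Route an optional language hint to the place each provider expects it. */
function applyLanguageHint(
  provider: Provider,
  language?: string,
  prompt: string = DEFAULT_PROMPT,
): HintedRequest {
  if (!language) return { formFields: {}, prompt };
  if (provider === 'openai') return { formFields: { language }, prompt };
  // Assumed wording: the hint becomes part of the prompt text.
  return { formFields: {}, prompt: `${prompt} The audio language is "${language}".` };
}
```

Because the Google path treats the hint as prompt text, its effect there depends on how the model interprets the instruction, whereas OpenAI handles it as a dedicated request parameter.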
Step 5 — Provide audio duration for accurate cost tracking
```typescript
const { text } = await transcribe({
  model: 'openai/whisper-1',
  apiKey: process.env.OPENAI_API_KEY,
  audio: './call.wav',
  audioDurationSeconds: 127,
});
```

OpenAI’s transcription endpoint does not return audio duration in the response, so the SDK cannot calculate cost without it. When audioDurationSeconds is omitted:
- For WAV files, the SDK parses the RIFF header to derive duration from sampleRate, channels, and data chunk size, so no extra input is needed.
- For all other formats (MP3, FLAC, OGG, etc.), the SDK emits a cost of zero with a log note explaining that duration is unknown.
Pass audioDurationSeconds for non-WAV audio to get accurate cost reporting from the onCostEntry hook.
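The WAV case can be sketched as a small RIFF walker. This is a minimal illustration rather than the SDK's parser: `wavDurationSeconds` is a hypothetical name, it assumes uncompressed PCM with the `fmt ` chunk ahead of the `data` chunk, and it also reads bits per sample, which is needed alongside sample rate and channel count to get bytes per second.

```typescript
/**
 * Derive duration in seconds from a PCM WAV file's RIFF header.
 * Returns undefined when the bytes cannot be parsed as WAV.
 */
function wavDurationSeconds(bytes: Uint8Array): number | undefined {
  if (bytes.byteLength < 12) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Read a four-character chunk tag such as "RIFF" or "data".
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );

  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE') return undefined;

  let bytesPerSecond: number | undefined;
  let offset = 12; // first chunk starts after "RIFF", size, "WAVE"

  while (offset + 8 <= bytes.byteLength) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true); // little-endian chunk size

    if (id === 'fmt ') {
      if (offset + 24 > bytes.byteLength) return undefined;
      const channels = view.getUint16(offset + 10, true);
      const sampleRate = view.getUint32(offset + 12, true);
      const bitsPerSample = view.getUint16(offset + 22, true);
      bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
    } else if (id === 'data') {
      if (!bytesPerSecond) return undefined; // no usable fmt chunk seen
      return size / bytesPerSecond;
    }

    offset += 8 + size + (size % 2); // chunks are padded to even length
  }
  return undefined;
}
```

Only the header is inspected, so the calculation trusts the declared `data` chunk size. Streamed WAV files that write a placeholder size would need the duration supplied by the caller instead.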
Step 6 — Use Google for transcription
```typescript
const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './recording.wav',
});
```

For Google, transcribe() delegates to complete() internally: it attaches the audio as an AudioPart and uses the default prompt 'Transcribe this audio. Reply with only the spoken words.'. Override it with the prompt option:

```typescript
const { text } = await transcribe({
  model: 'google/gemini-2.0-flash',
  apiKey: process.env.GOOGLE_API_KEY,
  audio: './meeting.wav',
  prompt: 'Transcribe this meeting audio. Include speaker labels if you can identify them.',
});
```

Your options
TranscribeOptions (full parameter set):
| Option | Type | Required | Description |
|---|---|---|---|
| model | `string` | Yes | Namespaced ('openai/whisper-1') or bare with provider. |
| provider | `ProviderName` | When model is bare | 'openai' or 'google'. |
| apiKey | `string` | No | Falls back to engine.apiKeys[provider]. |
| audio | `string \| Uint8Array \| AudioInput` | Yes | File path, raw bytes, or AudioInput with explicit mimeType. |
| language | `string` | No | BCP-47 language code. Sent to OpenAI as the language field; included in the completion prompt for Google. |
| prompt | `string` | No | Transcription prompt for Google (completion path). Ignored by OpenAI. Default: 'Transcribe this audio. Reply with only the spoken words.' |
| audioDurationSeconds | `number` | No | Caller-supplied duration for cost calculation. Auto-derived from the WAV header when absent. Other formats report a cost of zero. |
| engine | `EngineHandle` | No | Share an existing engine instance. |
AudioInput object:
| Field | Type | Description |
|---|---|---|
| data | `Uint8Array \| string` | Raw audio bytes or a file path. |
| mimeType | `string \| undefined` | Explicit MIME type. Overrides detection. |
| sampleRate | `number \| undefined` | PCM sample rate hint (for raw/stream audio). |
Audio formats accepted:
| Format | MIME | OpenAI | Google |
|---|---|---|---|
| WAV | audio/wav | Yes | Yes |
| MP3 | audio/mpeg | Yes | Yes |
| M4A / AAC | audio/mp4 | Yes | Yes |
| OGG | audio/ogg | Yes | Yes |
| FLAC | audio/flac | Yes | Yes |
| WebM | audio/webm | Yes | Yes |
OpenAI transcription models:
| Model | Notes |
|---|---|
| whisper-1 | Classic Whisper model. Per-minute pricing (~$0.006/min). Supports many languages. |
| gpt-4o-transcribe | GPT-4o based. Higher quality, especially for noisy audio or rare languages. |
| gpt-4o-mini-transcribe | Cheaper than gpt-4o-transcribe, better than whisper-1 for most cases. |
Cost tracking:
OpenAI transcription is priced per minute of audio (not per token). The SDK calls calculateTranscriptionCost() with the per-minute rate from the model catalog and emits onCostEntry so the cost collector tracks it. For WAV audio, duration is parsed automatically. For other formats, pass audioDurationSeconds. When duration is unavailable, cost is recorded as zero with a note in providerEvidence explaining why.
Google transcription is priced as a normal completion (per input/output token). Cost is tracked through the standard onCompletion hook.
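For the per-minute OpenAI path, the arithmetic is simple enough to sketch. `estimateTranscriptionCost` is a hypothetical stand-in, not the SDK's calculateTranscriptionCost(), whose signature is not shown in this article; the rate used in the usage note is the approximate whisper-1 figure from the table above.

```typescript
interface TranscriptionCostEstimate {
  costUsd: number;
  note?: string; // explains a zero cost when duration is unknown
}

/** Per-minute pricing: cost = (seconds / 60) * rate per minute. */
function estimateTranscriptionCost(
  ratePerMinuteUsd: number,
  durationSeconds?: number,
): TranscriptionCostEstimate {
  if (durationSeconds === undefined) {
    // Mirrors the behaviour described above: report zero and say why.
    return { costUsd: 0, note: 'audio duration unknown; pass audioDurationSeconds' };
  }
  return { costUsd: (durationSeconds / 60) * ratePerMinuteUsd };
}
```

For the 127-second call from Step 5 at roughly $0.006 per minute, this works out to about $0.0127. Omitting the duration yields a zero cost with an explanatory note rather than a guess.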
transcribe() vs audio input to complete() — decision table:
| | transcribe() | Audio input to complete() |
|---|---|---|
| Output | { text } only | Full model reply (text + optional audio) |
| Cost | Per-minute (OpenAI) / per-token (Google completion) | Per-token (full model) |
| Context | No — single audio file | Yes — mix with text messages |
| Language support | All Whisper languages | Depends on model |
| Best for | Bulk STT, transcription pipelines | Reasoning about audio content |
Compare the SDKs
OpenAI’s SDK requires manually building a FormData (or using toFile()), posting to /v1/audio/transcriptions, and reading response.text. For Google there is no STT method at all: you write a prompt, call generateContent, and parse the reply. ORXA calls transcribe() in both cases and returns { text }. WAV duration parsing for cost tracking and the Google completion fallback are handled internally.
Gotchas and next steps
WAV duration is parsed automatically; other formats are not. The SDK reads the RIFF header (sample rate, channel count, data chunk size) and computes duration without any external library. MP3, FLAC, OGG, and AAC have variable-length headers that require a dedicated parser, which the SDK does not bundle. Pass audioDurationSeconds for those formats.
File size limits apply. OpenAI’s transcription endpoint accepts up to 25 MB per file. For longer recordings, split the audio into segments of at most 25 MB each before transcribing. Google’s inline limit for audio is 20 MB; use the Files API for larger files.
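For uncompressed PCM WAV, splitting can be planned on byte offsets, as in the sketch below. Everything here is illustrative: `planPcmChunks` is a hypothetical helper, the 25 MB limit is taken as 25,000,000 bytes to stay on the conservative side, and each resulting slice would still need its own WAV header before upload. Compressed formats such as MP3 cannot be cut at arbitrary byte offsets this way.

```typescript
const MAX_UPLOAD_BYTES = 25 * 1000 * 1000; // assumed interpretation of "25 MB"
const WAV_HEADER_BYTES = 44; // canonical PCM WAV header size

/**
 * Plan [start, end) byte ranges over the PCM data so that each range,
 * plus a fresh WAV header, stays within maxFileBytes.
 * blockAlign is bytes per sample frame (channels * bytes per sample).
 */
function planPcmChunks(
  dataBytes: number,
  blockAlign: number,
  maxFileBytes: number = MAX_UPLOAD_BYTES,
): Array<[number, number]> {
  if (blockAlign <= 0) throw new RangeError('blockAlign must be positive');

  // Round the per-chunk budget down to a whole number of sample frames.
  const budget = maxFileBytes - WAV_HEADER_BYTES;
  const chunkBytes = Math.floor(budget / blockAlign) * blockAlign;
  if (chunkBytes <= 0) throw new RangeError('maxFileBytes is too small for one frame');

  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < dataBytes; start += chunkBytes) {
    ranges.push([start, Math.min(start + chunkBytes, dataBytes)]);
  }
  return ranges;
}
```

Fixed-size cuts can land in the middle of a word, so a production pipeline would likely prefer to split on silence and transcribe each segment separately.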
prompt is ignored by OpenAI in transcribe(). The OpenAI transcription endpoint does accept an optional prompt parameter for vocabulary hints, but the current transcribe() implementation does not forward it. If you need that level of control, call the adapter directly.
Google quality depends on the model. gemini-2.0-flash is fast and cheap; gemini-2.0-pro is more accurate for difficult audio (accents, background noise). Both use the completion path — transcription quality scales with model capability.
For real-time transcription, use the Realtime API. transcribe() is batch-oriented (a complete audio file in, a transcript out). For live microphone input, use Realtime.
Next steps:
- Audio input — send audio to a multimodal model for understanding (not just transcription)
- TTS — generate audio from text
- Realtime — live audio streaming sessions