Realtime / live session

Open a realtime session, send 'Say PING', and assert that a text or audio response arrives. The same createRealtime() API works on OpenAI and Google; Anthropic has no realtime API.

OpenAI realtime uses OpenAIRealtimeWebSocket from openai/beta/realtime. Google Live uses ai.live.connect(), which returns an AsyncSession with a completely different event model (a receive() async generator rather than typed event emitters). The two are incompatible, so without a wrapper each provider needs its own integration.

createRealtime() normalises both providers onto one event model (open, text, audio, turnComplete, error):

import { createRealtime } from '@combycode/llm-sdk';
const session = createRealtime({
model: process.env.LLM_MODEL!,
apiKey: process.env.LLM_API_KEY,
modalities: ['text'],
});
session.on('open', () => session.send({ text: 'Say PING' }));
session.on('text', (e) => console.log(e.delta));
session.on('turnComplete', () => session.close());
import { createRealtime } from '@combycode/llm-sdk';

// ONE unified realtime session across providers. Official samples need a separate
// per-SDK file with provider-specific event wiring (OpenAIRealtimeWebSocket vs
// ai.live.connect); here createRealtime() normalizes both onto the same
// open/send/text/audio/turnComplete event model.
// Gemini Live models are audio-native (they stream audio, not text), while OpenAI
// realtime returns text. Request each provider's natural modality: the session
// API is identical either way, and only this one hint differs.
const provider = (process.env.LLM_MODEL ?? '').split('/')[0];
const modalities: Array<'text' | 'audio'> = provider === 'google' ? ['audio'] : ['text'];

const t0 = performance.now();
const result = await new Promise<string>((resolve) => {
  let text = '';
  let bytes = 0;
  const session = createRealtime({
    model: process.env.LLM_MODEL!,
    apiKey: process.env.LLM_API_KEY,
    modalities,
  });
  const out = () => text.trim() || (bytes > 0 ? `audio:${bytes}` : '');
  const finish = (v: string) => {
    try {
      session.close();
    } catch {}
    resolve(v);
  };
  const timer = setTimeout(() => finish(out() || 'timeout'), 30000);
  session.on('open', () => session.send({ text: 'Say PING' }));
  session.on('text', (e) => {
    text += e.delta;
  });
  session.on('audio', (e) => {
    bytes += e.chunk.length;
  });
  session.on('turnComplete', () => {
    clearTimeout(timer);
    finish(out() || 'empty');
  });
  session.on('error', () => {
    clearTimeout(timer);
    finish(out() || 'error');
  });
});

console.log(JSON.stringify({ result: result || 'empty', ms: Math.round(performance.now() - t0) }));
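
To turn the run above into a pass/fail check, the result string can be classified. The following is a minimal sketch, assuming the result format produced by out() and finish() above: 'audio:<bytes>' for audio, accumulated text otherwise, or one of the sentinels 'timeout', 'empty', 'error'. The helper name isLiveResponse is hypothetical and not part of @combycode/llm-sdk.

```typescript
// Hypothetical helper, not part of @combycode/llm-sdk.
// Sentinel values that finish() resolves with when no content arrived.
const FAILURE_SENTINELS = new Set(['timeout', 'empty', 'error', '']);

/** Returns true when the session produced real text or a non-zero amount of audio. */
function isLiveResponse(result: string): boolean {
  const value = result.trim();
  if (FAILURE_SENTINELS.has(value)) return false;
  // Audio runs are reported as "audio:<byteCount>"; require at least one byte.
  const audio = /^audio:(\d+)$/.exec(value);
  if (audio) return Number(audio[1]) > 0;
  // Anything else is text accumulated from the 'text' events.
  return value.length > 0;
}
```

One limitation of this sketch: a model reply whose entire text is literally "error" or "timeout" would be classified as a failure, because the example reuses plain strings as sentinels. A stricter harness might resolve with a tagged object instead.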

createRealtime() opens a WebSocket to the provider’s live endpoint. Incoming events are normalised: OpenAI response.audio_transcript.delta and Google serverContent.modelTurn.parts both surface as { type: 'text', delta }. Google Gemini Live is audio-native; when modalities: ['audio'] is requested, audio chunks surface as { type: 'audio', chunk }. The turnComplete event fires when the provider signals end-of-turn.
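
The normalisation described above can be pictured as a pair of mapping functions. This is only an illustrative sketch, not the SDK's actual implementation: the raw event interfaces are heavily simplified assumptions, the Google audio payload is assumed to be already decoded into bytes, treating OpenAI's response.done as the end-of-turn signal is an assumption, and the names normalizeOpenAI and normalizeGoogle are hypothetical.

```typescript
// Normalized events, mirroring the shapes described above.
type NormalizedEvent =
  | { type: 'text'; delta: string }
  | { type: 'audio'; chunk: Uint8Array }
  | { type: 'turnComplete' };

// Simplified, assumed raw shapes; real provider payloads carry many more fields.
interface OpenAIRawEvent {
  type: string;
  delta?: string;
}
interface GooglePart {
  text?: string;
  audio?: Uint8Array; // assumption: audio already decoded from base64 inline data
}
interface GoogleRawMessage {
  serverContent?: { modelTurn?: { parts?: GooglePart[] }; turnComplete?: boolean };
}

// Hypothetical mapper for OpenAI realtime server events.
function normalizeOpenAI(ev: OpenAIRawEvent): NormalizedEvent[] {
  if (ev.type === 'response.audio_transcript.delta' && ev.delta) {
    return [{ type: 'text', delta: ev.delta }];
  }
  // Assumption: 'response.done' is treated as the end-of-turn signal.
  if (ev.type === 'response.done') return [{ type: 'turnComplete' }];
  return []; // every other event type is ignored in this sketch
}

// Hypothetical mapper for Google Live server messages.
function normalizeGoogle(msg: GoogleRawMessage): NormalizedEvent[] {
  const out: NormalizedEvent[] = [];
  for (const part of msg.serverContent?.modelTurn?.parts ?? []) {
    if (part.text) out.push({ type: 'text', delta: part.text });
    if (part.audio) out.push({ type: 'audio', chunk: part.audio });
  }
  if (msg.serverContent?.turnComplete) out.push({ type: 'turnComplete' });
  return out;
}
```

Returning an array from each mapper lets a single provider message fan out into several normalized events (for example, a Google message carrying both a part and the turnComplete flag), while unrecognised messages simply produce no events.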

  • Realtime guide — session options, interruption, audio format configuration
  • Audio input — non-realtime audio understanding