
Server-side conversation state

Send turn 1 ('Remember the number 42.'), capture the server-state id, then send turn 2 ('What number did I ask you to remember?') without resending the prior messages, and confirm that the model recalls 42.

In standard client-side history you resend the full conversation on every turn. For long conversations this means:

  • Growing cost — input tokens increase each turn, even for context the model has already processed.
  • Growing latency — more tokens to transmit and process on each request.
  • Bandwidth — the transcript travels over the wire every single turn.
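A rough arithmetic sketch of how the transmitted tokens grow. It assumes every completed turn (user message plus assistant reply) adds the same number of tokens, which real conversations do not, and it counts only tokens sent in requests, not how a provider bills them. The function names are illustrative.

```typescript
// Tokens transmitted over a whole conversation, assuming every completed turn
// (user message + assistant reply) adds the same number of tokens.

// Client-side history: request k resends the k - 1 earlier turns plus the new message.
function clientSideTokensSent(turns: number, tokensPerTurn: number, newMessageTokens: number): number {
  let total = 0;
  for (let k = 1; k <= turns; k++) {
    total += (k - 1) * tokensPerTurn + newMessageTokens;
  }
  return total;
}

// Server-side state: each request carries only the new message
// (the short previous-response id is ignored here).
function serverSideTokensSent(turns: number, newMessageTokens: number): number {
  return turns * newMessageTokens;
}

console.log(clientSideTokensSent(20, 300, 50)); // 58000
console.log(serverSideTokensSent(20, 50)); // 1000
```

With 20 turns of 300 tokens and 50-token user messages, resending history transmits 58,000 tokens in total against 1,000 for the id-based approach; the client-side total grows quadratically with the number of turns.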

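OpenAI’s Responses API and xAI’s Interactions API both support server-side state: the provider stores the conversation on its servers, and on subsequent turns you send only a previous_response_id plus the new message. The provider reconstructs the earlier context from its server cache and combines it with the new user message, so the transcript no longer travels over the wire. This reduces request size and bandwidth, but not necessarily billed tokens: OpenAI, for example, still bills the earlier context in a response chain as input tokens.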

Anthropic and Google do not offer this feature — they always require the full history.
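At the wire level, chaining on OpenAI's Responses API comes down to one request field. The sketch below only builds request bodies and makes no network call; `buildTurn` is a hypothetical helper, `resp_abc123` is a placeholder id, and the xAI request shape is not shown.

```typescript
// Minimal shape of an OpenAI Responses API request body
// (POST https://api.openai.com/v1/responses); only the chaining fields are modelled.
interface ResponsesRequestBody {
  model: string;
  input: string;
  previous_response_id?: string;
}

// Hypothetical helper: builds the body for one turn.
function buildTurn(model: string, input: string, previousResponseId?: string): ResponsesRequestBody {
  const body: ResponsesRequestBody = { model, input };
  // Turn 2+: the previous response's id stands in for the resent transcript.
  if (previousResponseId !== undefined) {
    body.previous_response_id = previousResponseId;
  }
  return body;
}

const turn1 = buildTurn('gpt-4o', 'Remember the number 42.');
// 'resp_abc123' is a placeholder for the `id` returned by the turn-1 response.
const turn2 = buildTurn('gpt-4o', 'What number did I ask you to remember?', 'resp_abc123');
console.log(JSON.stringify(turn1));
// {"model":"gpt-4o","input":"Remember the number 42."}
console.log(JSON.stringify(turn2));
// {"model":"gpt-4o","input":"What number did I ask you to remember?","previous_response_id":"resp_abc123"}
```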

Step 1 — Create an LLMClient for a stateful provider

import { createLLM, type Message } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!, // e.g. 'openai/gpt-4o' or 'xai/grok-3'
  apiKey: process.env.LLM_API_KEY,
});

createLLM() automatically detects which API type the model uses: OpenAI models use the Responses API (api: 'responses') and xAI models use the Interactions API (api: 'interactions'). You do not need to configure this manually.
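How such detection might work, as a hypothetical sketch assuming routing is keyed on the provider prefix of the model string (as in 'openai/gpt-4o'). `detectApi`, `ApiKind`, and the 'stateless' label are illustrative names, not the SDK's API.

```typescript
// Hypothetical prefix-based routing; not the SDK's actual detection code.
type ApiKind = 'responses' | 'interactions' | 'stateless';

const API_BY_PROVIDER = new Map<string, ApiKind>([
  ['openai', 'responses'],
  ['xai', 'interactions'],
]);

function detectApi(model: string): ApiKind {
  const slash = model.indexOf('/');
  // 'openai/gpt-4o' -> 'openai'; a model string without a prefix has no provider.
  const provider = slash === -1 ? '' : model.slice(0, slash);
  // Anything not known to be stateful falls back to full-history requests.
  return API_BY_PROVIDER.get(provider) ?? 'stateless';
}

console.log(detectApi('openai/gpt-4o')); // responses
console.log(detectApi('xai/grok-3')); // interactions
console.log(detectApi('anthropic/example-model')); // stateless
```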

Step 2 — Send the first turn and capture the assistant message

const messages: Message[] = [
  { role: 'user', content: 'Remember the number 42.' },
];
const r1 = await llm.complete(messages);
// assistantMessage() stamps the server-state id (response_id / interaction_id)
// into the message's `origin.serverStateId` field.
messages.push(llm.assistantMessage(r1));

llm.assistantMessage(r1) does two things:

  1. Creates a role: 'assistant' message with the model’s text.
  2. When the client is on a stateful API (Responses or Interactions), embeds r1.id into origin.serverStateId on the message.

Without this step the next turn does not have the id needed to continue server-side.
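A hypothetical sketch of the stamping step. Only origin.serverStateId and origin.model are named in this guide; the provider and createdAt fields, `stampAssistantMessage`, and all the types below are assumptions for illustration, not the SDK's code.

```typescript
// Hypothetical types; only origin.serverStateId and origin.model are named in this guide.
interface Origin {
  provider: string;
  model: string;
  serverStateId?: string;
  createdAt: number; // epoch ms, assumed for later TTL checks
}

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  origin?: Origin;
}

interface CompletionResult {
  id: string;
  text: string;
}

interface ClientInfo {
  provider: string;
  model: string;
  stateful: boolean; // true on a Responses- or Interactions-style API
}

function stampAssistantMessage(r: CompletionResult, client: ClientInfo, now: number = Date.now()): ChatMessage {
  const origin: Origin = { provider: client.provider, model: client.model, createdAt: now };
  // 1. Always produce an assistant message with the model's text.
  // 2. On a stateful API, also record the response id for the next turn.
  if (client.stateful) {
    origin.serverStateId = r.id;
  }
  return { role: 'assistant', content: r.text, origin };
}

const stamped = stampAssistantMessage(
  { id: 'resp_1', text: 'OK, 42.' },
  { provider: 'openai', model: 'gpt-4o', stateful: true },
  0,
);
console.log(stamped.origin?.serverStateId); // resp_1
```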

Step 3 — Send the second turn — only the new message

messages.push({ role: 'user', content: 'What number did I ask you to remember?' });
// The SDK detects origin.serverStateId in the last assistant message,
// extracts it as previousResponseId, and sends only the new user message.
const r2 = await llm.complete(messages);
console.log(r2.text); // e.g. 'You asked me to remember 42.'

You pass the full messages array, but the SDK decides what to actually send. When it finds a usable serverStateId in the most recent assistant message (same provider and model, still within the TTL window), it sends only previousResponseId plus the new user message. The provider reconstructs the rest from its cache.
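The resolution rule can be sketched as a pure function. This is a hypothetical illustration, not the SDK's implementation: it assumes origin carries provider, model, and createdAt fields and that the TTL is a fixed window. It also shows why a bare assistant message without origin, or a switched model, forces the full-history fallback.

```typescript
// Hypothetical types; the SDK's real Message/origin shape may differ.
interface Origin {
  provider: string;
  model: string;
  serverStateId?: string;
  createdAt: number; // epoch ms
}

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  origin?: Origin;
}

interface SendPlan {
  previousResponseId?: string;
  messages: ChatMessage[]; // what actually goes over the wire
}

function planSend(
  convo: ChatMessage[],
  client: { provider: string; model: string; stateful: boolean },
  ttlMs: number,
  now: number = Date.now(),
): SendPlan {
  // Locate the most recent assistant message.
  let idx = -1;
  for (let i = convo.length - 1; i >= 0; i--) {
    if (convo[i].role === 'assistant') {
      idx = i;
      break;
    }
  }
  const origin = idx >= 0 ? convo[idx].origin : undefined;

  // Fall back to full history unless the id is usable: stateful API,
  // id present, same provider and model, and still inside the TTL window.
  if (
    !client.stateful ||
    origin === undefined ||
    origin.serverStateId === undefined ||
    origin.provider !== client.provider ||
    origin.model !== client.model ||
    now - origin.createdAt >= ttlMs
  ) {
    return { messages: convo };
  }

  // Send only the id plus the messages after the stamped assistant turn.
  return { previousResponseId: origin.serverStateId, messages: convo.slice(idx + 1) };
}
```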

The decision is automatic but observable. On the response object:

console.log(r2.id); // server-side response id for the next turn
console.log(r2.usage); // token counts for this call; how prior context is counted and billed depends on the provider

On a non-stateful provider (Anthropic, Google), the same code still works: the SDK transparently falls back to sending the full history. No code changes are needed to run the same application against a different provider.

To always send full history regardless of provider:

const r2 = await llm.complete(messages, { stateful: false });

stateful: false disables the server-state optimisation for this call. Use it when:

  • You are debugging and want to confirm what history the model is actually using.
  • Your provider has a server-state bug and you need a workaround.
  • You are doing a capability test that requires full-history semantics.

Step 6 — Pass an explicit previousResponseId


You can also manage the id yourself:

const r1 = await llm.complete([{ role: 'user', content: 'Set x = 7.' }]);
const stateId = r1.id;
// Later: just the new message + explicit id, no history array needed.
const r2 = await llm.complete(
  [{ role: 'user', content: 'What is x?' }],
  { previousResponseId: stateId },
);

When you set previousResponseId manually the SDK uses it verbatim and skips the automatic detection logic. This is useful when you persist state ids to a database and restore them across sessions.
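One way the three modes could be resolved, as a hypothetical sketch. The guide states that an explicit previousResponseId skips auto-detection and that stateful: false forces full history; what happens when both are set is not documented, so giving the explicit id priority here is an assumption. `ExecuteOptionsSketch` and `resolveMode` are illustrative names.

```typescript
// Hypothetical option handling. Explicit-id-first precedence is an assumption.
interface ExecuteOptionsSketch {
  stateful?: boolean; // defaults to true
  previousResponseId?: string; // manual id, used verbatim
}

type SendMode =
  | { kind: 'explicit-id'; id: string }
  | { kind: 'full-history' }
  | { kind: 'auto-detect' };

function resolveMode(opts: ExecuteOptionsSketch = {}): SendMode {
  // A manual id is used as-is and skips auto-detection.
  if (opts.previousResponseId !== undefined) {
    return { kind: 'explicit-id', id: opts.previousResponseId };
  }
  // stateful: false always sends the full transcript.
  if (opts.stateful === false) {
    return { kind: 'full-history' };
  }
  // Default (stateful: true): inspect the last assistant message's origin.
  return { kind: 'auto-detect' };
}

console.log(resolveMode().kind); // auto-detect
console.log(resolveMode({ stateful: false }).kind); // full-history
console.log(resolveMode({ previousResponseId: 'resp_9' }).kind); // explicit-id
```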

| Option / field | Where | Behaviour |
|---|---|---|
| stateful: true | Default | SDK auto-detects the server-state id in the last assistant message and optimises the send. |
| stateful: false | ExecuteOptions | Always sends full history. No server-state optimisation. Works on all providers. |
| previousResponseId | ExecuteOptions | Manual: pass the id explicitly. The SDK uses it verbatim and skips auto-detection. |
| llm.assistantMessage(r) | LLMClient method | Creates the assistant Message with origin.serverStateId embedded. Required for auto-detection to work on the next turn. |

Server-state availability:

| Provider / API | Server-state support | Id field |
|---|---|---|
| OpenAI Responses API | Yes | response_id |
| xAI Interactions API | Yes | interaction_id |
| Anthropic Messages API | No (full history always) | n/a |
| Google Generative AI | No (full history always) | n/a |

The SDK’s fallback (full history on non-stateful providers) makes your code portable: point model at a different provider and the conversation still works, just without the server-state savings.

TTL and id expiry: server-side state ids expire (24 hours on OpenAI at the time of writing). If you store an id and replay it after the TTL, the provider returns an error. The SDK does not retry automatically; it propagates the provider error so you can handle it (for example, by resending the full history).
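When ids are persisted across sessions, a client-side age check avoids replaying an id that has certainly expired. A minimal in-memory sketch, assuming a fixed TTL; `StateIdStore` is hypothetical and stands in for a database table. The provider remains the authority on whether an id is still valid, so the error handling above is still needed.

```typescript
// Hypothetical in-memory stand-in for a table of persisted state ids.
interface StoredState {
  stateId: string;
  savedAt: number; // epoch ms
}

class StateIdStore {
  private readonly rows = new Map<string, StoredState>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  save(conversationId: string, stateId: string, now: number = Date.now()): void {
    this.rows.set(conversationId, { stateId, savedAt: now });
  }

  // Returns the id only while it is inside the TTL window; stale rows are dropped
  // so the caller knows to resend the full history instead.
  load(conversationId: string, now: number = Date.now()): string | undefined {
    const row = this.rows.get(conversationId);
    if (row === undefined) return undefined;
    if (now - row.savedAt >= this.ttlMs) {
      this.rows.delete(conversationId);
      return undefined;
    }
    return row.stateId;
  }
}

// 24 hours, the OpenAI figure quoted above; confirm against your provider.
const DAY_MS = 24 * 60 * 60 * 1000;
const idStore = new StateIdStore(DAY_MS);
idStore.save('conv-1', 'resp_1', 0);
console.log(idStore.load('conv-1', DAY_MS - 1)); // resp_1
console.log(idStore.load('conv-1', DAY_MS)); // undefined
```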

import { createLLM, type Message } from '@combycode/llm-sdk';

// Server-state is ON by default. Where the provider supports it (OpenAI, xAI),
// turn 2 sends ONLY the prior response id + the new turn; the SDK drops the
// transcript. `assistantMessage()` stamps the response into history with the
// server id, and the SDK decides id-vs-history (provider/model/TTL).
const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const t0 = performance.now();
const messages: Message[] = [{ role: 'user', content: 'Remember the number 42.' }];
const r1 = await llm.complete(messages);
messages.push(llm.assistantMessage(r1));
messages.push({ role: 'user', content: 'What number did I ask you to remember? Reply with just the number.' });
const r2 = await llm.complete(messages);

console.log(JSON.stringify({ result: r2.text.trim(), ms: Math.round(performance.now() - t0) }));

The structural difference: OpenAI’s Responses API exposes previous_response_id as a request field and returns the id in the response. Vanilla OpenAI SDK code must extract response.id, store it, and pass it back manually. There is no equivalent feature in the Anthropic or Google SDKs. ORXA automates the extraction and re-injection via llm.assistantMessage() + the stateful resolution logic in complete(), and provides the same code path with a transparent fallback for providers that do not support server state.

assistantMessage() is required for auto-detection. If you push a bare { role: 'assistant', content: r1.text } the message carries no origin and the SDK cannot find the server-state id. Always use llm.assistantMessage(r) to stamp assistant turns in stateful conversations.

Expired ids throw. OpenAI and xAI return a 4xx error when a state id has expired. Wrap the second-turn call in a try/catch and fall back to resending full history if you receive this error in long-running or persisted sessions.
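A sketch of that fallback, with the transport injected so it runs without network access. The `send` signature, the error's `status` field, and treating only 400 and 404 as "expired id" are assumptions: a 401 or 429 is also a 4xx error but should not trigger a history resend. Check the SDK's real error type before adapting this.

```typescript
// Hypothetical transport and error shape; adapt to the SDK's real types.
interface Msg {
  role: 'user' | 'assistant';
  content: string;
}

interface OutboundRequest {
  messages: Msg[];
  previousResponseId?: string;
}

interface Reply {
  id: string;
  text: string;
}

type Send = (req: OutboundRequest) => Promise<Reply>;

// Assumption: an expired or unknown state id surfaces as HTTP 400 or 404.
// Other 4xx codes (401, 429, ...) are not id problems, so they are rethrown.
function isExpiredIdError(err: unknown): boolean {
  const status = (err as { status?: unknown } | null | undefined)?.status;
  return status === 400 || status === 404;
}

async function completeWithFallback(
  send: Send,
  fullHistory: Msg[],
  newMessages: Msg[],
  previousResponseId: string,
): Promise<Reply> {
  try {
    // Cheap path: id plus only the new messages.
    return await send({ messages: newMessages, previousResponseId });
  } catch (err) {
    if (!isExpiredIdError(err)) throw err;
    // The server no longer has the state: resend the whole transcript.
    return send({ messages: fullHistory });
  }
}
```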

Model pinning for server state. The SDK checks that the origin.model in the assistant message matches the current client’s model before sending the server-state id. If you switch model mid-conversation (e.g. upgrade from gpt-4o-mini to gpt-4o) the id is silently dropped and full history is sent instead.

Next steps: