Server-side conversation state

What you will achieve

Send turn 1 ('Remember the number 42.'), capture the server-state id, then send turn 2 ('What number did I ask you to remember?') without resending any prior messages to the provider, and confirm the model recalls 42.

When and why you need this

With standard client-side history, you resend the full conversation on every turn. For long conversations this means:

Growing cost — input tokens increase each turn, even for context the model has already processed.
Growing latency — more tokens to transmit and process on each request.
Bandwidth — the transcript travels over the wire every single turn.
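
To put rough numbers on that growth, the sketch below counts the tokens transmitted over a conversation when the full history is resent each turn, compared with sending only the new user message. The per-message sizes (`userTokens`, `replyTokens`) are made-up averages, and the count covers request payloads only, not how any provider bills them.

```typescript
// Tokens sent in the request for turn k (1-based) when the full history is resent:
// k user messages plus the k - 1 assistant replies that precede the new message.
function fullHistoryRequestTokens(k: number, userTokens: number, replyTokens: number): number {
  return k * userTokens + (k - 1) * replyTokens;
}

// Total tokens transmitted across n turns under each strategy.
function totalTransmitted(n: number, userTokens: number, replyTokens: number) {
  let fullHistory = 0;
  for (let k = 1; k <= n; k++) {
    fullHistory += fullHistoryRequestTokens(k, userTokens, replyTokens);
  }
  // If only the new user message travels each turn, the total grows linearly.
  const newMessageOnly = n * userTokens;
  return { fullHistory, newMessageOnly };
}

// Assumed averages: 50 tokens per user message, 200 per reply, 20 turns.
const totals = totalTransmitted(20, 50, 200);
console.log(totals);
```

With these assumed sizes, 20 turns transmit 48,500 tokens when the full history is resent, against 1,000 when only the new message is sent.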

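OpenAI’s Responses API and xAI’s Interactions API both support server-side state: the provider stores the conversation on its servers, and on subsequent turns you send only a previous_response_id plus the new user message. The provider reconstructs the context from its stored state. You transmit only the new tokens; whether the stored context is billed again depends on the provider (OpenAI documents that earlier input tokens in a previous_response_id chain are still billed as input tokens).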

Anthropic and Google do not offer this feature — they always require the full history.

Step by step

Step 1 — Create an `LLMClient` for a stateful provider

import { createLLM, type Message } from '@combycode/llm-sdk';

const llm = createLLM({
  model: process.env.LLM_MODEL!,   // e.g. 'openai/gpt-4o' or 'xai/grok-3'
  apiKey: process.env.LLM_API_KEY,
});

createLLM() automatically detects which API type the model uses. OpenAI models use the Responses API (api: 'responses'); xAI models use the Interactions API (api: 'interactions'). You do not configure this manually.

Step 2 — Send the first turn and capture the assistant message

const messages: Message[] = [
  { role: 'user', content: 'Remember the number 42.' },
];

const r1 = await llm.complete(messages);

// assistantMessage() stamps the server-state id (response_id / interaction_id)
// into the message's `origin.serverStateId` field.
messages.push(llm.assistantMessage(r1));

llm.assistantMessage(r1) does two things:

Creates a role: 'assistant' message with the model’s text.
When the client is on a stateful API (Responses or Interactions), embeds r1.id into origin.serverStateId on the message.

Without this step the next turn does not have the id needed to continue server-side.
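
As an illustration, a stamped message might look roughly like the object built below. The `StampedMessage` type and the `stampAssistantMessage` helper are hypothetical stand-ins written for this sketch, not the SDK's actual types or implementation; only the `origin.serverStateId` and `origin.model` field names come from this page, and the `provider` field is an assumption.

```typescript
// Hypothetical local types for illustration; the real SDK types may differ.
interface Origin {
  provider: string;        // assumed field
  model: string;
  serverStateId?: string;  // present only when the API is stateful
}

interface StampedMessage {
  role: 'user' | 'assistant';
  content: string;
  origin?: Origin;
}

interface ResponseLike {
  id: string;
  text: string;
}

// Builds an assistant message and embeds the response id when the API is stateful.
function stampAssistantMessage(
  r: ResponseLike,
  provider: string,
  model: string,
  statefulApi: boolean,
): StampedMessage {
  const origin: Origin = { provider, model };
  if (statefulApi) {
    origin.serverStateId = r.id;
  }
  return { role: 'assistant', content: r.text, origin };
}

const stamped = stampAssistantMessage({ id: 'resp_123', text: 'Got it: 42.' }, 'openai', 'gpt-4o', true);
// A hand-built message carries no origin, so there is nothing to detect later.
const bare: StampedMessage = { role: 'assistant', content: 'Got it: 42.' };

console.log(stamped.origin?.serverStateId);
console.log(bare.origin?.serverStateId);
```

The first log prints the embedded id and the second prints `undefined`, which is why a hand-built assistant message cannot continue the conversation server-side.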

Step 3 — Send the second turn — only the new message

messages.push({ role: 'user', content: 'What number did I ask you to remember?' });

// The SDK detects origin.serverStateId in the last assistant message,
// extracts it as previousResponseId, and sends only the new user message.
const r2 = await llm.complete(messages);

console.log(r2.text); // 'You asked me to remember 42.'

You pass the full messages array, but the SDK decides what to actually send. When it finds a usable serverStateId in the most recent assistant message (same provider, same model, and still within the TTL window), it sends only previousResponseId plus the new user message. The provider reconstructs the rest from its stored state.
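
The decision described above can be sketched as a small pure function. This is an illustrative reconstruction under stated assumptions, not the SDK's source: `resolvePayload`, the `stampedAt` timestamp, the `ttlMs` setting, and the local types are all hypothetical names introduced here.

```typescript
// Hypothetical shapes for illustration only.
interface Origin { provider: string; model: string; serverStateId?: string; stampedAt: number }
interface Msg { role: 'user' | 'assistant'; content: string; origin?: Origin }
interface ClientInfo { provider: string; model: string; ttlMs: number }
interface Payload { previousResponseId?: string; messages: Msg[] }

function resolvePayload(history: Msg[], client: ClientInfo, now: number): Payload {
  // Locate the most recent assistant message.
  let idx = -1;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i]?.role === 'assistant') { idx = i; break; }
  }
  const origin = idx >= 0 ? history[idx]?.origin : undefined;

  // No stamped id: fall back to the full history.
  if (!origin || !origin.serverStateId) return { messages: history };

  const sameTarget = origin.provider === client.provider && origin.model === client.model;
  const fresh = now - origin.stampedAt < client.ttlMs;
  if (!sameTarget || !fresh) return { messages: history };

  // Usable id: send it plus only the messages that follow the stamped reply.
  return { previousResponseId: origin.serverStateId, messages: history.slice(idx + 1) };
}

const history: Msg[] = [
  { role: 'user', content: 'Remember the number 42.' },
  { role: 'assistant', content: 'Noted.',
    origin: { provider: 'openai', model: 'gpt-4o', serverStateId: 'resp_1', stampedAt: 0 } },
  { role: 'user', content: 'What number did I ask you to remember?' },
];
const client: ClientInfo = { provider: 'openai', model: 'gpt-4o', ttlMs: 24 * 60 * 60 * 1000 };

const payload = resolvePayload(history, client, 60_000);
console.log(payload.previousResponseId, payload.messages.length);
```

Here the payload carries the stamped id and a single message. Changing `client.model`, or passing a `now` beyond the TTL, makes the same call return the full three-message history with no id.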

Step 4 — Inspect what was actually sent

The decision is automatic but observable. On the response object:

console.log(r2.id);      // server-side response id for the next turn
console.log(r2.usage);   // token usage the provider reports for this turn; compare with r1.usage

On a non-stateful provider (Anthropic, Google) the same code still works — the SDK transparently falls back to sending the full history. No code change needed when you run the same application against a different provider.

Step 5 — Opt out of server-state

To always send full history regardless of provider:

const r2 = await llm.complete(messages, { stateful: false });

stateful: false disables the server-state optimisation for this call. Use it when:

You are debugging and want to confirm what history the model is actually using.
Your provider has a server-state bug and you need a workaround.
You are doing a capability test that requires full-history semantics.

Step 6 — Pass an explicit `previousResponseId`

You can also manage the id yourself:

const r1 = await llm.complete([{ role: 'user', content: 'Set x = 7.' }]);
const stateId = r1.id;

// Later -- just the new message + explicit id, no history array needed:
const r2 = await llm.complete(
  [{ role: 'user', content: 'What is x?' }],
  { previousResponseId: stateId },
);

When you set previousResponseId manually the SDK uses it verbatim and skips the automatic detection logic. This is useful when you persist state ids to a database and restore them across sessions.
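
A minimal sketch of that persistence pattern follows, using an in-memory `Map` as a stand-in for a database table. `StateIdStore` and its methods are hypothetical helpers, not part of the SDK, and the TTL you configure should match whatever your provider documents.

```typescript
interface StoredState {
  stateId: string;
  savedAt: number; // epoch milliseconds
}

// Keeps one server-state id per session and refuses to return stale ones.
class StateIdStore {
  private readonly rows = new Map<string, StoredState>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  save(sessionId: string, stateId: string, now: number = Date.now()): void {
    this.rows.set(sessionId, { stateId, savedAt: now });
  }

  // Returns the stored id, or undefined when it is missing or past the TTL.
  load(sessionId: string, now: number = Date.now()): string | undefined {
    const row = this.rows.get(sessionId);
    if (!row) return undefined;
    if (now - row.savedAt >= this.ttlMs) {
      this.rows.delete(sessionId);
      return undefined;
    }
    return row.stateId;
  }
}

const store = new StateIdStore(24 * 60 * 60 * 1000);
store.save('session-1', 'resp_abc');
console.log(store.load('session-1'));
```

When `load` returns an id, pass it as `previousResponseId`; when it returns `undefined`, send the full history instead.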

Your options

Option / field	Where	Behaviour
`stateful: true`	`ExecuteOptions` (default)	SDK auto-detects server-state id in the last assistant message and optimises the send.
`stateful: false`	`ExecuteOptions`	Always send full history. No server-state optimisation. Works on all providers.
`previousResponseId`	`ExecuteOptions`	Manual: pass the id explicitly. SDK uses it verbatim; skips auto-detection.
`llm.assistantMessage(r)`	`LLMClient` method	Creates the assistant `Message` with `origin.serverStateId` embedded. Required for auto-detection to work on the next turn.

Server-state availability:

Provider / API	Server-state support	Id field
OpenAI Responses API	Yes	`response_id`
xAI Interactions API	Yes	`interaction_id`
Anthropic Messages API	No — full history always	—
Google Generative AI	No — full history always	—

The SDK’s fallback (full history on non-stateful providers) means your code is portable: swap the provider prefix in model for a non-stateful provider and the conversation still works, just without the bandwidth/cost savings.

TTL and id expiry: server-side state ids expire (24 hours on OpenAI at the time of writing). If you store an id and replay it after the TTL the provider returns an error. The SDK does not retry automatically — it propagates the provider error so you can handle it (e.g. by resending full history).

Compare the SDKs

import { createLLM, type Message } from '@combycode/llm-sdk';

// Server-state is ON by default. Where the provider supports it (OpenAI, xAI),
// turn 2 sends ONLY the prior response id + the new turn; the SDK drops the
// transcript. `assistantMessage()` stamps the response into history with the
// server id; the SDK decides id-vs-history (provider/model/TTL).
const llm = createLLM({ model: process.env.LLM_MODEL!, apiKey: process.env.LLM_API_KEY });

const t0 = performance.now();
const messages: Message[] = [{ role: 'user', content: 'Remember the number 42.' }];
const r1 = await llm.complete(messages);
messages.push(llm.assistantMessage(r1));
messages.push({ role: 'user', content: 'What number did I ask you to remember? Reply with just the number.' });
const r2 = await llm.complete(messages);

console.log(JSON.stringify({ result: r2.text.trim(), ms: Math.round(performance.now() - t0) }));

The structural difference: OpenAI’s Responses API exposes previous_response_id as a request field and returns the id in the response. Vanilla OpenAI SDK code must extract response.id, store it, and pass it back manually. There is no equivalent feature in the Anthropic or Google SDKs. ORXA automates the extraction and re-injection via llm.assistantMessage() + the stateful resolution logic in complete(), and provides the same code path with a transparent fallback for providers that do not support server state.
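
For contrast, the manual flow with OpenAI's Responses API looks roughly like the sketch below. To keep it self-contained, the function accepts a `ResponsesClient` interface, which is a simplified, hand-written slice of the `openai` package's `client.responses.create` call (an assumption of this example); with the real package you would call that method on an `OpenAI` client directly.

```typescript
// Simplified structural slice of the OpenAI Node SDK's Responses API surface.
interface ResponsesClient {
  responses: {
    create(params: {
      model: string;
      input: string;
      previous_response_id?: string;
    }): Promise<{ id: string; output_text: string }>;
  };
}

async function twoTurnsManually(client: ResponsesClient, model: string): Promise<string> {
  const first = await client.responses.create({
    model,
    input: 'Remember the number 42.',
  });

  // The caller is responsible for extracting and storing the id...
  const previousId = first.id;

  // ...and for passing it back on the next request.
  const second = await client.responses.create({
    model,
    input: 'What number did I ask you to remember?',
    previous_response_id: previousId,
  });
  return second.output_text;
}
```

The id bookkeeping in the middle is the part that `llm.assistantMessage()` and `complete()` take over.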

Gotchas and next steps

assistantMessage() is required for auto-detection. If you push a bare { role: 'assistant', content: r1.text } the message carries no origin and the SDK cannot find the server-state id. Always use llm.assistantMessage(r) to stamp assistant turns in stateful conversations.

Expired ids throw. OpenAI and xAI return a 4xx error when a state id has expired. Wrap the second-turn call in a try/catch and fall back to resending full history if you receive this error in long-running or persisted sessions.
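
One way to structure that fallback is sketched below. `completeWithFallback` and the `isExpiredStateError` predicate are hypothetical helpers: this page does not document the shape of the provider error, so the check is left to the caller. The only SDK feature relied on is the `stateful: false` option described in Step 5.

```typescript
// Any function with the same calling convention as llm.complete(messages, options).
type CompleteFn<M, R> = (messages: M[], options?: { stateful?: boolean }) => Promise<R>;

async function completeWithFallback<M, R>(
  complete: CompleteFn<M, R>,
  messages: M[],
  isExpiredStateError: (err: unknown) => boolean,
): Promise<R> {
  try {
    // First attempt: let the SDK use the server-state id if it finds one.
    return await complete(messages);
  } catch (err) {
    // Unrelated failures are not swallowed.
    if (!isExpiredStateError(err)) throw err;
    // Expired id: retry once with the full history.
    return complete(messages, { stateful: false });
  }
}
```

A call might look like `completeWithFallback((m, o) => llm.complete(m, o), messages, looksExpired)`, where `looksExpired` is your own check against the error your provider returns.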

Model pinning for server state. The SDK checks that the origin.model in the assistant message matches the current client’s model before sending the server-state id. If you switch model mid-conversation (e.g. upgrade from gpt-4o-mini to gpt-4o) the id is silently dropped and full history is sent instead.

Next steps:

Multi-turn conversation — client-side history baseline
Prompt caching — reduce cost on repeated large prefixes (complementary to server state)
Layered context — manage dynamic system prompt layers across turns