Count input tokens

What you will achieve

Count the tokens in a prompt before sending it — using one countTokens() call that automatically uses the correct counting method for each provider.

When and why you need this

You need token counts to:

Enforce prompt budgets — bail out before sending a prompt that would exceed the model’s context window or your cost budget.
Select models dynamically — choose a larger-context model when a prompt is long.
Estimate cost — multiply by per-token price before committing to the request.
Chunk documents — split inputs into pieces that fit within a model’s limit.

Each provider counts tokens differently:

OpenAI uses tiktoken, a local BPE tokeniser (no network call, instantaneous).
Anthropic exposes a messages.countTokens beta API endpoint (network call, ~100ms).
Google exposes models.countTokens (network call).
xAI and OpenRouter have no count API, so the SDK falls back to a character heuristic.

Setting up each one manually requires separate packages, separate API calls, and separate error handling.

Step by step

Step 1 — Count a plain text prompt

import { countTokens } from '@combycode/llm-sdk';

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: 'The quick brown fox jumps over the lazy dog.',
});

console.log(n); // e.g. 10 (varies by model tokenizer)

countTokens() resolves to a plain number. The call is async because Anthropic and Google require network round-trips; for OpenAI no network call is made, so the promise resolves almost immediately.

Step 2 — Count a multi-turn conversation

Pass a Message[] array to count the tokens for the entire conversation, including role delimiters and turn boundaries that the model tokeniser adds:

import { countTokens, type Message } from '@combycode/llm-sdk';

const messages: Message[] = [
  { role: 'system',    content: 'You are a helpful assistant.' },
  { role: 'user',      content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user',      content: 'And Germany?' },
];

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: messages,
});

console.log(`Conversation is ${n} tokens`);

Step 3 — Use token count to enforce a budget

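import { countTokens, complete } from '@combycode/llm-sdk';

// myLongDocumentText is assumed to be a string you loaded elsewhere.
const MAX_INPUT_TOKENS = 4000;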

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: myLongDocumentText,
});

if (n > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too long: ${n} tokens, limit is ${MAX_INPUT_TOKENS}`);
}

// Safe to send
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: myLongDocumentText,
  maxTokens: 512,
});
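Throwing is one way to enforce the budget; splitting the input is another. The following is a minimal sketch of paragraph-based chunking, assuming a pluggable counter function. The names TokenCounter, chunkByTokens, and heuristicCounter are hypothetical and are not part of @combycode/llm-sdk; the stand-in counter uses chars / 4 so the sketch runs offline.

```typescript
// Hypothetical types and names; not part of @combycode/llm-sdk.
type TokenCounter = (text: string) => Promise<number>;

// Greedily pack paragraphs into chunks whose token count stays within maxTokens.
// A single paragraph that exceeds the limit on its own is emitted as its own
// (oversized) chunk rather than being split mid-paragraph.
async function chunkByTokens(
  text: string,
  maxTokens: number,
  count: TokenCounter,
): Promise<string[]> {
  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && (await count(candidate)) > maxTokens) {
      chunks.push(current); // close the current chunk
      current = paragraph;  // start a new one with this paragraph
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Stand-in counter (chars / 4) so the sketch runs without a network call.
const heuristicCounter: TokenCounter = async (s) => Math.ceil(s.length / 4);
```

In real use the counter would presumably wrap the SDK call, for example (s) => countTokens({ model, apiKey, input: s }). Note that this sketch counts once per paragraph, so with a network-backed counter it issues many requests for a long document.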

Step 4 — Use token count for dynamic model selection

When you do not know ahead of time how long the input will be, count it first and route on the result:

import { countTokens, complete } from '@combycode/llm-sdk';

const n = await countTokens({
  model: 'openai/gpt-4o-mini',  // Use a small model's tokenizer for estimation
  apiKey: process.env.OPENAI_KEY,
  input: largeDocumentText,
});

// gpt-4o-mini has 128k context; gpt-4o has 128k too but costs more.
// For documents > 32k tokens, route to the full 4o for better coherence:
const model = n > 32_000
  ? 'openai/gpt-4o'
  : 'openai/gpt-4o-mini';

const { text } = await complete({
  model,
  apiKey: process.env.OPENAI_KEY,
  prompt: largeDocumentText,
  maxTokens: 1024,
});
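If you route between more than two models, the ternary above can be generalised into a small routing table. This is only a sketch: the Route shape and selectModel function are hypothetical, and the 120,000-token ceiling is an assumed value chosen to leave headroom for output below the 128k context mentioned above.

```typescript
// Hypothetical routing helper; not part of @combycode/llm-sdk.
interface Route {
  maxInputTokens: number; // largest input this route should accept
  model: string;
}

// Return the model of the first route (smallest limit first) that fits the input.
function selectModel(tokens: number, routes: Route[]): string {
  const sorted = [...routes].sort((a, b) => a.maxInputTokens - b.maxInputTokens);
  const match = sorted.find((r) => tokens <= r.maxInputTokens);
  if (!match) {
    throw new Error(`No route accepts ${tokens} input tokens`);
  }
  return match.model;
}

const routes: Route[] = [
  { maxInputTokens: 32_000, model: 'openai/gpt-4o-mini' },
  // Assumed ceiling: leaves room for output within a 128k context window.
  { maxInputTokens: 120_000, model: 'openai/gpt-4o' },
];

console.log(selectModel(50_000, routes)); // openai/gpt-4o
```

Throwing when no route fits keeps the failure explicit, so the caller can decide whether to chunk or reject the input.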

Step 5 — Use history’s built-in token estimate

For a ConversationHistory managed by an AgentLoop, the history object tracks token estimates across turns without extra calls:

import { ConversationHistory } from '@combycode/llm-sdk';

const history = new ConversationHistory();
// ... after several turns of conversation ...

const estimated = history.estimatedTokens();
console.log(`History is roughly ${estimated} tokens`);

estimatedTokens() uses the exact inputTokens reported by the provider in the most recent response as an anchor, then adds estimates for any messages appended since. This is accurate to within 5-10% for English text and requires no network call.
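To make the anchor-plus-estimate idea concrete, here is a hypothetical illustration. It is not the SDK's ConversationHistory implementation, whose internals this guide does not show; the class name and the chars / 4 estimate for new messages are assumptions.

```typescript
// Hypothetical illustration only; the real ConversationHistory may differ.
class AnchoredTokenEstimate {
  private anchorTokens = 0; // last exact inputTokens reported by the provider
  private pendingChars = 0; // characters appended since that report

  // Call whenever a message is appended to the history.
  append(content: string): void {
    this.pendingChars += content.length;
  }

  // Call when a response arrives with the provider's exact input usage.
  // That number already covers everything sent so far, so pending resets.
  recordUsage(inputTokens: number): void {
    this.anchorTokens = inputTokens;
    this.pendingChars = 0;
  }

  // Exact anchor plus a chars / 4 estimate for anything appended afterwards.
  estimatedTokens(): number {
    return this.anchorTokens + Math.ceil(this.pendingChars / 4);
  }
}

const estimate = new AnchoredTokenEstimate();
estimate.append('x'.repeat(40)); // user message, not yet sent
estimate.recordUsage(12);        // provider reports 12 exact input tokens
estimate.append('y'.repeat(8));  // assistant reply appended after the report
console.log(estimate.estimatedTokens()); // 14
```

Only the messages added after the last response are estimated, which is why the error stays small compared with estimating the whole history.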

Your options

countTokens() accepts:

Option	Type	Notes
`model`	`string`	Required. Determines which counter to use.
`apiKey`	`string`	Required for Anthropic and Google (network call). Can be omitted for OpenAI (local tiktoken).
`input`	`string | Message[]`	Required. The text or message array to count.

Counting method per provider:

Provider	Method	Network?	Accuracy
`openai/...`	Local tiktoken encoder	No	Exact for most GPT models
`anthropic/...`	`messages.countTokens` beta endpoint	Yes (~100ms)	Exact
`google/...`	`models.countTokens` API	Yes (~150ms)	Exact
`xai/...`	Character heuristic (chars / 4)	No	Approximate (+/- 20%)
`openrouter/...`	Character heuristic (chars / 4)	No	Approximate (+/- 20%)

For providers that use the heuristic: the character-based estimate is fast and good enough for budget enforcement with a safety margin. Add a 20-25% buffer when using it as a hard cut-off.
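As a rough sketch of the heuristic and the buffer, assuming the chars / 4 rule from the table above (the function names are hypothetical, and the SDK's exact rounding is not documented here):

```typescript
// Hypothetical helpers; not part of @combycode/llm-sdk.

// chars / 4, rounded up. Note that String.length counts UTF-16 code units,
// so non-English text can deviate further from the real token count.
function heuristicTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Treat the estimate as (1 + buffer) times larger before comparing it
// against a hard limit. The default 0.25 matches the upper end of the
// 20-25% margin suggested above.
function fitsBudgetWithBuffer(text: string, hardLimit: number, buffer = 0.25): boolean {
  return heuristicTokens(text) * (1 + buffer) <= hardLimit;
}

const sample = 'a'.repeat(400);                 // estimated at 100 tokens
console.log(fitsBudgetWithBuffer(sample, 125)); // true  (100 * 1.25 = 125)
console.log(fitsBudgetWithBuffer(sample, 124)); // false
```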

When to use countTokens() vs history.estimatedTokens():

Use countTokens() when you have a discrete piece of text (a new document, a user message) and need an accurate count before attaching it to anything. Use history.estimatedTokens() when you have a live ConversationHistory and want to know if the conversation is approaching a context limit — it is cheaper because it leverages already-received usage data from the provider.

Compare the SDKs

import { countTokens } from '@combycode/llm-sdk';

// One `countTokens()` — picks the right counter per model from the catalog
// (tiktoken for OpenAI, count-API for Anthropic/Google, heuristic otherwise).
const t0 = performance.now();
const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: 'The quick brown fox jumps over the lazy dog.',
});

console.log(JSON.stringify({ result: String(n), ms: Math.round(performance.now() - t0) }));

The structural difference: official SDKs each require different setup. OpenAI’s tiktoken is a separate npm package you install and load a model encoding from. Anthropic’s count API takes the same shape as messages.create. Google’s count API takes a GenerateContentRequest. None of these share an interface. ORXA’s countTokens() is one async function — you pass a model string, it selects the right method automatically.
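One way to picture the "one function, several methods" design is a small dispatcher keyed on the provider prefix of the model string. This is a hypothetical sketch; the snippet above says the SDK picks the counter from its catalog, so the real selection logic is likely different.

```typescript
// Hypothetical sketch of per-provider routing; not the SDK's implementation.
type CountMethod = 'local-tokenizer' | 'count-api' | 'char-heuristic';

// Map a "provider/model" string to the counting method listed in the table
// of counting methods per provider.
function countMethodFor(model: string): CountMethod {
  const provider = model.split('/')[0];
  switch (provider) {
    case 'openai':
      return 'local-tokenizer'; // tiktoken, no network call
    case 'anthropic':
    case 'google':
      return 'count-api';       // provider count endpoint, network call
    default:
      return 'char-heuristic';  // chars / 4 fallback
  }
}

console.log(countMethodFor('openai/gpt-4o-mini')); // local-tokenizer
```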

Gotchas and next steps

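Anthropic and Google count calls hit the network. Each countTokens() call to Anthropic or Google is a separate API request: it adds latency and counts against the provider's rate limits. Both providers have documented their token-counting endpoints as free of charge, but verify current pricing and limits for your account. For high-frequency applications, cache the count for a given document string (content hash as cache key) rather than re-counting identical text on every request.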
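The caching suggestion above could look roughly like the wrapper below. It is a sketch under assumptions: TokenCounter and withCountCache are hypothetical names, the cache is an unbounded in-process Map, and each wrapped counter is assumed to be bound to a single model (counts differ between models, so otherwise the model name would need to be part of the key).

```typescript
import { createHash } from 'node:crypto';

// Hypothetical types and names; not part of @combycode/llm-sdk.
type TokenCounter = (text: string) => Promise<number>;

// Wrap a counter so identical text is counted only once per process.
// The SHA-256 hex digest of the text is the cache key.
function withCountCache(count: TokenCounter): TokenCounter {
  const cache = new Map<string, Promise<number>>();
  return (text) => {
    const key = createHash('sha256').update(text).digest('hex');
    let pending = cache.get(key);
    if (!pending) {
      pending = count(text);
      // Drop failed lookups so a later call can retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}
```

Caching the promise rather than the resolved number means concurrent requests for the same text share one in-flight call. In real use the wrapped counter would presumably be (s) => countTokens({ model, apiKey, input: s }).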

Tokenizer drift across model versions. OpenAI’s tiktoken uses a fixed BPE vocabulary per model family, and families differ: gpt-4 uses the cl100k_base encoding while gpt-4o uses o200k_base, so the same text produces different counts. Always count against the model you will actually use.

System prompt tokens. If you use a fixed system prompt on every call, count it once at startup and add that constant to each prompt count rather than re-counting it on every turn. The system prompt tokens are included in the provider’s reported usage.inputTokens on each response.

Next steps:

Prompt caching — cache large stable prompts to save tokens
Cost tracking — translate token counts into dollar estimates
Quickstart — the call that consumes those tokens