Count input tokens

Count the tokens in a prompt before sending it, with a single countTokens() call that picks the correct counting method for each provider.

You need token counts to:

  1. Enforce prompt budgets — bail out before sending a prompt that would exceed the model’s context window or your cost budget.
  2. Select models dynamically — choose a larger-context model when a prompt is long.
  3. Estimate cost — multiply by per-token price before committing to the request.
  4. Chunk documents — split inputs into pieces that fit within a model’s limit.
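
As a rough illustration of the cost-estimation use, the sketch below turns a token count into an estimated input cost. The helper name `estimateInputCostUsd` and the price used in the example are hypothetical placeholders, not values taken from any provider's price list.

```typescript
// Hypothetical helper: convert a token count into an estimated input cost.
// `usdPerMillionTokens` is a placeholder; look up the real price for your model.
function estimateInputCostUsd(tokens: number, usdPerMillionTokens: number): number {
  if (tokens < 0 || usdPerMillionTokens < 0) {
    throw new RangeError('tokens and price must be non-negative');
  }
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

// Example: 12,000 input tokens at an assumed $2.50 per million tokens.
const estimatedCostUsd = estimateInputCostUsd(12_000, 2.5);
console.log(estimatedCostUsd.toFixed(4)); // prints 0.0300
```

In practice the `tokens` argument would come from countTokens(), and output tokens would need to be priced separately.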

Each provider counts tokens differently:

  • OpenAI uses tiktoken, a local BPE tokeniser (no network call, instantaneous).
  • Anthropic exposes a messages.countTokens beta API endpoint (network call, ~100ms).
  • Google exposes models.countTokens (network call).
  • xAI and OpenRouter have no count API, so counting falls back to a character heuristic.

Setting up each one manually requires separate packages, separate API calls, and separate error handling.

import { countTokens } from '@combycode/llm-sdk';

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: 'The quick brown fox jumps over the lazy dog.',
});
console.log(n); // e.g. 10 (varies by model tokenizer)

countTokens() returns a promise that resolves to a plain number. The call is async because Anthropic and Google require network round-trips; for OpenAI the count is computed locally, so the promise resolves without any network I/O.

Step 2 — Count a multi-turn conversation

Pass a Message[] array to count the tokens for the entire conversation, including role delimiters and turn boundaries that the model tokeniser adds:

import { countTokens, type Message } from '@combycode/llm-sdk';

const messages: Message[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'And Germany?' },
];
const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: messages,
});
console.log(`Conversation is ${n} tokens`);

Step 3 — Use token count to enforce a budget

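import { countTokens, complete } from '@combycode/llm-sdk';

// myLongDocumentText: the document string, loaded elsewhere in your application.
const MAX_INPUT_TOKENS = 4000;

// Count first, so an oversized prompt is rejected before any completion call.
const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: myLongDocumentText,
});
if (n > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too long: ${n} tokens, limit is ${MAX_INPUT_TOKENS}`);
}

// Safe to send
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: myLongDocumentText,
  maxTokens: 512,
});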
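
Where throwing is too blunt, an over-budget document can instead be split into pieces that each fit the limit. The following is a minimal sketch of that idea, assuming paragraph boundaries are acceptable split points. The names `roughTokens` and `chunkByTokenBudget` are hypothetical, and the default estimator is a chars / 4 approximation rather than a real tokenizer.

```typescript
// Hypothetical estimator: roughly 4 characters per token (an approximation).
const roughTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedily pack paragraphs into chunks whose estimated size stays within maxTokens.
// A single paragraph larger than the limit is emitted as its own oversized chunk.
function chunkByTokenBudget(
  text: string,
  maxTokens: number,
  estimate: (t: string) => number = roughTokens,
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const para of text.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && estimate(candidate) > maxTokens) {
      // Adding this paragraph would overflow: close the chunk and start a new one.
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Because the estimator is injected, an exact count could be substituted for the heuristic, at the cost of making the function async for providers that count over the network.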

Step 4 — Use token count for dynamic model selection

When you do not know ahead of time whether the input will fit a small or large model:

import { countTokens, complete } from '@combycode/llm-sdk';

// largeDocumentText: the document string, loaded elsewhere in your application.
const n = await countTokens({
  model: 'openai/gpt-4o-mini', // Use a small model's tokenizer for estimation
  apiKey: process.env.OPENAI_KEY,
  input: largeDocumentText,
});
// gpt-4o-mini has 128k context; gpt-4o has 128k too but costs more.
// For documents > 32k tokens, route to the full 4o for better coherence:
const model = n > 32_000
  ? 'openai/gpt-4o'
  : 'openai/gpt-4o-mini';
const { text } = await complete({
  model,
  apiKey: process.env.OPENAI_KEY,
  prompt: largeDocumentText,
  maxTokens: 1024,
});

Step 5 — Use history’s built-in token estimate

For a ConversationHistory managed by an AgentLoop, the history object tracks token estimates across turns without extra calls:

import { ConversationHistory } from '@combycode/llm-sdk';
const history = new ConversationHistory();
// ... after several turns of conversation ...
const estimated = history.estimatedTokens();
console.log(`History is roughly ${estimated} tokens`);

estimatedTokens() anchors on the exact inputTokens value the provider reported in the most recent response, then adds estimates for any messages appended since. This is accurate to within 5-10% for English text and requires no network call.
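
To make the anchoring idea concrete, here is a minimal sketch of how such an estimate could be maintained. It is a hypothetical model, not the SDK's actual implementation: the class `AnchoredEstimator`, its method names, and the chars / 4 ratio for newer messages are all assumptions made for illustration.

```typescript
interface SketchMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical model of the anchoring idea; not the SDK's actual implementation.
class AnchoredEstimator {
  private anchorTokens = 0; // last exact input count reported by the provider
  private pendingChars = 0; // characters appended since that report

  // Record a message that has not yet been covered by a provider usage report.
  append(message: SketchMessage): void {
    this.pendingChars += message.content.length;
  }

  // Call with the exact input token count from a response. It covers everything
  // sent so far, so the pending tally resets.
  recordUsage(inputTokens: number): void {
    this.anchorTokens = inputTokens;
    this.pendingChars = 0;
  }

  // Exact anchor plus a chars / 4 guess for anything newer (assumed ratio).
  estimatedTokens(): number {
    return this.anchorTokens + Math.ceil(this.pendingChars / 4);
  }
}

const estimator = new AnchoredEstimator();
estimator.append({ role: 'user', content: 'What is the capital of France?' });
estimator.recordUsage(15); // pretend the provider reported 15 input tokens
estimator.append({ role: 'assistant', content: 'The capital of France is Paris.' });
console.log(estimator.estimatedTokens());
```

Only the messages added after the last usage report are estimated, which is why the error stays small: most of the total is an exact figure.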

countTokens() accepts:

Option   Type                Notes
model    string              Required. Determines which counter to use.
apiKey   string              Required for Anthropic and Google (network call). Can be omitted for OpenAI (local tiktoken).
input    string | Message[]  Required. The text or message array to count.

Counting method per provider:

Provider        Method                              Network?      Accuracy
openai/...      Local tiktoken encoder              No            Exact for most GPT models
anthropic/...   messages.countTokens beta endpoint  Yes (~100ms)  Exact
google/...      models.countTokens API              Yes (~150ms)  Exact
xai/...         Character heuristic (chars / 4)     No            Approximate (+/- 20%)
openrouter/...  Character heuristic (chars / 4)     No            Approximate (+/- 20%)

For providers that use the heuristic, the character-based estimate is fast and good enough for budget enforcement, provided you leave a safety margin. Add a 20-25% buffer when using it as a hard cut-off.
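
The heuristic and the safety buffer can be sketched in a few lines. The function names `heuristicTokens` and `fitsBudget` are hypothetical; the chars / 4 ratio and the 25% default margin come from the guidance above.

```typescript
// Character heuristic: assume roughly 4 characters per token.
function heuristicTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Treat the estimate as possibly low: inflate it by the safety margin
// before comparing against the hard limit.
function fitsBudget(text: string, maxTokens: number, safetyMargin = 0.25): boolean {
  return heuristicTokens(text) * (1 + safetyMargin) <= maxTokens;
}

const sample = 'a'.repeat(4000); // about 1000 tokens by the heuristic
console.log(fitsBudget(sample, 1250)); // true: 1000 * 1.25 = 1250
console.log(fitsBudget(sample, 1249)); // false: the buffered estimate exceeds the limit
```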

When to use countTokens() vs history.estimatedTokens():

Use countTokens() when you have a discrete piece of text (a new document, a user message) and need an accurate count before attaching it to anything. Use history.estimatedTokens() when you have a live ConversationHistory and want to know if the conversation is approaching a context limit — it is cheaper because it leverages already-received usage data from the provider.

import { countTokens } from '@combycode/llm-sdk';

// One `countTokens()` — picks the right counter per model from the catalog
// (tiktoken for OpenAI, count-API for Anthropic/Google, heuristic otherwise).
const t0 = performance.now();
const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: 'The quick brown fox jumps over the lazy dog.',
});

console.log(JSON.stringify({ result: String(n), ms: Math.round(performance.now() - t0) }));

The structural difference: official SDKs each require different setup. OpenAI’s tiktoken is a separate npm package you install and load a model encoding from. Anthropic’s count API takes the same shape as messages.create. Google’s count API takes a GenerateContentRequest. None of these share an interface. ORXA’s countTokens() is one async function — you pass a model string, it selects the right method automatically.
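
The routing described here can be pictured as a lookup on the provider prefix of the model string. This is only a sketch of the idea under that assumption, not the SDK's internals: `pickCountMethod` and the `CountMethod` labels are hypothetical names, and the mapping mirrors the per-provider table above.

```typescript
type CountMethod = 'local-tokenizer' | 'count-api' | 'char-heuristic';

// Hypothetical router: choose a counting method from the "provider/model" prefix.
function pickCountMethod(model: string): CountMethod {
  const provider = model.split('/')[0];
  switch (provider) {
    case 'openai':
      return 'local-tokenizer'; // counted locally, no network call
    case 'anthropic':
    case 'google':
      return 'count-api'; // provider count endpoint, network call
    default:
      return 'char-heuristic'; // xai, openrouter, and anything unrecognised
  }
}

console.log(pickCountMethod('openai/gpt-4o')); // local-tokenizer
console.log(pickCountMethod('xai/some-model')); // char-heuristic
```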

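Anthropic and Google count calls are network requests. Each countTokens() call to Anthropic or Google is a separate API request that adds latency and is subject to the provider's rate limits. Both providers currently document token counting as free of charge, but confirm this against their current pricing. For high-frequency applications, cache the count for a given document string (content hash as cache key) rather than re-counting identical text on every request.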
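
One way to implement such a cache is to wrap the counter in a function that keys results by a SHA-256 hash of the text. The sketch below assumes Node.js (it uses `node:crypto`) and an in-memory `Map`; `withCountCache` and the `Counter` type are hypothetical names, not part of the SDK.

```typescript
import { createHash } from 'node:crypto';

type Counter = (text: string) => Promise<number>;

// Wrap any async counter with an in-memory cache keyed by a hash of the text.
// Counts differ between models, so use one wrapped counter per model.
function withCountCache(counter: Counter): Counter {
  const cache = new Map<string, Promise<number>>();
  return (text) => {
    const key = createHash('sha256').update(text).digest('hex');
    let hit = cache.get(key);
    if (!hit) {
      hit = counter(text);
      cache.set(key, hit);
      // Drop failed lookups so a transient error is not cached forever.
      hit.catch(() => cache.delete(key));
    }
    return hit;
  };
}
```

With the SDK, the wrapped counter would be something like `withCountCache((text) => countTokens({ model, apiKey, input: text }))`. Caching the promise rather than the resolved number means concurrent requests for the same text share a single in-flight call.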

Tokenizer drift across model versions. OpenAI’s tiktoken uses a fixed BPE vocabulary per model family. Switching between families, for example from gpt-4 (cl100k_base) to gpt-4o (o200k_base), changes the vocabulary, so counts for the same text are not identical. Always count against the model you will actually use.

System prompt tokens. If you use a fixed system prompt on every call, count it once at startup and add that constant to each prompt count rather than re-counting it on every turn. The system prompt tokens are included in the provider’s reported usage.inputTokens on each response.

Next steps: