Count input tokens
What you will achieve
Count the tokens in a prompt before sending it, using one countTokens() call that
automatically uses the correct counting method for each provider.
When and why you need this
You need token counts to:
- Enforce prompt budgets — bail out before sending a prompt that would exceed the model’s context window or your cost budget.
- Select models dynamically — choose a larger-context model when a prompt is long.
- Estimate cost — multiply by per-token price before committing to the request.
- Chunk documents — split inputs into pieces that fit within a model’s limit.
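As an illustration of the chunking use case, here is a minimal sketch that greedily packs paragraphs into chunks under a token limit. The names splitIntoChunks and approxTokens are hypothetical and not part of the SDK, and the counter is a rough chars / 4 stand-in. A real pipeline would likely plug in a counter backed by countTokens(), which is async and would need awaiting.

```typescript
// Hypothetical helper names; not part of @combycode/llm-sdk.
type TokenCounter = (text: string) => number;

// Rough stand-in counter: about 4 characters per token.
const approxTokens: TokenCounter = (text) => Math.ceil(text.length / 4);

// Greedily pack paragraphs into chunks whose counted size stays within maxTokens.
// A single paragraph larger than maxTokens becomes its own (oversized) chunk.
function splitIntoChunks(
  text: string,
  maxTokens: number,
  count: TokenCounter = approxTokens,
): string[] {
  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && count(candidate) > maxTokens) {
      // Adding this paragraph would overflow, so close the current chunk.
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Three paragraphs of 40 characters each (about 10 tokens apiece).
const doc = ['a'.repeat(40), 'b'.repeat(40), 'c'.repeat(40)].join('\n\n');
console.log(splitIntoChunks(doc, 25).length); // 2
```

Splitting on paragraph boundaries keeps each chunk readable on its own; a stricter splitter would also break up paragraphs that exceed the limit by themselves.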
Each provider counts tokens differently:
- OpenAI uses tiktoken, a local BPE tokeniser (no network call, instantaneous).
- Anthropic exposes a messages.countTokens beta API endpoint (network call, ~100ms).
- Google exposes models.countTokens (network call).
- xAI and OpenRouter have no count API, so counting falls back to a character heuristic.
Setting up each one manually requires separate packages, separate API calls, and separate error handling.
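To make the per-provider split concrete, the following sketch maps the provider prefix of a model string to the counting method listed above. pickCountingMethod and the CountingMethod type are hypothetical names used for illustration, and treating every unknown provider as a heuristic case is an assumption; the SDK's internal dispatch may be organised differently.

```typescript
// Hypothetical illustration; the SDK's real dispatch logic is not shown in this guide.
type CountingMethod =
  | 'local-tiktoken'
  | 'anthropic-count-api'
  | 'google-count-api'
  | 'char-heuristic';

function pickCountingMethod(model: string): CountingMethod {
  // Model strings look like "provider/model-name".
  const provider = model.split('/')[0];
  switch (provider) {
    case 'openai':
      return 'local-tiktoken';
    case 'anthropic':
      return 'anthropic-count-api';
    case 'google':
      return 'google-count-api';
    default:
      // Assumption: xAI, OpenRouter, and any unrecognised prefix use the heuristic.
      return 'char-heuristic';
  }
}

console.log(pickCountingMethod('openai/gpt-4o-mini')); // local-tiktoken
console.log(pickCountingMethod('xai/example-model')); // char-heuristic ("example-model" is a placeholder)
```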
Step by step
Step 1: Count a plain text prompt

```typescript
import { countTokens } from '@combycode/llm-sdk';

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: 'The quick brown fox jumps over the lazy dog.',
});

console.log(n); // e.g. 10 (varies by model tokenizer)
```

countTokens() returns a plain number. The call is async because Anthropic and
Google require network round-trips; for OpenAI the count is computed locally, so the
promise resolves without any network wait.
Step 2: Count a multi-turn conversation
Pass a Message[] array to count the tokens for the entire conversation, including
role delimiters and turn boundaries that the model tokeniser adds:

```typescript
import { countTokens, type Message } from '@combycode/llm-sdk';

const messages: Message[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
  { role: 'assistant', content: 'The capital of France is Paris.' },
  { role: 'user', content: 'And Germany?' },
];

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: messages,
});

console.log(`Conversation is ${n} tokens`);
```

Step 3: Use token count to enforce a budget
```typescript
import { complete, countTokens } from '@combycode/llm-sdk';

// myLongDocumentText is assumed to be a string defined elsewhere in your code.
const MAX_INPUT_TOKENS = 4000;

const n = await countTokens({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  input: myLongDocumentText,
});

if (n > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too long: ${n} tokens, limit is ${MAX_INPUT_TOKENS}`);
}

// Safe to send
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: myLongDocumentText,
  maxTokens: 512,
});
```

Step 4: Use token count for dynamic model selection
When you do not know ahead of time whether the input is better served by a small or a large model:

```typescript
import { countTokens, complete } from '@combycode/llm-sdk';

// largeDocumentText is assumed to be a string defined elsewhere in your code.
const n = await countTokens({
  model: 'openai/gpt-4o-mini', // Use a small model's tokenizer for estimation
  apiKey: process.env.OPENAI_KEY,
  input: largeDocumentText,
});

// gpt-4o-mini has 128k context; gpt-4o has 128k too but costs more.
// For documents > 32k tokens, route to the full 4o for better coherence:
const model = n > 32_000 ? 'openai/gpt-4o' : 'openai/gpt-4o-mini';

const { text } = await complete({
  model,
  apiKey: process.env.OPENAI_KEY,
  prompt: largeDocumentText,
  maxTokens: 1024,
});
```

Step 5: Use history’s built-in token estimate
For a ConversationHistory managed by an AgentLoop, the history object tracks
token estimates across turns without extra calls:

```typescript
import { ConversationHistory } from '@combycode/llm-sdk';

const history = new ConversationHistory();
// ... after several turns of conversation ...

const estimated = history.estimatedTokens();
console.log(`History is roughly ${estimated} tokens`);
```

estimatedTokens() anchors on the exact inputTokens reported by the provider in the most
recent response, then adds estimates for any new messages appended since.
This is accurate to within 5-10% for English text and requires no network call.
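The anchoring idea can be sketched in a few lines. The class below is a hypothetical illustration, not the SDK's ConversationHistory: it assumes the unreported tail is estimated with a chars / 4 heuristic, which may differ from what the SDK actually does.

```typescript
// Hypothetical sketch of the anchoring idea; not the SDK's ConversationHistory.
class AnchoredTokenEstimate {
  private anchorTokens = 0; // last exact inputTokens reported by the provider
  private pendingChars = 0; // characters appended since that report

  // Call when a response arrives with an exact usage figure.
  recordUsage(inputTokens: number): void {
    this.anchorTokens = inputTokens;
    this.pendingChars = 0;
  }

  // Call for every message appended after the last response.
  append(content: string): void {
    this.pendingChars += content.length;
  }

  // Exact anchor plus a chars / 4 guess for the unreported tail.
  estimatedTokens(): number {
    return this.anchorTokens + Math.ceil(this.pendingChars / 4);
  }
}

const estimate = new AnchoredTokenEstimate();
estimate.recordUsage(120);   // provider reported 120 input tokens
estimate.append('abcdefgh'); // 8 new characters, about 2 tokens
console.log(estimate.estimatedTokens()); // 122
```

Because only the messages added since the last response are guessed, the error stays confined to a small tail instead of accumulating over the whole conversation.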
Your options
countTokens() accepts:
| Option | Type | Notes |
|---|---|---|
| model | string | Required. Determines which counter to use. |
| apiKey | string | Required for Anthropic and Google (network call). Can be omitted for OpenAI (local tiktoken). |
| input | string \| Message[] | Required. The text or message array to count. |
Counting method per provider:
| Provider | Method | Network? | Accuracy |
|---|---|---|---|
| openai/... | Local tiktoken encoder | No | Exact for most GPT models |
| anthropic/... | messages.countTokens beta endpoint | Yes (~100ms) | Exact |
| google/... | models.countTokens API | Yes (~150ms) | Exact |
| xai/... | Character heuristic (chars / 4) | No | Approximate (+/- 20%) |
| openrouter/... | Character heuristic (chars / 4) | No | Approximate (+/- 20%) |
For providers that use the heuristic: the character-based estimate is fast and good enough for budget enforcement, provided you leave a safety margin. Add a 20-25% buffer when using it as a hard cut-off.
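A minimal sketch of that buffered check, assuming the chars / 4 heuristic from the table above. heuristicTokens and fitsWithBuffer are hypothetical helper names, not SDK exports.

```typescript
// Hypothetical helpers illustrating the chars / 4 heuristic with a safety buffer.
function heuristicTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Inflate the estimate by `buffer` (0.25 = 25%) before comparing to the hard limit.
function fitsWithBuffer(text: string, limit: number, buffer = 0.25): boolean {
  return heuristicTokens(text) * (1 + buffer) <= limit;
}

const samplePrompt = 'x'.repeat(4000); // 1000 tokens by the heuristic

console.log(heuristicTokens(samplePrompt));      // 1000
console.log(fitsWithBuffer(samplePrompt, 1200)); // false: 1250 > 1200
console.log(fitsWithBuffer(samplePrompt, 1300)); // true: 1250 <= 1300
```

Inflating the estimate instead of shrinking the limit keeps the real limit visible in calling code.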
When to use countTokens() vs history.estimatedTokens():
Use countTokens() when you have a discrete piece of text (a new document, a user
message) and need an accurate count before attaching it to anything. Use
history.estimatedTokens() when you have a live ConversationHistory and want to
know if the conversation is approaching a context limit — it is cheaper because it
leverages already-received usage data from the provider.
Compare the SDKs
The structural difference: official SDKs each require different setup. OpenAI’s
tiktoken is a separate npm package you install and load a model encoding from.
Anthropic’s count API takes the same shape as messages.create. Google’s count API
takes a GenerateContentRequest. None of these share an interface. ORXA’s
countTokens() is one async function — you pass a model string, it selects the right
method automatically.
Gotchas and next steps
Anthropic and Google count calls are network requests. Each countTokens() call to Anthropic
or Google adds a round-trip and counts against that provider’s rate limits; check current
provider pricing rather than assuming the calls are billed or free. For high-frequency
applications, cache the count for a given document string (content hash as cache key)
rather than re-counting identical text on every request.
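One way to sketch such a cache is shown below. withCountCache, CountFn, and fnv1a are hypothetical names, and CountFn merely stands in for a call such as countTokens(). The hash is FNV-1a over UTF-16 code units, chosen only to keep the example dependency-free; a production cache might prefer SHA-256 to rule out collisions.

```typescript
// Hypothetical caching wrapper; `CountFn` stands in for a call such as countTokens().
type CountFn = (model: string, text: string) => Promise<number>;

// FNV-1a over UTF-16 code units: small and non-cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

function withCountCache(count: CountFn): CountFn {
  const cache = new Map<string, Promise<number>>();
  return (model, text) => {
    // Key on the model too: the same text tokenises differently across models.
    const key = `${model}:${text.length}:${fnv1a(text)}`;
    let pending = cache.get(key);
    if (!pending) {
      pending = count(model, text);
      // Do not keep failed lookups around; let the next call retry.
      pending.catch(() => {
        cache.delete(key);
      });
      cache.set(key, pending);
    }
    return pending;
  };
}

// Usage with a fake counter that records how often it is actually invoked.
let upstreamCalls = 0;
const cachedCount = withCountCache(async (_model, text) => {
  upstreamCalls += 1;
  return Math.ceil(text.length / 4);
});

cachedCount('anthropic/example-model', 'same document');
cachedCount('anthropic/example-model', 'same document');
console.log(upstreamCalls); // 1
```

Caching the promise instead of the resolved number means concurrent requests for the same text share a single in-flight call.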
Tokenizer drift across model versions. OpenAI’s tiktoken uses a fixed BPE
vocabulary per model family. Switching models can change the vocabulary (for example,
gpt-4 uses cl100k_base while gpt-4o uses o200k_base), so counts are not identical.
Always count against the model you will actually use.
System prompt tokens. If you use a fixed system prompt on every call, count it
once at startup and add that constant to each prompt count rather than re-counting it
on every turn. The system prompt tokens are included in the provider’s reported
usage.inputTokens on each response.
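The arithmetic behind that advice can be sketched as follows. The helper names and the numbers are illustrative assumptions, and the sketch also assumes the model's context window is shared between input and output, so space for the reply is reserved up front.

```typescript
// Hypothetical helper: how many input tokens remain once fixed costs are reserved.
interface BudgetInputs {
  contextWindow: number;      // model's total context size, in tokens
  systemPromptTokens: number; // counted once at startup
  maxOutputTokens: number;    // space reserved for the reply
}

function remainingInputBudget(b: BudgetInputs): number {
  return Math.max(0, b.contextWindow - b.systemPromptTokens - b.maxOutputTokens);
}

function fitsInContext(promptTokens: number, b: BudgetInputs): boolean {
  return promptTokens <= remainingInputBudget(b);
}

const budget: BudgetInputs = {
  contextWindow: 128_000,
  systemPromptTokens: 350, // illustrative value
  maxOutputTokens: 1_024,
};

console.log(remainingInputBudget(budget));   // 126626
console.log(fitsInContext(120_000, budget)); // true
```

Summing separately counted pieces can differ slightly from counting the full message array, since role delimiters are added per message, so treat the result as a close estimate.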
Next steps:
- Prompt caching — cache large stable prompts to save tokens
- Cost tracking — translate token counts into dollar estimates
- Quickstart — the call that consumes those tokens