Image input (vision)
What you will achieve
Send a small solid-red PNG with the prompt 'What color is this image? One word.' and assert that the response matches `/red/i` on OpenAI, Anthropic, and Google, using one call shape regardless of provider.
When and why you need this
Any task that asks the model to reason about visual content: reading a chart, describing a photo, identifying text in a screenshot, or analysing a UI mockup.
The challenge with raw provider SDKs is that image content blocks are completely different shapes:
- OpenAI Responses API wraps images as `{ type: 'input_image', image_url: 'data:image/png;base64,...' }` in the `input` array.
- Anthropic uses `{ type: 'image', source: { type: 'base64', media_type, data } }` inside a `content` array.
- Google uses `{ inlineData: { mimeType, data } }` as a `parts` entry in `contents`.
Each provider also expects base64 encoding done differently (data-URL prefix for some, raw string for others). With multiple images in one message the divergence multiplies.
attachments unifies all of this into one list of file paths, URLs, or bytes.
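To make the divergence concrete, here is a minimal sketch that hand-builds the three wire shapes from one base64 payload. The helper names (`toOpenAIResponses`, `toAnthropic`, `toGoogle`) and the `Base64Image` type are illustrative and not part of any SDK; the block shapes follow the providers' public request formats as summarised in this section.

```typescript
// One image payload: raw base64 (no "data:" prefix) plus its MIME type.
interface Base64Image {
  mimeType: string;
  data: string;
}

// OpenAI Responses API: the data URL is a plain string on an `input_image` item.
function toOpenAIResponses(img: Base64Image) {
  return {
    type: 'input_image' as const,
    image_url: `data:${img.mimeType};base64,${img.data}`,
  };
}

// Anthropic Messages API: raw base64 with a snake_case `media_type`.
function toAnthropic(img: Base64Image) {
  return {
    type: 'image' as const,
    source: { type: 'base64' as const, media_type: img.mimeType, data: img.data },
  };
}

// Google Gemini API: raw base64 inside an `inlineData` part.
function toGoogle(img: Base64Image) {
  return { inlineData: { mimeType: img.mimeType, data: img.data } };
}

// Truncated sample payload, for illustration only.
const img: Base64Image = { mimeType: 'image/png', data: 'iVBORw0KGgo=' };
console.log(toOpenAIResponses(img).image_url); // data:image/png;base64,iVBORw0KGgo=
```

Only OpenAI takes the data-URL prefix; the other two expect the raw base64 string and differ in field naming, which is the branching that `attachments` is meant to hide.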
Step by step
Step 1 — Send one image by file path
```ts
import { complete } from '@combycode/llm-sdk';

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What color is this image? One word.',
  attachments: ['./red.png'],
  maxTokens: 32,
});

console.log(text); // e.g. "Red"
```

`attachments` accepts a local file path. The SDK reads the file with `loadContent()`, detects the MIME type (`image/png`, `image/jpeg`, etc.) from the file extension and magic bytes, base64-encodes the bytes, and places the result into an `ImagePart` with a base64 `DataSource`. The provider adapter then translates that into the correct wire shape.
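The file-path pipeline described above can be sketched roughly as follows. This illustrates the steps only and is not the SDK's actual `loadContent()` implementation: `fileToImagePart` and `EXTENSION_MIME` are hypothetical names, and for brevity the sketch detects the MIME type from the extension alone.

```typescript
import { readFileSync } from 'fs';
import { extname } from 'path';
import { Buffer } from 'buffer';

// Extension lookup for the image types this page lists.
const EXTENSION_MIME: Record<string, string> = {
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.gif': 'image/gif',
  '.webp': 'image/webp',
};

// Hypothetical helper: read a file, guess its MIME type, base64-encode it.
// `read` is injectable so the function can be exercised without touching disk.
function fileToImagePart(
  path: string,
  read: (p: string) => Uint8Array = (p) => readFileSync(p),
) {
  const bytes = read(path);
  // Unknown extensions fall back to image/png, as documented on this page.
  const mimeType = EXTENSION_MIME[extname(path).toLowerCase()] ?? 'image/png';
  return {
    type: 'image' as const,
    source: {
      type: 'base64' as const,
      mimeType,
      data: Buffer.from(bytes).toString('base64'), // raw base64, no data: prefix
    },
  };
}
```

Calling `fileToImagePart('./red.png')` would yield an object in the `ImagePart` shape shown under "Your options", ready for a provider adapter to translate.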
Step 2 — Send an image by URL
```ts
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Describe this image in one sentence.',
  attachments: ['https://example.com/photo.jpg'],
  maxTokens: 128,
});
```

When the attachment string starts with `http://` or `https://`, `loadContent()` fetches the URL, reads the response bytes, detects the MIME type from the `Content-Type` header and/or the URL extension, and encodes the result as base64, so the same base64 `DataSource` reaches every provider. The provider never sees the URL itself; it always gets inline bytes.
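The fetch-and-inline behaviour could look roughly like the sketch below. It is an assumption about the general approach, not the SDK's code: `urlToBase64Source` is a hypothetical helper, it reads only the `Content-Type` header (no URL-extension fallback), and it relies on the global `fetch` available in recent Node and Bun releases.

```typescript
import { Buffer } from 'buffer';

// Hypothetical helper: download an image and turn it into a base64 source.
// `fetchImpl` is injectable so the function can be tested without a network.
async function urlToBase64Source(
  url: string,
  fetchImpl: (url: string) => Promise<Response> = fetch,
) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${res.status}`);
  }
  // Content-Type may carry parameters ("image/jpeg; charset=..."); keep the type only.
  const header = res.headers.get('content-type') ?? '';
  const mimeType = (header.split(';')[0] ?? '').trim() || 'image/png';
  const bytes = new Uint8Array(await res.arrayBuffer());
  return {
    type: 'base64' as const,
    mimeType,
    data: Buffer.from(bytes).toString('base64'),
  };
}
```

Because the result is an inline base64 source, every provider adapter receives the same input regardless of whether that provider could have fetched the URL itself.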
Step 3 — Send raw bytes
```ts
import { complete } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';

// Load the image into memory as raw bytes.
const imageBytes = new Uint8Array(readFileSync('./chart.png'));

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'What is the trend shown in this chart?',
  attachments: [imageBytes],
  maxTokens: 256,
});
```

Pass a `Uint8Array` when you already have the bytes in memory (from a canvas, upload buffer, etc.). The MIME type is detected from the magic bytes (PNG header, JPEG `FF D8 FF`, GIF, WebP/RIFF) and defaults to `image/png` if no signature matches.
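Magic-byte detection of this kind can be sketched in a few lines. The helper names `hasSignature` and `sniffImageMime` are hypothetical, and the SDK's own detection may check more formats or more bytes; the signatures themselves are the standard ones for PNG, JPEG, GIF, and WebP.

```typescript
// True when `bytes` contains the signature `sig` starting at `offset`.
function hasSignature(bytes: Uint8Array, sig: number[], offset = 0): boolean {
  return sig.every((b, i) => bytes[offset + i] === b);
}

// Hypothetical sniffing helper covering the four formats listed on this page.
function sniffImageMime(bytes: Uint8Array): string {
  // PNG: 89 "PNG" CR LF 1A LF
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) {
    return 'image/png';
  }
  // JPEG: FF D8 FF
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) {
    return 'image/jpeg';
  }
  // GIF: ASCII "GIF8" (followed by "7a" or "9a")
  if (hasSignature(bytes, [0x47, 0x49, 0x46, 0x38])) {
    return 'image/gif';
  }
  // WebP: "RIFF" at offset 0, a 4-byte size, then "WEBP" at offset 8
  if (
    hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return 'image/webp';
  }
  // Documented fallback when no signature matches.
  return 'image/png';
}
```

Inputs shorter than a signature simply fail the comparison and fall through to the `image/png` default.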
Step 4 — Send multiple images in one message
```ts
const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  prompt: 'Which of these two images shows a cat?',
  attachments: ['./image-a.jpg', './image-b.jpg'],
  maxTokens: 64,
});
```

`attachments` is a list; each entry resolves to one `ContentPart` in the user message. The SDK adds the text prompt as a `TextPart` first, then each image part in order. All three providers accept multi-image content, and the per-provider shape is handled internally.
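The ordering rule (text first, then images in list order) can be illustrated with a small sketch. The local types below only mirror the part shapes described on this page, and `buildUserContent` is a hypothetical helper, not an SDK export.

```typescript
// Local stand-ins for the SDK's part types, reduced to what the sketch needs.
type TextPart = { type: 'text'; text: string };
type ImagePart = {
  type: 'image';
  source: { type: 'base64'; mimeType: string; data: string };
};
type ContentPart = TextPart | ImagePart;

// Hypothetical helper: the prompt becomes the first part, images follow in order.
function buildUserContent(prompt: string, images: ImagePart[]): ContentPart[] {
  return [{ type: 'text', text: prompt }, ...images];
}

// Placeholder payloads; real calls would carry full base64 image data.
const imageA: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/jpeg', data: 'AAAA' },
};
const imageB: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/jpeg', data: 'BBBB' },
};

const content = buildUserContent('Which of these two images shows a cat?', [imageA, imageB]);
console.log(content.map((part) => part.type)); // [ 'text', 'image', 'image' ]
```

Keeping the order stable matters when the prompt refers to images by position ("the first image", "the second image").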
Step 5 — Build content parts manually for more control
When you need to set per-image detail or mix sources, build the content array yourself:

```ts
import { complete } from '@combycode/llm-sdk';
import type { ImagePart } from '@combycode/llm-sdk';
import { readFileSync } from 'fs';
import { Buffer } from 'buffer';

// Read the image and encode it as raw base64 (no data: prefix).
const raw = new Uint8Array(readFileSync('./diagram.png'));
const b64 = Buffer.from(raw).toString('base64');

const imagePart: ImagePart = {
  type: 'image',
  source: { type: 'base64', mimeType: 'image/png', data: b64 },
  detail: 'high', // OpenAI only
};

const { text } = await complete({
  model: process.env.LLM_MODEL!,
  apiKey: process.env.LLM_API_KEY,
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe the components in this architecture diagram.' },
        imagePart,
      ],
    },
  ],
  maxTokens: 512,
});
```

The manual path gives you access to `detail` (OpenAI only) and lets you mix `DataSource` variants in one message.
Your options
`ImagePart` shape (`ContentPart` of type `'image'`):

| Field | Type | Description |
|---|---|---|
| `type` | `'image'` | Discriminator. |
| `source` | `DataSource` | Where the bytes come from (see table below). |
| `detail` | `'auto' \| 'low' \| 'high'` | Optional. Controls tile-level resolution for OpenAI vision models. Ignored by Anthropic and Google. Defaults to `'auto'`. |
`DataSource` variants for images:

| `type` | Required fields | When to use |
|---|---|---|
| `'base64'` | `mimeType: string`, `data: string` (raw base64, no `data:` prefix) | Bytes you have in memory as a base64 string. Most common output of `loadContent()`. |
| `'url'` | `url: string` | A public URL. The SDK fetches and re-encodes before sending, so the provider never sees the URL. |
| `'buffer'` | `mimeType: string`, `data: Uint8Array` | Raw bytes in memory. MIME type is sniffed from magic bytes and the declared value is corrected if it mismatches. |
| `'file'` | `fileId: string` | A file already uploaded via the Files API. Provider translates to its own file-reference format. |
| `'path'` | `mimeType: string`, `path: string` | Local file path (Node/Bun only). The SDK reads and encodes the file. Use `attachments` instead for simpler calls. |
| `'provider_ref'` | `mimeType: string`, `refId: string` | An opaque provider-specific reference (e.g. a Google Files API URI). Passed through as `fileData`. |
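Read as TypeScript, the variants in the table form a discriminated union. The declaration below is reconstructed from the table and may differ from the SDK's real type; `loadingStrategy` is a hypothetical helper that shows how narrowing on `type` exposes each variant's fields.

```typescript
// Reconstructed from the table above; the SDK's actual definition may differ.
type DataSource =
  | { type: 'base64'; mimeType: string; data: string }
  | { type: 'url'; url: string }
  | { type: 'buffer'; mimeType: string; data: Uint8Array }
  | { type: 'file'; fileId: string }
  | { type: 'path'; mimeType: string; path: string }
  | { type: 'provider_ref'; mimeType: string; refId: string };

type LoadingStrategy = 'inline' | 'sdk-loads' | 'provider-resolves';

// Hypothetical helper: who is responsible for producing the bytes?
function loadingStrategy(src: DataSource): LoadingStrategy {
  switch (src.type) {
    case 'base64':
    case 'buffer':
      return 'inline'; // bytes are already in the message
    case 'url':
    case 'path':
      return 'sdk-loads'; // SDK fetches or reads, then encodes
    case 'file':
    case 'provider_ref':
      return 'provider-resolves'; // reference to a file held by the provider
  }
}

console.log(loadingStrategy({ type: 'url', url: 'https://example.com/photo.jpg' })); // sdk-loads
```

Because the `switch` covers every variant, the compiler flags any new `DataSource` type that is added without a matching case.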
`detail` trade-offs (OpenAI only):

| Value | Token cost | Quality | Use when |
|---|---|---|---|
| `'auto'` | Variable | Model decides | Default; suitable for most tasks. |
| `'low'` | Fixed low (~85 tokens) | Coarse (512x512 tile) | Fast queries where spatial precision is not needed (e.g. “is this a dog?”). |
| `'high'` | Variable (512x512 tiles) | Full resolution | Documents, diagrams, screenshots, any task requiring fine detail. |
MIME types auto-detected by the SDK:
| Extension / Magic | MIME type |
|---|---|
| `.png` / PNG header | `image/png` |
| `.jpg` / `.jpeg` / `FF D8 FF` | `image/jpeg` |
| `.gif` / `GIF8` (`GIF87a` or `GIF89a`) | `image/gif` |
| `.webp` / RIFF+WEBP | `image/webp` |
Any format not in this list falls back to image/png. Override explicitly with a buffer or base64 DataSource and set the correct mimeType.
Provider support for image input:
| Provider | Supported models | Notes |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o mini, o1, o3, gpt-4-turbo | Responses API: input_image block. Chat Completions: image_url block. |
| Anthropic | Claude 3+ (Haiku, Sonnet, Opus) | image block with base64 source. |
| Google | Gemini 1.5+, Gemini 2.0+ | inlineData or fileData part. |
Compare the SDKs
Every official SDK builds a different content block shape and uses a different base64 encoding convention. OpenAI’s Responses API wraps images in `input_image` items inside the `input` array; the Chat Completions path uses `image_url` with a data-URL string inside `content[].image_url.url`. Anthropic uses `source.type = 'base64'` with a `media_type` field. Google uses `inlineData.mimeType` / `inlineData.data`. ORXA resolves a single base64 `DataSource` into the correct shape per provider, so your code does not branch.
Gotchas and next steps
URLs are always fetched by the SDK, not passed through. OpenAI’s Responses API can accept raw URLs natively, but ORXA’s `url` `DataSource` still fetches and re-encodes, which ensures uniform behaviour across all providers. Pass a `base64` or `buffer` `DataSource` if you need to avoid the extra fetch.
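Large images cost more tokens. With `detail: 'high'`, OpenAI's published calculation for GPT-4o-class models first scales the image to fit within 2048x2048, then scales it down so the shortest side is 768 px, and finally counts 512x512 tiles at about 170 tokens each plus a fixed 85 tokens. A 2048x2048 image therefore ends up as 768x768: 4 tiles, or about 765 tokens of image overhead. Very wide or tall images produce more tiles. Use `detail: 'low'` for yes/no queries on large images.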
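For budgeting, the tile arithmetic can be written as a short estimator. This is a sketch that assumes OpenAI's published tile-based calculation for GPT-4o-class models (85 base tokens, 170 per 512x512 tile, with a downscale before tiling); other model families may be priced differently, and `estimateImageTokens` is a hypothetical helper rather than an SDK function. Note that under this formula the tile count is based on the resized dimensions, not the raw pixel size.

```typescript
// Rough estimate of OpenAI image tokens under the tile-based scheme.
// Assumption: constants and resize steps follow OpenAI's published
// calculation for GPT-4o-class models.
function estimateImageTokens(
  width: number,
  height: number,
  detail: 'low' | 'high',
): number {
  const BASE_TOKENS = 85;
  const TOKENS_PER_TILE = 170;

  // Low detail is a flat cost regardless of image size.
  if (detail === 'low') return BASE_TOKENS;

  // Step 1: scale down (never up) to fit inside a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: scale down (never up) so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // Step 3: count the 512x512 tiles needed to cover the resized image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(2048, 2048, 'high')); // 765
console.log(estimateImageTokens(2048, 4096, 'high')); // 1105
```

Treat the result as an estimate for planning `maxTokens` and cost, and rely on the usage figures returned by the provider for actual billing.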
Anthropic has a 5 MB per-image limit on base64-encoded size. For larger images, resize before sending.
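Because the limit applies to the encoded size, it can be checked from the raw byte length before encoding. The sketch below assumes "5 MB" means 5 × 1024 × 1024 bytes, which is an interpretation rather than something this page states; `base64Length` and `fitsAnthropicLimit` are hypothetical helpers.

```typescript
// Assumption: "5 MB" is interpreted as 5 * 1024 * 1024 bytes.
const ANTHROPIC_MAX_ENCODED_BYTES = 5 * 1024 * 1024;

// Base64 emits 4 characters for every 3 input bytes, padded to a multiple of 4.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// True when the image, once base64-encoded, stays within the limit.
function fitsAnthropicLimit(byteLength: number): boolean {
  return base64Length(byteLength) <= ANTHROPIC_MAX_ENCODED_BYTES;
}

console.log(base64Length(3_000_000)); // 4000000
```

Base64 inflates the payload by about one third, so under this interpretation the raw file has to stay below roughly 3.75 MiB to pass.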
GIF animation is not understood. All three providers receive only the first frame of a GIF (or the entire still image if it is not animated).
Next steps:
- PDF document input: same `attachments` API, `DocumentPart` shape
- Audio input: send audio files to a multimodal model
- File upload: persist a file server-side and reference it across calls