Monday, 29 June 2026 PDT | 03:03 PM
The 1 News Alt Logo Text Smart News for Global Indians

Pair Nova 2 Lite with Claude for cost

AI News June 30, 2026 02:00 AM
Pair Nova 2 Lite with Claude for cost

A scanned yearbook page contains 176 printed names, 4 portrait photographs, and zero machine-readable structure linking them. To digitize this page, you need reliable photo detection with bounding boxes and accurate name extraction. You also need a way to determine which name belongs to which face based on page layout.

In this post, we show how pairing Amazon Nova 2 Lite with Anthropic’s Claude Sonnet 4.6 delivers an efficient solution for digitizing scanned documents at scale. We built a two-model pipeline on Amazon Bedrock for digitizing scanned yearbook pages. Amazon Nova 2 Lite handles native multimodal extraction in a single call: detecting photos, extracting visible names with coordinates, and returning page-level metadata. Claude Sonnet 4.6 then performs spatial reasoning to match names to faces based on page layout.

We ran this pipeline against 336 scanned yearbook pages and produced 3,122 name-to-face associations, with 93 percent scoring at or above 0.95 confidence. This two-model approach costs about two-thirds less per page than a single-model alternative that sends the entire task to one vision-language model. See the Cost considerations section for the detailed breakdown.

The pipeline has two stages. Each stage uses a different model, chosen for the specific task it performs.

Figure 1. Two-model pipeline architecture. The scanned page image flows through two sequential stages. In stage 1, Amazon Nova 2 Lite performs native multimodal extraction in a single API call. It detects and classifies photos with bounding boxes, reads visible names on the page and returns their approximate positions, and emits page-level metadata. In stage 2, Claude Sonnet 4.6 performs spatial reasoning to match names to faces using the combined Nova output.

Amazon Nova 2 Lite runs first. Because it handles interleaved text and images natively, a single Converse call returns three things:

We set reasoning to LOW for this task by including a reasoning configuration in the Converse API call. See the reasoning block in the Step 1 code that follows. Testing across all 336 pages showed no meaningful accuracy difference between LOW, MEDIUM, and HIGH reasoning levels for this structured extraction, and LOW is the cheapest option. Nova exposes this setting through the reasoning_config field. Claude, in Step 2, uses a separate thinking field, so the two models control reasoning under different names.

Asking Nova 2 Lite only for names, not every OCR token on the page, is what keeps the first stage cheap. The downstream spatial reasoning step doesn’t need the full text of class rosters or event descriptions. It needs names that appear near photos. Constraining Nova output to names keeps the output-token cost at approximately 1,000 tokens per page instead of the approximated 4,500 tokens a full OCR pass would produce.

Claude Sonnet 4.6 enters only at stage 2 for the spatial reasoning step. Given Nova names-with-positions and photo bounding boxes, Claude determines which names correspond to which faces. This step requires handling page layout variability, because yearbook layouts vary from page to page. Captions might appear above or below photos, and some pages mix portrait grids with group shots. Claude adaptive thinking handles this variability without additional prompt engineering per layout type.

In this solution, Nova 2 Lite handles the high-volume extraction work natively, in one call. Claude is called once per page for the spatial reasoning step.

A recent change to how Amazon Nova 2 Lite bills image inputs makes per-page cost predictable at scale, which matters when you are processing hundreds of thousands of pages.

Fixed per-image pricing: Amazon Nova 2 Lite bills image and document page inputs at a fixed per-image rate, regardless of resolution or file size.

This change is significant for document processing pipelines. Previously, image token costs varied based on resolution, making it difficult to project per-page costs without running a proof-of-concept on representative samples. With fixed billing, every image Nova 2 Lite processes is billed at the same per-image rate, regardless of resolution.

For a full page extraction including prompt and output, the per-page cost breaks down as follows:

At published Nova 2 Lite input-token rates, image input is a small fraction of total per-page cost. For current rates, see the Amazon Bedrock pricing page.

For yearbook-scale workloads (hundreds of thousands of pages annually), this fixed pricing makes cost forecasting straightforward because image input cost scales linearly with page count and is independent of page resolution. No resolution normalization is required.

Adaptive thinking for spatial reasoning

Claude on Amazon Bedrock supports adaptive thinking, a feature where the model decides how much internal reasoning to apply based on the input complexity. You enable it by setting type to adaptive in the thinking configuration of the Converse API:

With adaptive thinking enabled, Claude adjusts its reasoning depth based on what it receives. A straightforward portrait grid with eight names neatly arranged above eight faces gets a direct response with minimal reasoning. A page where three group photos share a caption block and names appear in a sidebar triggers step-by-step spatial analysis.

In our 336-page run, Claude used extended reasoning on every page, with reasoning traces ranging from 544 to 1,658 characters. Even the simpler pages benefited from some spatial analysis because yearbook layouts are rarely perfectly uniform. The reasoning traces show Claude working through column alignment and vertical offsets between name positions and face positions, and checking caption proximity when group photos appear on the page.

For this type of structured spatial task, adaptive thinking gives you the right amount of reasoning per page without manual tuning. You don’t need to set a fixed token budget or write layout-specific prompts. The model reads the inputs and decides.

Cost note on adaptive thinking: When adaptive thinking is enabled, keep three cost factors in mind.

The full source code, sample images, and Jupyter notebook are available in the AWS Samples repository on GitHub.

Before running the pipeline, make sure that you have the following in place:

Send the scanned page to Amazon Nova 2 Lite with a prompt that requests both detected photos (with bounding boxes and classifications) and visible names (with approximate positions on the page). Nova native multimodal understanding returns both in a single Converse call.

Nova returns bounding boxes on a 0–1000 coordinate scale for both photos and names. Pass both directly into Step 2. Claude reads the same coordinate space when given in the prompt, so no conversion is needed.

The prompt instructs Nova to return a JSON object with both the photos and the names visible on the page.

Each photo gets a bounding box, a type (portrait, group, or candid), a category tag, and a short description. Each name gets its visible text and its bounding box on the page. The page_title and category fields also serve a second use case: metadata extraction. With one API call, Nova 2 Lite gives you photo detection, names-with-positions for the matching pipeline, and structured metadata. You can use this metadata for search indexing, filtering by event type, or building a table of contents across hundreds of pages.

Now pass Nova names-with-positions and photo bounding boxes to Claude for spatial reasoning. Both use the same 0–1000 coordinate space, so no normalization is needed:

On page 50 of our test set, Nova returned 176 name entries and 4 photo bounding boxes. Most of those names are roster and body text elsewhere on the page. Only the names adjacent to the 4 photos are matchable, so Claude produced 5 associations:

Each association includes a reasoning string that explains the spatial logic. This is useful for debugging pages where associations fail.

The final step applies confidence thresholds and fuzzy name matching (using rapidfuzz) to filter out low-quality associations. The pipeline writes two outputs per page: a JSON file with the association data and a visualization image showing lines drawn between matched names and faces.

We processed 336 scanned yearbook pages through this pipeline. The pipeline produced 3,122 name-to-face associations total, with 93.3 percent of those associations scoring confidence at or above 0.95. Only 0.3 percent fell below the 0.90 confidence threshold.

Figure 2. Confidence score distribution across the 3,122 name-to-face associations produced from 336 scanned yearbook pages. The distribution is heavily skewed toward high confidence: 2,912 associations (93.3 percent) scored at or above 0.95, 202 (6.5 percent) scored between 0.90 and 0.94, and only 8 (0.3 percent) fell below 0.90. Claude Sonnet 4.6 produced confidence scores during the spatial reasoning step. They reflect the model’s certainty that a given name token maps to a specific face bounding box.

The pipeline outputs a JSON report and a visualization for each page. The visualization draws colored lines from each name token to its matched face, which makes it easy to spot errors during manual review.

The two-model split means that you pay Nova 2 Lite rates for the combined extraction call (photos, names-with-positions, metadata) and Claude rates for one reasoning call per page. Under Nova 2 Lite’s fixed per-image pricing, the first stage cost is predictable and independent of input resolution. Claude reasoning step dominates per-page cost.

The following table compares the pipeline against a single-model approach where each page is sent to Claude to do all three tasks (OCR, photo detection, and spatial matching) in one call.

At 100,000 pages, that’s a ~$6,500 difference. Beyond cost, the split pipeline also gives you the ability to swap or tune each stage independently without touching the others. For example, you can upgrade only the reasoning model when a better Claude tier ships.

Exact per-page cost depends on three factors:

For a rough estimate, add one Nova 2 Lite Converse call with a single image input and a names-and-photos prompt. Then add one Claude Converse call where the input includes the image plus Nova JSON output. Check Amazon Bedrock pricing for current rates.

Actual cost for our 336-page test run

Based on our pipeline run (336 pages, average 10.9 associations per page):

Additional cost optimization levers

For high-volume or non-real-time workloads:

Each page goes through two sequential API calls, so total latency per page is the sum of both. In our test run, the Claude spatial reasoning step took the longest at roughly 20–30 seconds per page because of the image input and adaptive reasoning. Nova 2 Lite completed in a few seconds. For batch processing, you can parallelize across pages because each page is independent.

This pipeline uses serverless AWS services with no persistent infrastructure to manage:

In this post, we showed how splitting document processing across two models on Amazon Bedrock produced 3,122 name-to-face associations across 336 yearbook pages, with 93 percent at high confidence. Amazon Nova 2 Lite handled photo detection and name extraction in a single native-multimodal call. Claude Sonnet 4.6 handled spatial reasoning with adaptive thinking to match names to the right faces.

This pattern applies beyond yearbooks. Any document with photos and associated text (historical archives, personnel directories, real estate listings, product catalogs) needs the same capabilities: detecting visual elements, extracting the relevant text, and connecting the two through spatial reasoning. The solution we describe in this post combines photo detection, text extraction, and spatial reasoning to deliver both cost efficiency and high accuracy.

The full notebook and sample outputs are available in the AWS Samples repository on GitHub.

To learn more about the services used in this post, see the following resources: