Introduction

The extract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. It is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, and feeding the results into workflows like data entry automation, analytics, or database and application integration. The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Use Retab’s extract method to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
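For step 1, if you already maintain Pydantic models, one convenient way to produce the JSON schema is Pydantic's model_json_schema(); a minimal sketch:

from pydantic import BaseModel, Field

class CalendarEvent(BaseModel):
    name: str = Field(description="The name of the calendar event.")
    date: str = Field(description="The date of the calendar event in ISO 8601 format.")

# Pydantic emits a JSON schema dict that can be passed to extract()
json_schema = CalendarEvent.model_json_schema()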
For advanced validation or post-processing, we recommend combining this with schema validation libraries like Pydantic (Python) or Zod (JavaScript) to ensure data integrity.

Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides confidence scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.

Extract API

ExtractRequest

Returns

A ParsedChatCompletion object with the extracted data, usage details, and confidence scores.
from retab import Retab

client = Retab()

# For streaming, use: doc_msg = client.documents.extractions.stream(...)
doc_msg = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1  # 1 disables consensus (default); values > 1 run n-consensus mode
)

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.
from retab import Retab
from pydantic import BaseModel, ValidationError

client = Retab()

# Define Pydantic model matching the schema for validation
class CalendarEvent(BaseModel):
    name: str
    date: str  # ISO 8601

# Extract data
result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1
)

# Access extracted data
extracted_data = result.choices[0].message.parsed
likelihoods = result.likelihoods

# Validate with Pydantic
try:
    event = CalendarEvent(**extracted_data)
    print(f"Extracted Event: {event.name} on {event.date}")
    
    # Check confidence
    if all(score > 0.7 for score in likelihoods.values()):
        print("High confidence extraction - Saving to DB...")
        # db.save(event)  # Pseudo-code for DB integration
    else:
        print("Low confidence - Review manually")
except ValidationError as e:
    print(f"Validation failed: {e}")

print(f"Processed with {result.usage.total_tokens} tokens")

Use Case: Using Additional Messages for Context

Use additional_messages to provide extra context or specific instructions that help guide the extraction. This is useful when you need to clarify ambiguous fields, provide domain-specific knowledge, or correct the model’s behavior.
from retab import Retab

client = Retab()

# Extract invoice data with additional context
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    model="gpt-4.1-nano",
    json_schema={
        'properties': {
            'vendor_name': {'type': 'string', 'description': 'Name of the vendor'},
            'invoice_number': {'type': 'string', 'description': 'Invoice number'},
            'total_amount': {'type': 'number', 'description': 'Total amount due'},
            'currency': {'type': 'string', 'description': 'Currency code (e.g., USD, EUR)'}
        },
        'required': ['vendor_name', 'invoice_number', 'total_amount'],
        'type': 'object'
    },
    additional_messages=[
        {
            "role": "user", 
            "content": "Note: This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.example.com/logo.png"
                    }
                }
            ]
        }
    ]
)

print(result.choices[0].message.parsed)

Best Practices

Model Selection

  • gpt-4.1-nano: Balanced for accuracy and cost, recommended for most extraction tasks.
  • gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.
  • gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Only include the fields you actually need; smaller schemas improve extraction accuracy.
  • Use descriptive description fields: Helps the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.

Confidence Handling

  • Set a threshold (e.g., 0.7) for automated processing.
  • For critical tasks, enable n_consensus > 1 to average results and boost reliability, as sketched below.
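
A minimal sketch combining both practices; my_schema is a placeholder, and likelihoods is assumed to be the per-field score mapping shown in the use case above:

result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema=my_schema,  # placeholder: your schema
    n_consensus=3,          # three runs reconciled for higher reliability
)

# Gate automated processing on the per-field confidence scores
THRESHOLD = 0.7
flagged = {field: score for field, score in result.likelihoods.items() if score < THRESHOLD}
if flagged:
    print(f"Fields below threshold, route to manual review: {flagged}")
else:
    print("High confidence on all fields, safe to process automatically")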

Modality Choice

  • Text: For clean, text-heavy documents.
  • Native: For PDFs or images with layouts preserved.
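
For example, a scanned PDF where layout matters could be processed in native mode; a minimal sketch, where the file path and my_schema are placeholders:

result = client.documents.extract(
    document="invoices/scanned_invoice.pdf",  # hypothetical path
    model="gpt-4.1-nano",
    json_schema=my_schema,
    modality="native",  # preserve the original layout
)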

Using Additional Messages

  • Use additional_messages to provide domain-specific context or clarifications.
  • Simulate a conversation: Add a user message with context, then an assistant acknowledgment to prime the model (sketched at the end of this section).
  • Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
  • Keep messages concise to avoid diluting the extraction focus.
  • You can also use the create_messages API to convert additional documents into chat messages, then pass them as additional_messages for multi-document context:
# Create messages from a reference document
ref_messages = client.documents.create_messages(document="reference/template.pdf")

# Use those messages as additional context for extraction
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    json_schema=my_schema,
    model="gpt-4.1-nano",
    additional_messages=ref_messages.messages
)
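
Finally, a minimal sketch of the conversation-priming pattern described above; the document path and my_schema are placeholders:

# Prime the model with context plus a simulated assistant acknowledgment
result = client.documents.extract(
    document="invoices/invoice_002.pdf",  # hypothetical path
    json_schema=my_schema,
    model="gpt-4.1-nano",
    additional_messages=[
        {"role": "user", "content": "Dates on this supplier's invoices use DD/MM/YYYY."},
        {"role": "assistant", "content": "Understood. I will read dates as DD/MM/YYYY and return them in ISO 8601."},
    ],
)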