Introduction

The extract method in Retab’s document processing pipeline uses AI models to extract structured data from a document according to a JSON schema you provide. It is suited to automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, for use in data entry automation, analytics, or integration with databases and applications.

The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Use Retab’s extract method to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
The SDKs already cover part of the validation step:
  • Python parses message.parsed against the JSON schema you pass in
  • Node accepts a JSON schema object, a schema file path, or a zod schema directly
You can still add your own application-level validation if you want stricter business rules.

Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides likelihood scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.
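As an illustration of the application-level validation mentioned above, here is a minimal sketch of business-rule checks run on an already-extracted record. It assumes the record is a plain dict with the name and date fields of the CalendarEvent schema used later on this page; validate_event is a hypothetical helper, not part of the Retab SDK.

```python
from datetime import datetime


def validate_event(event: dict) -> list[str]:
    """Return a list of business-rule violations (an empty list means the record passed)."""
    errors = []

    # Rule 1: the event name must be a non-blank string.
    name = event.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")

    # Rule 2: the date must parse as an ISO 8601 date or datetime.
    try:
        datetime.fromisoformat(event.get("date"))
    except (TypeError, ValueError):
        errors.append("date must be an ISO 8601 date or datetime")

    return errors


print(validate_event({"name": "Kickoff", "date": "2024-03-05"}))
print(validate_event({"name": " ", "date": "next Tuesday"}))
```

Returning a list of violations rather than raising on the first one makes it easier to show a reviewer everything that is wrong with a record in one pass.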

Extract API

ExtractRequest
Returns
RetabParsedChatCompletion
A RetabParsedChatCompletion with the extracted data, usage details, likelihood scores, and a persisted extraction_id.
from retab import Retab

client = Retab()

# Use client.documents.extract_stream(...) for stream mode
doc_msg = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="retab-micro",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    # 1 (the default) disables consensus; a value greater than 1 runs the extraction in n-consensus mode
    n_consensus=1
)

print(doc_msg.data)
print(doc_msg.likelihoods)
print(doc_msg.extraction_id)

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate likelihood scores before saving to a database.
from retab import Retab

client = Retab()

# Extract data
result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="retab-micro",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    n_consensus=1
)

# Python SDK parses against the schema automatically
extracted_data = result.data
likelihoods = result.likelihoods or {}

if extracted_data is None:
    print("Schema validation failed")
else:
    print(f"Extracted Event: {extracted_data.name} on {extracted_data.date}")

    if all(score > 0.7 for score in likelihoods.values()):
        print("High likelihood extraction - Saving to DB...")
    else:
        print("Low likelihood - Review manually")

print(f"Processed with {result.usage.total_tokens} tokens")
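When an extraction falls below the threshold, it can help to know which fields caused it. The sketch below assumes, as the example above does, that likelihoods is a flat dict mapping field names to numeric scores; low_likelihood_fields is a hypothetical helper, and nested schemas would need a recursive variant.

```python
def low_likelihood_fields(likelihoods: dict, threshold: float = 0.7) -> list[str]:
    """Return the names of fields whose score is not strictly above the threshold."""
    flagged = []
    for field, score in likelihoods.items():
        # Non-numeric entries are flagged too, so that they get a manual look.
        if not isinstance(score, (int, float)) or score <= threshold:
            flagged.append(field)
    return sorted(flagged)


needs_review = low_likelihood_fields({"name": 0.95, "date": 0.55})
print(needs_review)
```

A reviewer can then be shown only the flagged fields instead of the whole record.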

Use Case: Using Additional Messages for Context

Use additional_messages to provide extra context or specific instructions that help guide the extraction. This is useful when you need to clarify ambiguous fields, provide domain-specific knowledge, or correct the model’s behavior.
from retab import Retab

client = Retab()

# Extract invoice data with additional context
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    model="retab-micro",
    json_schema={
        'properties': {
            'vendor_name': {'type': 'string', 'description': 'Name of the vendor'},
            'invoice_number': {'type': 'string', 'description': 'Invoice number'},
            'total_amount': {'type': 'number', 'description': 'Total amount due'},
            'currency': {'type': 'string', 'description': 'Currency code (e.g., USD, EUR)'}
        },
        'required': ['vendor_name', 'invoice_number', 'total_amount'],
        'type': 'object'
    },
    additional_messages=[
        {
            "role": "developer",
            "content": "Extract values exactly as written. Do not infer missing currency."
        },
        {
            "role": "user",
            "content": "Note: This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.example.com/logo.png",
                        "detail": "auto"
                    }
                }
            ]
        }
    ]
)

print(result.data)
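The message list above mixes an extraction policy with document-specific notes. If you assemble such lists in several places, a small builder keeps the roles consistent. This is only a sketch: build_additional_messages is a hypothetical helper, and it covers plain-text messages only, not image parts.

```python
def build_additional_messages(policy: str, notes: list[str]) -> list[dict]:
    """Build a message list: one developer message for policy, one user message per note."""
    messages = [{"role": "developer", "content": policy}]
    for note in notes:
        # Skip blank notes so the context stays short and focused.
        if note.strip():
            messages.append({"role": "user", "content": note})
    return messages


messages = build_additional_messages(
    "Extract values exactly as written. Do not infer missing currency.",
    ["Note: This invoice is from our European supplier.", "   "],
)
print(messages)
```

The resulting list has the same shape as the additional_messages argument shown above.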

Metadata Filters

Use metadata at extraction time to tag each document with stable identifiers, then filter those extractions later using the same keys. This is especially important in multi-tenant systems: always include stable tenant-scoping metadata keys to avoid cross-tenant contamination.

1. Tag the extraction request with metadata

from retab import Retab

client = Retab()

result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="retab-small",
    json_schema=my_schema,  # a JSON schema dict defined elsewhere in your code
    metadata={
        "organization_id": "org_123",
        "source": "sinari_jobs",
        "project_id": "project_abc",
    },
)

2. Filter extractions by metadata

from retab import Retab

client = Retab()

tenant_extractions = client.extractions.list(
    limit=50,
    metadata={
        "organization_id": "org_123",
        "source": "sinari_jobs",
    },
)
When multiple metadata keys are provided, all keys are applied together as exact-match filters.
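To make the exact-match semantics concrete, the sketch below reproduces them locally on in-memory records. It illustrates the filtering rule only and is not how the API is implemented; the record ids and the matches_metadata helper are hypothetical.

```python
def matches_metadata(record_metadata: dict, filters: dict) -> bool:
    """True only if every filter key is present and equal to the requested value."""
    return all(
        key in record_metadata and record_metadata[key] == value
        for key, value in filters.items()
    )


# Hypothetical stored extractions with their metadata.
records = [
    {"id": "ext_1", "metadata": {"organization_id": "org_123", "source": "sinari_jobs", "project_id": "project_abc"}},
    {"id": "ext_2", "metadata": {"organization_id": "org_123", "source": "email"}},
    {"id": "ext_3", "metadata": {"organization_id": "org_456", "source": "sinari_jobs"}},
]

filters = {"organization_id": "org_123", "source": "sinari_jobs"}
matching_ids = [r["id"] for r in records if matches_metadata(r["metadata"], filters)]
print(matching_ids)  # ['ext_1']
```

Extra keys on a record (project_id here) do not prevent a match, while a record from another organization is excluded even though its source matches.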

Best Practices

Model Selection

  • retab-large: Use for complex documents requiring deep contextual understanding.
  • retab-small: Balanced for accuracy and cost, recommended for most extraction tasks.
  • retab-micro: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Include only the fields you actually need, which improves extraction accuracy.
  • Write descriptive description fields: They help the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.
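The schema guidelines above can be checked mechanically before a schema is sent. The sketch below flags top-level properties that lack a description. The freight field names are hypothetical, and properties_missing_description is an illustrative helper that does not descend into nested objects.

```python
def properties_missing_description(schema: dict) -> list[str]:
    """Return the top-level property names that have no non-blank description."""
    properties = schema.get("properties", {})
    return sorted(
        name
        for name, spec in properties.items()
        if not str(spec.get("description", "")).strip()
    )


# Hypothetical freight schema following the guidelines above.
freight_schema = {
    "X-SystemPrompt": "Focus on freight details.",
    "type": "object",
    "properties": {
        "booking_number": {"type": "string", "description": "Carrier booking reference."},
        "port_of_loading": {"type": "string"},  # description missing on purpose
    },
    "required": ["booking_number"],
}

print(properties_missing_description(freight_schema))  # ['port_of_loading']
```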

Confidence Handling

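  • Set a likelihood threshold (e.g., 0.7): extractions above it can be processed automatically, and the rest go to manual review.
  • For critical tasks, set n_consensus to a value greater than 1 to average results across multiple runs and boost reliability.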
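To build intuition for why several runs help, the sketch below reconciles a few hypothetical extraction results with a per-field majority vote. This is not Retab's consensus algorithm, which runs server-side when n_consensus is greater than 1. It only illustrates the general idea, assuming flat records with hashable values.

```python
from collections import Counter


def majority_vote(runs: list[dict]) -> tuple[dict, dict]:
    """Return (most common value per field, share of runs that agree with it)."""
    consensus, agreement = {}, {}
    fields = {field for run in runs for field in run}
    for field in fields:
        values = [run.get(field) for run in runs]
        winner, count = Counter(values).most_common(1)[0]
        consensus[field] = winner
        agreement[field] = count / len(runs)
    return consensus, agreement


# Three hypothetical runs over the same document; one disagrees on the date.
runs = [
    {"name": "Kickoff", "date": "2024-03-05"},
    {"name": "Kickoff", "date": "2024-03-05"},
    {"name": "Kickoff", "date": "2024-05-03"},
]
consensus, agreement = majority_vote(runs)
print(consensus["date"], round(agreement["date"], 2))
```

Fields where the runs disagree end up with a lower agreement ratio, which plays a role similar to a low likelihood score when deciding what to review.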

Using Additional Messages

  • Use additional_messages to provide domain-specific context or clarifications.
  • Prefer developer or system messages for extraction policy and user messages for document-specific context.
  • Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
  • Keep messages concise to avoid diluting the extraction focus.
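from retab import Retab

client = Retab()

# Create messages from a reference document (that step is not shown here),
# or write them inline as below.
# my_schema is a JSON schema dict defined elsewhere in your code.

# Use those messages as additional context for extraction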
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    json_schema=my_schema,
    model="retab-small",
    additional_messages=[{
        "role": "user",
        "content": "Note: This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise."
    }]
)