Introduction

The extract method in Retab’s document processing pipeline uses AI models to extract structured data from any document based on a provided JSON schema. It is ideal for automating data extraction tasks, such as pulling key information from invoices, forms, receipts, images, or scanned documents, and feeding the results into workflows like data entry automation, analytics, or database and application integration. The typical extraction workflow follows these steps:
  1. Schema Definition: Define a JSON schema that describes the structure of the data you want to extract.
  2. Extraction: Use Retab’s extract method to process the document and retrieve structured data.
  3. Validation & Usage: Validate the extracted data (optionally using likelihood scores) and integrate it into your application.
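For step 1, if you already maintain Pydantic models, one convenient way to produce the JSON schema is Pydantic's model_json_schema(); a minimal sketch:

from pydantic import BaseModel, Field

class CalendarEvent(BaseModel):
    name: str = Field(description="The name of the calendar event.")
    date: str = Field(description="The date of the calendar event in ISO 8601 format.")

# Pydantic emits a JSON schema dict that can be passed to extract()
json_schema = CalendarEvent.model_json_schema()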
For advanced validation or post-processing, we recommend combining this with schema validation libraries like Pydantic (Python) or Zod (JavaScript) to ensure data integrity.

Unlike the parse method, which focuses on raw text extraction, extract provides:
  • Structured Output: Data extracted directly into JSON format matching your schema.
  • AI-Powered Inference: Handles complex layouts, handwritten text, and contextual understanding.
  • Modality Support: Works with text, images, or native document formats.
  • Consensus Mode: Optional multi-run consensus for higher accuracy on ambiguous documents.
  • Likelihood Scores: Provides confidence scores for each extracted field.
  • Batch Processing Ready: Efficient for high-volume extraction tasks.

Extract API

ExtractRequest

Returns

A ParsedChatCompletion object with the extracted data, usage details, and confidence scores.
from retab import Retab

client = Retab()

# For streaming, use: doc_msg = client.documents.extractions.stream(...)
doc_msg = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1  # 1 disables consensus (default); values > 1 run n-consensus mode
)

Use Case: Extracting Event Information from Documents

Extract structured calendar event data from a booking confirmation image and validate confidence scores before saving to a database.
from retab import Retab
from pydantic import BaseModel, ValidationError

client = Retab()

# Define Pydantic model matching the schema for validation
class CalendarEvent(BaseModel):
    name: str
    date: str  # ISO 8601

# Extract data
result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema={
        'X-SystemPrompt': 'You are a useful assistant.',
        'properties': {
            'name': {
                'description': 'The name of the calendar event.',
                'title': 'Name',
                'type': 'string'
            },
            'date': {
                'description': 'The date of the calendar event in ISO 8601 format.',
                'title': 'Date',
                'type': 'string'
            }
        },
        'required': ['name', 'date'],
        'title': 'CalendarEvent',
        'type': 'object'
    },
    modality="text",
    n_consensus=1
)

# Access extracted data
extracted_data = result.choices[0].message.parsed
likelihoods = result.likelihoods

# Validate with Pydantic
try:
    event = CalendarEvent(**extracted_data)
    print(f"Extracted Event: {event.name} on {event.date}")
    
    # Check confidence
    if all(score > 0.7 for score in likelihoods.values()):
        print("High confidence extraction - Saving to DB...")
        # db.save(event)  # Pseudo-code for DB integration
    else:
        print("Low confidence - Review manually")
except ValidationError as e:
    print(f"Validation failed: {e}")

print(f"Processed with {result.usage.total_tokens} tokens")

Use Case: Using Additional Messages for Context

Use additional_messages to provide extra context or specific instructions that help guide the extraction. This is useful when you need to clarify ambiguous fields, provide domain-specific knowledge, or correct the model’s behavior.
from retab import Retab

client = Retab()

# Extract invoice data with additional context
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    model="gpt-4.1-nano",
    json_schema={
        'properties': {
            'vendor_name': {'type': 'string', 'description': 'Name of the vendor'},
            'invoice_number': {'type': 'string', 'description': 'Invoice number'},
            'total_amount': {'type': 'number', 'description': 'Total amount due'},
            'currency': {'type': 'string', 'description': 'Currency code (e.g., USD, EUR)'}
        },
        'required': ['vendor_name', 'invoice_number', 'total_amount'],
        'type': 'object'
    },
    additional_messages=[
        {
            "role": "user", 
            "content": "Note: This invoice is from our European supplier. Amounts should be in EUR unless explicitly stated otherwise."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.example.com/logo.png"
                    }
                }
            ]
        }
    ]
)

print(result.choices[0].message.parsed)

Best Practices

Model Selection

  • gpt-4.1-nano: Balanced for accuracy and cost, recommended for most extraction tasks.
  • gemini-2.5-pro: Use for complex documents requiring deep contextual understanding.
  • gemini-2.5-flash: Faster and cheaper for simple extractions or high-volume processing.

Schema Design

  • Keep schemas concise: Only include the fields you actually need; smaller schemas improve extraction accuracy.
  • Use descriptive description fields: Helps the AI model understand what to extract.
  • Add X-SystemPrompt for custom guidance: E.g., “Focus on freight details” for domain-specific extractions.

Confidence Handling

  • Set a threshold (e.g., 0.7) for automated processing.
  • For critical tasks, enable n_consensus > 1 to average results and boost reliability, as sketched below.
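
A minimal sketch combining both practices; my_schema is a placeholder, and likelihoods is assumed to be the per-field score mapping shown in the use case above:

result = client.documents.extract(
    document="freight/booking_confirmation.jpg",
    model="gpt-4.1-nano",
    json_schema=my_schema,  # placeholder: your schema
    n_consensus=3,          # three runs reconciled for higher reliability
)

# Gate automated processing on the per-field confidence scores
THRESHOLD = 0.7
flagged = {field: score for field, score in result.likelihoods.items() if score < THRESHOLD}
if flagged:
    print(f"Fields below threshold, route to manual review: {flagged}")
else:
    print("High confidence on all fields, safe to process automatically")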

Modality Choice

  • Text: For clean, text-heavy documents.
  • Native: For PDFs or images with layouts preserved.
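
For example, a scanned PDF where layout matters could be processed in native mode; a minimal sketch, where the file path and my_schema are placeholders:

result = client.documents.extract(
    document="invoices/scanned_invoice.pdf",  # hypothetical path
    model="gpt-4.1-nano",
    json_schema=my_schema,
    modality="native",  # preserve the original layout
)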

Using Additional Messages

  • Use additional_messages to provide domain-specific context or clarifications.
  • Simulate a conversation: Add a user message with context, then an assistant acknowledgment to prime the model (sketched at the end of this section).
  • Ideal for: currency defaults, date format preferences, handling ambiguous abbreviations, or specifying regional conventions.
  • Keep messages concise to avoid diluting the extraction focus.
  • You can also use the create_messages API to convert additional documents into chat messages, then pass them as additional_messages for multi-document context:
# Create messages from a reference document
ref_messages = client.documents.create_messages(document="reference/template.pdf")

# Use those messages as additional context for extraction
result = client.documents.extract(
    document="invoices/invoice_001.pdf",
    json_schema=my_schema,
    model="gpt-4.1-nano",
    additional_messages=ref_messages.messages
)
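
Finally, a minimal sketch of the conversation-priming pattern described above; the document path and my_schema are placeholders:

# Prime the model with context plus a simulated assistant acknowledgment
result = client.documents.extract(
    document="invoices/invoice_002.pdf",  # hypothetical path
    json_schema=my_schema,
    model="gpt-4.1-nano",
    additional_messages=[
        {"role": "user", "content": "Dates on this supplier's invoices use DD/MM/YYYY."},
        {"role": "assistant", "content": "Understood. I will read dates as DD/MM/YYYY and return them in ISO 8601."},
    ],
)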