Introduction
The split method in Retab’s document processing pipeline analyzes multi-page documents and classifies pages into user-defined subdocuments, returning the page ranges for each section. This endpoint is suited to processing batches of mixed documents, separating combined PDFs, and organizing document collections by content type.
Common use cases include:
- Document Separation: Split a combined PDF containing multiple invoices, receipts, or contracts into individual sections
- Content Classification: Identify and locate different sections within legal documents, reports, or manuals
- Batch Processing: Process scanned document batches and organize them by document type
- Workflow Automation: Route different document types to appropriate processing pipelines
Key features include:
- Multi-Subdocument Support: Define multiple subdocuments with descriptions for accurate classification
- Discontinuous Sections: The same subdocument can appear multiple times when its content is non-contiguous
- Page-Level Precision: Get exact start and end pages for each section
- Vision-Based Analysis: Uses LLM vision capabilities for accurate page classification
- Flexible Subdocuments: Define custom subdocuments tailored to your document types
Split API
Returns a SplitResponse object containing the classified sections with their page ranges.
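To make the idea of classified sections with page ranges concrete, here is a minimal Python sketch that groups page ranges by subdocument. The Section class and its name, start_page, and end_page fields are assumptions made for illustration, not the actual SplitResponse schema, and pages are assumed to be 1-indexed and inclusive.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Hypothetical stand-in for one classified section of a split result."""
    name: str        # subdocument name (assumed field)
    start_page: int  # first page, 1-indexed and inclusive (assumed)
    end_page: int    # last page, inclusive (assumed)


def pages_by_subdocument(sections):
    """Map each subdocument name to the sorted list of pages it covers."""
    pages = defaultdict(list)
    for section in sections:
        pages[section.name].extend(range(section.start_page, section.end_page + 1))
    return {name: sorted(found) for name, found in pages.items()}


# Illustrative data only; not output from a real API call.
example_sections = [
    Section("invoice", 1, 2),
    Section("receipt", 3, 3),
    Section("invoice", 4, 5),  # same subdocument again, non-contiguous
]
grouped = pages_by_subdocument(example_sections)
print(grouped)
```

Grouping by name rather than assuming one range per subdocument keeps the sketch valid when a subdocument shows up more than once in the same file.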
Use Case: Processing Mixed Document Batches
Split a batch of scanned documents into individual invoices, receipts, and contracts for separate processing.
Use Case: Extracting Specific Sections from Reports
Identify and locate specific sections within a large report or manual.
Understanding Discontinuous Sections
The Split API correctly handles cases where the same subdocument appears multiple times in a document. This is common when documents are interleaved or when similar content appears in different parts of a document.
Use Case: Partitioning by Key
When processing documents that contain multiple items of the same type (e.g., multiple invoices, multiple property listings), use the partition_key parameter to identify and separate individual items within a subdocument.
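As a rough sketch of how partitioned results might be consumed, the Python below merges partitions that share a key into one page list per item. Only the partition_key name comes from the text above; where it is supplied in the request, and the key and pages field names on each partition, are assumptions for illustration.

```python
# Hypothetical subdocument definition. Placing partition_key on the
# subdocument itself is an assumption, not confirmed API behavior.
invoice_subdocument = {
    "name": "invoice",
    "description": "Vendor invoices with an invoice number and line items",
    "partition_key": "invoice number",
}

# Illustrative partitions; "key" and "pages" are assumed field names.
example_partitions = [
    {"key": "INV-001", "pages": [1, 2]},
    {"key": "INV-002", "pages": [3]},
    {"key": "INV-001", "pages": [6]},  # same item continues later in the file
]


def pages_per_item(partitions):
    """Merge partitions sharing a key into one sorted, de-duplicated page list."""
    items = {}
    for part in partitions:
        items.setdefault(part["key"], set()).update(part["pages"])
    return {key: sorted(pages) for key, pages in items.items()}


items = pages_per_item(example_partitions)
print(items)
```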
Sub-Page Precision with Partitions
The Split API provides sub-page precision through the partitions field. Each partition includes Y-coordinates that specify exactly where content starts and ends within pages, enabling precise extraction even when document sections don’t align with page boundaries.
- 0.0 represents the top of the page
- 1.0 represents the bottom of the page
- first_page_y_start indicates where content begins on the first page of the partition
- last_page_y_end indicates where content ends on the last page of the partition
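The normalized coordinates can be turned into pixel crop bands with a little arithmetic. The sketch below assumes a partition exposes an ordered list of page numbers (a hypothetical pages field) alongside first_page_y_start and last_page_y_end, and that every page is rendered at the same pixel height. Intermediate pages are taken in full.

```python
def crop_bands(pages, first_page_y_start, last_page_y_end, page_height_px):
    """Return {page: (top_px, bottom_px)} for the content of one partition.

    Y values are normalized: 0.0 is the top of a page, 1.0 is the bottom.
    """
    bands = {}
    last_index = len(pages) - 1
    for index, page in enumerate(pages):
        # Only the first page starts part-way down; only the last ends early.
        top = first_page_y_start if index == 0 else 0.0
        bottom = last_page_y_end if index == last_index else 1.0
        bands[page] = (round(top * page_height_px), round(bottom * page_height_px))
    return bands


# Illustrative values: content starts 40% down page 2 and ends 25% down page 4.
bands = crop_bands([2, 3, 4], 0.4, 0.25, 1000)
print(bands)
```

For a single-page partition the same page is both first and last, so both offsets apply to it.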
Best Practices
Subdocument Definition
- Be Specific: Provide detailed descriptions that distinguish subdocuments clearly
- Use Visual Cues: Mention distinctive visual elements (logos, headers, layouts)
- Include Examples: Reference typical content found in each subdocument
- Avoid Overlap: Ensure subdocuments are mutually exclusive when possible
Model Selection
- retab-large: Best balance of speed and accuracy for most use cases
- retab-small: Higher accuracy for complex or ambiguous documents
- retab-micro: Alternative for specific document types
Performance Tips
- Batch Similar Documents: Group similar document types for consistent results
- Limit Subdocuments: Use 3-7 well-defined subdocuments for best accuracy
- Test Descriptions: Iterate on subdocument descriptions to improve classification
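Some of these practices can be checked mechanically before sending a request. The Python sketch below is one possible pre-flight check, not part of the Retab SDK: it assumes subdocuments are plain dictionaries with name and description keys, and the five-word minimum for a description is an arbitrary threshold chosen for illustration.

```python
def check_subdocuments(subdocuments, max_count=7, min_description_words=5):
    """Return a list of warnings; an empty list means no issues were found."""
    warnings = []
    if len(subdocuments) > max_count:
        warnings.append(f"more than {max_count} subdocuments: {len(subdocuments)}")
    seen = set()
    for sub in subdocuments:
        name = sub.get("name", "")
        if name in seen:
            warnings.append(f"duplicate name: {name!r}")
        seen.add(name)
        # Very short descriptions rarely distinguish one subdocument from another.
        if len(sub.get("description", "").split()) < min_description_words:
            warnings.append(f"description for {name!r} is very short")
    return warnings


# Illustrative definitions; the second one is deliberately too terse.
example_subdocuments = [
    {"name": "invoice", "description": "Vendor invoice with invoice number, line items, and totals"},
    {"name": "receipt", "description": "Receipt"},
]
issues = check_subdocuments(example_subdocuments)
print(issues)
```

A check like this only catches structural problems; whether descriptions actually separate the subdocuments still has to be tested against real documents.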