
Introduction

The parse method turns a document into normalized text content, returned both page-by-page and as one combined string. It is the right tool when you need readable document text for RAG pipelines, search indexing, prompting, debugging, or any workflow that works on free text rather than schema-constrained extraction. Unlike extract, parse does not try to fit the document into a JSON schema. Instead, it returns:
  • pages: one parsed string per page
  • text: the full document content as a single string
  • document: basic file metadata
  • usage: page count and credits consumed
Table content can be rendered as html, markdown, yaml, or json, depending on what your downstream system expects. For chunking, chonkie is a good fit for RAG-style pipelines.
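As a rough illustration of the response shape described above, the sketch below walks the per-page strings of a stand-in object. The stand-in is built with SimpleNamespace and only mimics the field names listed here (pages, text, document, usage); its values, the newline used to join pages, and the summarize_pages helper are assumptions for illustration, not part of the Retab SDK.

```python
from types import SimpleNamespace

# Stand-in for a parse result. In real use this object would come from
# client.documents.parse(...); every value below is made up.
result = SimpleNamespace(
    pages=["Page one text.", "Page two text, slightly longer."],
    # How the real `text` field joins pages is not specified here;
    # the newline separator is an assumption.
    text="Page one text.\nPage two text, slightly longer.",
    document=SimpleNamespace(filename="example.pdf"),
    usage=SimpleNamespace(page_count=2),
)


def summarize_pages(result):
    """Return (page_number, character_count) for each parsed page."""
    return [(n, len(page)) for n, page in enumerate(result.pages, start=1)]


summary = summarize_pages(result)
for page_number, char_count in summary:
    print(f"page {page_number}: {char_count} characters")
```

A per-page summary like this can be handy when debugging, for example to spot pages that came back empty.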

Parse API

ParseRequest
Returns
ParseResponse
A parsed document payload with text content and usage metadata.

Use Case: Preparing Documents For RAG

This pattern is useful when you want Retab to handle document parsing and your application to handle chunking and indexing.
from retab import Retab
from chonkie import SentenceChunker

client = Retab()

# Parse the document into per-page text.
result = client.documents.parse(
    document="technical-manual.pdf",
    model="retab-small",
    table_parsing_format="markdown",  # markdown tables suit chunking and prompting
    image_resolution_dpi=192,  # general-purpose starting point
)

# Sentence-aware chunker: 512-token chunks with a 128-token overlap.
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=512,
    chunk_overlap=128,
    min_sentences_per_chunk=1,
)

# Chunk each page separately so every chunk keeps its page number.
all_chunks = []
for page_num, page_text in enumerate(result.pages, start=1):
    chunks = list(chunker(page_text))
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(
            {
                "page": page_num,
                "chunk_id": f"page_{page_num}_chunk_{chunk_idx}",
                "text": str(chunk),
                # Keep the source filename so retrieval results stay traceable.
                "document": result.document.filename,
            }
        )

print(f"Created {len(all_chunks)} chunks from {result.usage.page_count} pages")

Best Practices

When To Use Parse

  • Use parse when your downstream system wants readable text.
  • Use extract when you need typed fields that match a schema.

Picking A Table Format

  • Use markdown for chunking, prompting, and most RAG pipelines.
  • Use html when preserving table structure matters more than readability.
  • Use json or yaml when another parser will consume the table output directly.
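The guidance above could be captured in a small lookup, sketched below. The pick_table_format helper, the consumer labels, and the markdown fallback are all hypothetical; only the four format names and the table_parsing_format parameter come from this page.

```python
# Hypothetical helper, not part of the Retab SDK: maps a downstream
# consumer to one of the four table formats named above.
TABLE_FORMAT_BY_CONSUMER = {
    "chunking": "markdown",
    "prompting": "markdown",
    "rag": "markdown",
    "structure": "html",  # table structure matters more than readability
    "parser": "json",  # another parser consumes the output; "yaml" would also fit
}


def pick_table_format(consumer: str) -> str:
    """Return a table format for the consumer, falling back to markdown."""
    return TABLE_FORMAT_BY_CONSUMER.get(consumer.lower(), "markdown")


table_format = pick_table_format("rag")
print(table_format)
```

The returned string would then be passed as table_parsing_format in the parse call.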

Choosing DPI

  • Start with 192 DPI for general-purpose parsing.
  • Drop to 96 DPI when throughput matters more than OCR quality.
  • Increase toward 300 DPI for scans, fine print, or low-quality images.
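One way to encode these DPI rules is a small selector like the sketch below. The pick_dpi function and its flag names are hypothetical, and letting quality concerns win over throughput when both apply is an assumption; the three DPI values are the ones suggested above.

```python
def pick_dpi(
    is_scan: bool = False,
    fine_print: bool = False,
    prioritize_throughput: bool = False,
) -> int:
    """Pick a DPI value following the guidance above.

    Hypothetical helper: scans and fine print get the upper value even
    when throughput is also requested (an assumption, not a Retab rule).
    """
    if is_scan or fine_print:
        return 300
    if prioritize_throughput:
        return 96
    return 192


dpi = pick_dpi(is_scan=True)
print(dpi)
```

The result would be passed as image_resolution_dpi in the parse call.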

Indexing Advice

  • Store the page number with every chunk you create from result.pages.
  • Keep the original document.id or document.filename alongside indexed text so retrieval results remain traceable.
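A minimal sketch of traceable index records follows. The IndexRecord dataclass, build_records, and source_of are hypothetical names, and the chunk text is made up; the chunk_id pattern mirrors the RAG example earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """One indexed chunk plus the fields needed to trace it back."""

    document: str  # document.id or document.filename
    page: int  # 1-based page number from result.pages
    chunk_id: str
    text: str


def build_records(document, chunks_by_page):
    """Build records from a list of per-page chunk lists."""
    records = []
    for page, chunks in enumerate(chunks_by_page, start=1):
        for idx, text in enumerate(chunks):
            chunk_id = f"page_{page}_chunk_{idx}"
            records.append(IndexRecord(document, page, chunk_id, text))
    return records


def source_of(record):
    """Format a human-readable source reference for a retrieval hit."""
    return f"{record.document}, page {record.page}"


# Made-up chunks standing in for chunker output on a two-page document.
records = build_records(
    "technical-manual.pdf",
    [["Intro."], ["Install steps.", "Wiring diagram notes."]],
)
hit = records[-1]
print(source_of(hit))
```

With the document and page stored on every record, a retrieval hit can be shown to users with its exact source.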