Parsing

Introduction

The parse method turns a document into normalized text content, returned both page-by-page and as one combined string. It is the right tool when you need readable document text for RAG pipelines, search indexing, prompting, debugging, or any workflow that works on free text rather than schema-constrained extraction. Unlike extract, parse does not try to fit the document into a JSON schema. Instead, it returns:

pages: one parsed string per page
text: the full document content as a single string
document: basic file metadata
usage: page count and credits consumed

Table content can be rendered as html, markdown, yaml, or json, depending on what your downstream system expects. To chunk the parsed text for RAG-style pipelines, chonkie is a good fit.

Parse API

ParseRequest

document

MIMEData

required

The document to parse. The HTTP API accepts MIMEData. The SDK also accepts convenient local inputs such as file paths, file-like objects, images, buffers, and URLs, then converts them for you.
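If you call the HTTP API directly instead of going through the SDK, you have to package the file as MIMEData yourself. The sketch below assumes MIMEData is a JSON object with a filename and a base64 data-URI url. Those field names are an assumption that this page does not confirm, and the helper name to_mime_data is hypothetical, so check the API reference before relying on either.

```python
import base64
import mimetypes
from pathlib import Path


def to_mime_data(path: str) -> dict:
    """Package a local file as a filename plus a base64 data URI.

    ASSUMPTION: the "filename" and "url" field names are illustrative
    and are not confirmed by this page.
    """
    file_path = Path(path)
    # Fall back to a generic binary type when the extension is unknown.
    mime_type = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {
        "filename": file_path.name,
        "url": f"data:{mime_type};base64,{encoded}",
    }
```

With the SDK this step is unnecessary, since it converts local paths, buffers, and URLs for you.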

model

string

default:"retab-small"

The model used for parsing.

table_parsing_format

"markdown" | "yaml" | "html" | "json"

default:"html"

Controls how tables are represented in the parsed text.

image_resolution_dpi

integer

default:192

DPI used when rasterizing pages for OCR-backed parsing. Accepted values are 96 to 300.
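The documented limits can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: build_parse_options is a hypothetical helper, and it assumes the 96 to 300 DPI range is inclusive at both ends.

```python
ALLOWED_TABLE_FORMATS = {"markdown", "yaml", "html", "json"}
MIN_DPI, MAX_DPI = 96, 300  # assumed inclusive at both ends


def build_parse_options(
    model: str = "retab-small",
    table_parsing_format: str = "html",
    image_resolution_dpi: int = 192,
) -> dict:
    """Return parse keyword arguments after checking the documented limits."""
    if table_parsing_format not in ALLOWED_TABLE_FORMATS:
        raise ValueError(
            f"table_parsing_format must be one of {sorted(ALLOWED_TABLE_FORMATS)}, "
            f"got {table_parsing_format!r}"
        )
    if not MIN_DPI <= image_resolution_dpi <= MAX_DPI:
        raise ValueError(
            f"image_resolution_dpi must be between {MIN_DPI} and {MAX_DPI}, "
            f"got {image_resolution_dpi}"
        )
    return {
        "model": model,
        "table_parsing_format": table_parsing_format,
        "image_resolution_dpi": image_resolution_dpi,
    }
```

The returned dictionary uses the same parameter names as the request, so it can be unpacked into the parse call with the ** operator.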

Returns

ParseResponse

A parsed document payload with text content and usage metadata.

document

BaseMIMEData

Processed document metadata with id, filename, and mime_type.

usage

RetabUsage

Processing usage information including page_count and credits.

pages

array[string]

Parsed content for each page.

text

string

Full document content as a single string.
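When debugging, it can help to reduce a response to a few sanity-check figures. The sketch below reads only the fields listed above; summarize_parse_result is a hypothetical helper, and the stand-in object holds made-up values so the example runs without calling the API.

```python
from types import SimpleNamespace


def summarize_parse_result(result) -> dict:
    """Collect quick sanity-check figures from a ParseResponse-like object."""
    return {
        "filename": result.document.filename,
        "page_count": result.usage.page_count,
        "credits": result.usage.credits,
        "parsed_pages": len(result.pages),
        # 1-based numbers of pages that came back empty or whitespace-only.
        "empty_pages": [
            page_num
            for page_num, page_text in enumerate(result.pages, start=1)
            if not page_text.strip()
        ],
        "total_chars": len(result.text),
    }


# Stand-in with the documented field names; every value here is made up.
fake_result = SimpleNamespace(
    document=SimpleNamespace(
        id="doc_123", filename="manual.pdf", mime_type="application/pdf"
    ),
    usage=SimpleNamespace(page_count=2, credits=1.0),
    pages=["Intro text", "   "],
    text="Intro text",
)

summary = summarize_parse_result(fake_result)
```

A non-empty empty_pages list is a hint to inspect those pages, for example scans that may need a higher DPI.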

Use Case: Preparing Documents For RAG

This pattern is useful when you want Retab to handle document parsing and your application to handle chunking and indexing.

from retab import Retab
from chonkie import SentenceChunker

client = Retab()

# Parse the document; tables are rendered as markdown so they chunk cleanly.
result = client.documents.parse(
    document="technical-manual.pdf",
    model="retab-small",
    table_parsing_format="markdown",
    image_resolution_dpi=192,
)

# Sentence-aware chunker: chunks of up to 512 tokens with 128 tokens of overlap.
chunker = SentenceChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=512,
    chunk_overlap=128,
    min_sentences_per_chunk=1,
)

# Chunk each page separately so every chunk keeps its page number.
all_chunks = []
for page_num, page_text in enumerate(result.pages, start=1):
    chunks = list(chunker(page_text))
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(
            {
                "page": page_num,
                "chunk_id": f"page_{page_num}_chunk_{chunk_idx}",
                "text": str(chunk),
                "document": result.document.filename,
            }
        )

print(f"Created {len(all_chunks)} chunks from {result.usage.page_count} pages")

Best Practices

When To Use Parse

Use parse when your downstream system wants readable text.
Use extract when you need typed fields that match a schema.

Picking A Table Format

Use markdown for chunking, prompting, and most RAG pipelines.
Use html when preserving table structure matters more than readability.
Use json or yaml when another parser will consume the table output directly.

Choosing DPI

Start with 192 DPI for general-purpose parsing.
Drop to 96 DPI when throughput matters more than OCR quality.
Increase toward 300 DPI for scans, fine print, or low-quality images.

Indexing Advice

Store the page number with every chunk you create from result.pages.
Keep the original document.id or document.filename alongside indexed text so retrieval results remain traceable.
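One way to follow both points is to build each index record with the document id, filename, and page number attached. This is a sketch under assumptions: build_index_records and split_paragraphs are hypothetical helpers, the chunk_id layout is only one possible convention, and the paragraph splitter stands in for whichever chunker you actually use.

```python
from types import SimpleNamespace
from typing import Callable, Iterable


def build_index_records(result, split: Callable[[str], Iterable[str]]) -> list:
    """Turn a ParseResponse-like object into traceable index records."""
    records = []
    for page_num, page_text in enumerate(result.pages, start=1):
        for chunk_idx, chunk_text in enumerate(split(page_text)):
            records.append(
                {
                    # The id encodes document, page, and position on the page.
                    "chunk_id": f"{result.document.id}:p{page_num}:c{chunk_idx}",
                    "document_id": result.document.id,
                    "filename": result.document.filename,
                    "page": page_num,
                    "text": chunk_text,
                }
            )
    return records


def split_paragraphs(text: str) -> list:
    """Placeholder splitter: one chunk per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


# Stand-in response with made-up values so the example runs offline.
fake_result = SimpleNamespace(
    document=SimpleNamespace(id="doc_123", filename="manual.pdf"),
    pages=["First paragraph.\n\nSecond paragraph.", "Only paragraph on page two."],
)

records = build_index_records(fake_result, split_paragraphs)
```

Because every record carries document_id and page, a retrieval hit can be traced back to its source page without parsing the document again.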
