
Introduction

The Files API lets you upload, manage, and retrieve documents stored in Retab. Files are the foundation of document processing: once uploaded, a file can be reused across classify, split, extract, parse, and workflow calls without sending the bytes again. The module exposes four methods:
| Method | Purpose |
| --- | --- |
| upload | Upload a document and receive a durable MIMEData reference for future requests. |
| list | List uploaded files with pagination, filename prefix search, and MIME type filtering. |
| get | Retrieve metadata for a single file by ID. |
| get_download_link | Get a temporary signed URL (valid for 60 minutes) to download the original file. |

Uploading files

SDK uploads use a direct-to-storage flow. The SDK first creates an upload session, uploads the bytes to the signed storage URL, then completes the upload and returns MIMEData.
from retab import Retab
from pathlib import Path

client = Retab()

# Create an upload session for a local file.
invoice_path = Path("invoice.pdf")
session = client.files.create_upload(
    filename=invoice_path.name,
    size_bytes=invoice_path.stat().st_size,
    content_type="application/pdf",
)
mime_data = session.mime_data
print(f"Filename: {mime_data.filename}")
print(f"URL: {mime_data.url}")

The returned url has the form https://storage.retab.com/file_.... It is an opaque Retab URL, not a public signed URL, and can be passed to later processing requests without sending the file bytes again.

Large documents: avoid inline uploads

When you pass a local file path directly to an SDK processing call, the SDK may send the document as inline MIME/base64 data. This is convenient for small files, but large scanned PDFs can make the request body too large and trigger 413 Request Entity Too Large. For large documents, use one of these URL-backed flows instead:
  1. Preferred: use your own object-storage URL. Retab fetches the file server-side, so the document bytes are not sent inline in the API request. Use a time-limited signed URL when the object is private.
  2. Alternative: upload to Retab first. The SDK uploads the file once, then you pass the returned Retab storage URL to classify, split, extract, parse, or workflow calls.
URL-backed remote documents are streamed into Retab storage and capped at 2 GiB (2,147,483,648 bytes) per document.
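The choice between inline and URL-backed submission can be made from the file size alone. The following is a minimal sketch, not part of the Retab SDK: the 2 GiB cap comes from the text above, while `INLINE_THRESHOLD_BYTES` and the function names are hypothetical and should be tuned to your own request-size budget.

```python
from pathlib import Path

# Cap stated in the documentation: 2 GiB per URL-backed document.
MAX_REMOTE_BYTES = 2 * 1024**3  # 2,147,483,648 bytes

# Hypothetical client-side threshold for inline uploads. This is an
# assumption for illustration, not a documented Retab limit.
INLINE_THRESHOLD_BYTES = 10 * 1024**2


def choose_document_flow(
    size_bytes: int, inline_threshold: int = INLINE_THRESHOLD_BYTES
) -> str:
    """Return "inline" for small files and "url" for large ones.

    Raises ValueError when the document exceeds the URL-backed cap.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > MAX_REMOTE_BYTES:
        raise ValueError(
            f"document is {size_bytes} bytes; the cap is {MAX_REMOTE_BYTES}"
        )
    return "inline" if size_bytes <= inline_threshold else "url"


def flow_for_path(path: Path) -> str:
    """Convenience wrapper that reads the size from a local file."""
    return choose_document_flow(path.stat().st_size)
```

A caller would branch on the result: pass the local path for "inline", or switch to one of the two URL-backed flows for "url".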

Option 1: object-storage URL

Pass an HTTPS URL from object storage directly as the document. Supported remote URL hosts include:
| Provider | Supported URL shape |
| --- | --- |
| Azure Blob Storage | https://<account>.blob.core.windows.net/... |
| Google Cloud Storage | https://storage.googleapis.com/... or https://<bucket>.storage.googleapis.com/... |
| Amazon S3 | https://<bucket>.s3.<region>.amazonaws.com/... or other amazonaws.com S3 URLs |
| Cloudflare R2 | https://<account>.r2.cloudflarestorage.com/... and public https://<public-id>.r2.dev/... URLs |
Custom domains are not fetched by default. Contact support if you need a custom storage hostname allowlisted. For private files, generate a signed URL with enough time for Retab to fetch the document.
from retab import Retab

client = Retab(api_key="YOUR_RETAB_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

azure_blob_url = "https://<account>.blob.core.windows.net/<container>/large_document.pdf?<sas_token>"

extraction = client.extractions.create(
    document=azure_blob_url,
    model="retab-small",
    json_schema=schema,
)

print(extraction.output)
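Because custom domains are not fetched by default, it can help to check a URL's host on the client before sending the request. The sketch below approximates the host shapes in the provider table using only the standard library. The function name and matching rules are assumptions for illustration; Retab's server-side allowlist remains the authority, and this check may be looser or stricter than it.

```python
from typing import Optional
from urllib.parse import urlparse

# Host suffixes approximating the provider table. These are a client-side
# guess, not Retab's actual allowlist.
_PROVIDER_SUFFIXES = {
    "Azure Blob Storage": (".blob.core.windows.net",),
    "Google Cloud Storage": (".storage.googleapis.com",),
    "Cloudflare R2": (".r2.cloudflarestorage.com", ".r2.dev"),
}
_EXACT_HOSTS = {"storage.googleapis.com": "Google Cloud Storage"}


def _is_s3_host(host: str) -> bool:
    """Match amazonaws.com hosts that carry an "s3" label."""
    labels = host.split(".")
    has_s3_label = any(
        label == "s3" or label.startswith("s3-") for label in labels
    )
    return host.endswith(".amazonaws.com") and has_s3_label


def guess_storage_provider(url: str) -> Optional[str]:
    """Return the provider name for a supported-looking HTTPS URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme != "https" or not host:
        return None
    if host in _EXACT_HOSTS:
        return _EXACT_HOSTS[host]
    if _is_s3_host(host):
        return "Amazon S3"
    for provider, suffixes in _PROVIDER_SUFFIXES.items():
        if host.endswith(suffixes):
            return provider
    return None
```

A `None` result suggests either uploading the file to Retab first or asking support to allowlist the hostname.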

Option 2: upload to Retab, then reuse the URL

If you do not have an object-storage URL available, upload the file to Retab first and use the returned mime_ref.url.
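from pathlib import Path

from retab import Retab

client = Retab(api_key="YOUR_RETAB_API_KEY")

# Report the real size of the local file rather than a hard-coded value.
document_path = Path("large_document.pdf")
session = client.files.create_upload(
    filename=document_path.name,
    size_bytes=document_path.stat().st_size,
    content_type="application/pdf",
)
mime_ref = session.mime_data

# Pass the Retab storage URL instead of the file bytes.
extraction = client.extractions.create(
    document=mime_ref.url,
    model="retab-small",
    json_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
    },
)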

You can also pass document=mime_ref directly. Passing mime_ref.url is equivalent for Retab storage URLs: the backend parses the file ID and resolves it against the authenticated caller’s organization before processing.

Security model

Signed object-storage URLs are bearer URLs controlled by you. Keep them time-limited and scoped to the single document being processed. Public object-storage URLs, such as public Cloudflare R2 r2.dev URLs, can also be fetched but are not access-restricted by the URL itself. Retab storage URLs such as https://storage.retab.com/file_... are different: they are opaque Retab file references, not public download links. Retab resolves the file ID against the authenticated caller’s organization. If the file is missing, belongs to another organization, or is not fully uploaded, the request is rejected.
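The rejection rules for Retab storage URLs can be illustrated with a toy model. This is not Retab's backend code: the `StoredFile` record, the in-memory registry, the organization IDs, and the assumption that the file ID is the first path segment are all hypothetical, chosen only to mirror the three rejection cases described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlparse

RETAB_STORAGE_HOST = "storage.retab.com"


@dataclass(frozen=True)
class StoredFile:
    """Hypothetical server-side record for an uploaded file."""
    organization_id: str
    upload_complete: bool


def extract_file_id(url: str) -> Optional[str]:
    """Return the file ID from a Retab storage URL, or None if it is not one.

    Assumes the ID is the first path segment and starts with "file_".
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != RETAB_STORAGE_HOST:
        return None
    first_segment = parsed.path.lstrip("/").split("/", 1)[0]
    if first_segment.startswith("file_") and len(first_segment) > len("file_"):
        return first_segment
    return None


def resolve_reference(
    url: str, caller_org: str, registry: Dict[str, StoredFile]
) -> str:
    """Resolve a storage URL for the caller, rejecting the documented cases."""
    file_id = extract_file_id(url)
    if file_id is None:
        raise ValueError("not a Retab storage URL")
    record = registry.get(file_id)
    missing = record is None
    if missing or record.organization_id != caller_org or not record.upload_complete:
        # Missing, owned by another organization, or not fully uploaded.
        raise PermissionError("file reference rejected")
    return file_id
```

The point of the model is that possession of the URL alone grants nothing: every lookup is keyed on the caller's organization, unlike a signed bearer URL.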

The file data structure

File Object
{
  "id": "file_a1b2c3d4e5f6",
  "object": "file",
  "filename": "invoice.pdf",
  "page_count": 3,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}
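When consuming this JSON outside the SDK, it can be convenient to map it onto a typed record. The sketch below assumes only the six fields shown in the example payload; the `FileRecord` class and helper names are illustrative and are not the SDK's own types.

```python
import json
from dataclasses import dataclass
from datetime import datetime

SAMPLE_PAYLOAD = """
{
  "id": "file_a1b2c3d4e5f6",
  "object": "file",
  "filename": "invoice.pdf",
  "page_count": 3,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}
"""


@dataclass(frozen=True)
class FileRecord:
    id: str
    object: str
    filename: str
    page_count: int
    created_at: datetime
    updated_at: datetime


def _parse_timestamp(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_file_record(payload: str) -> FileRecord:
    """Parse a file object JSON string into a FileRecord."""
    data = json.loads(payload)
    return FileRecord(
        id=data["id"],
        object=data["object"],
        filename=data["filename"],
        page_count=int(data["page_count"]),
        created_at=_parse_timestamp(data["created_at"]),
        updated_at=_parse_timestamp(data["updated_at"]),
    )
```

Parsing the timestamps into timezone-aware datetimes makes it straightforward to sort files or compare `created_at` against `updated_at`.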

Listing and filtering

Use list to browse uploaded files with ID-based pagination:
# List recent files
files = client.files.list(limit=20)
for f in files:
    print(f"{f.id}: {f.filename}")

# Filter by filename prefix and MIME type
pdfs = client.files.list(filename="invoice", mime_type="application/pdf")
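If a single call returns only one page, ID-based pagination generally means passing the last seen ID as a cursor for the next request. The sketch below shows that loop against an in-memory stand-in rather than the real API. The `after` cursor argument and the `fetch_page` callable are assumptions, so check the SDK reference for the actual parameter names and for whether list already iterates across pages.

```python
from typing import Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, str]]
FetchPage = Callable[[Optional[str], int], Page]


def iter_all_files(fetch_page: FetchPage, limit: int = 20) -> Iterator[Dict[str, str]]:
    """Yield every file by following an ID cursor until a short page appears."""
    after: Optional[str] = None
    while True:
        page = fetch_page(after, limit)
        yield from page
        if len(page) < limit:
            return  # A short (or empty) page means there is nothing left.
        after = page[-1]["id"]


# In-memory stand-in for the server, used only to exercise the loop.
_FAKE_FILES: Page = [
    {"id": f"file_{i:03d}", "filename": f"invoice_{i}.pdf"} for i in range(45)
]


def fake_fetch_page(after: Optional[str], limit: int) -> Page:
    """Return up to `limit` files that come after the given ID."""
    start = 0
    if after is not None:
        ids = [f["id"] for f in _FAKE_FILES]
        start = ids.index(after) + 1
    return _FAKE_FILES[start:start + limit]
```

Stopping on a short page costs one extra request when the total is an exact multiple of the page size, but it needs no separate "has more" flag from the server.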

Downloading files

Retrieve a time-limited signed URL to download the original file:
link = client.files.get_download_link("file_a1b2c3d4e5f6")
print(f"Download URL: {link.download_url}")
print(f"Expires in: {link.expires_in}")
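Since the signed URL is temporary, it is usually simplest to fetch the file right after requesting the link. The helper below is a minimal standard-library sketch for streaming a URL to disk. The function name and the 60-second timeout are illustrative choices, not part of the SDK.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, destination: Path, timeout: float = 60.0) -> int:
    """Stream `url` to `destination` and return the number of bytes written."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with destination.open("wb") as out:
            # Copy in chunks so large documents are never held fully in memory.
            shutil.copyfileobj(response, out)
    return destination.stat().st_size
```

With the example above, this would be called as `download_to(link.download_url, Path("downloads/invoice.pdf"))`. If the link has expired, request a new one rather than retrying the old URL.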