Experiments - Retab Docs

What are Experiments?

Experiments are controlled evaluations of individual workflow blocks. They run the same block with multiple consensus passes over the same set of documents and use agreement between those passes as the quality signal. Use experiments when you want to answer questions like:

Did this schema change make extraction fields more stable?
Which invoice documents are causing low agreement?
Which split category or classifier category is ambiguous?
Did a prompt, category, or split-definition change improve the block?

Experiments do not require ground-truth labels. They are consensus evals: Retab asks the block to produce several independent candidate outputs, compares those candidate outputs, and reports where the candidates agree or disagree. A higher score means stronger agreement. A low score points to an unstable document, field, category, subdocument type, or split-by-key partition.
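To make the consensus idea concrete, here is a minimal sketch that scores one target by the share of candidate outputs matching the most common value. This is an illustration only: the scoring formula Retab actually uses is not described here, and agreement_score is a hypothetical helper name.

```python
from collections import Counter
import json


def agreement_score(candidates):
    """Return (majority_value, share of candidates that match it)."""
    if not candidates:
        raise ValueError("need at least one candidate output")
    # Serialize so dicts and lists can be counted like scalars.
    keys = [json.dumps(c, sort_keys=True) for c in candidates]
    winner, votes = Counter(keys).most_common(1)[0]
    return json.loads(winner), votes / len(keys)


# Five hypothetical passes over one invoice field.
passes = ["1200.00", "1200.00", "1200.00", "1200.00", "1,200"]
value, score = agreement_score(passes)
print(value, score)  # 1200.00 0.8
```

A score of 1.0 would mean every pass produced the same value; the single "1,200" outlier is the kind of disagreement an experiment is meant to surface.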

Experiments vs Tests

Tests and experiments both help you keep workflow changes under control, but they answer different questions.

Tool	Best for	Signal
Tests	Checking a specific expected output	Pass, fail, or error against an assertion
Experiments	Measuring output stability across documents	Consensus score and disagreement details

Use a test when you know what the output should be. Use an experiment when you want to find weak spots, compare block configurations, or inspect whether a block is internally consistent before you write stricter assertions.

Supported Blocks

Experiments are currently supported for:

Block	What Retab measures
Extract	Field-level agreement for extracted JSON values
Split	Agreement on subdocument/page assignments
Classifier	Agreement on routing category decisions
For Each	Key-level agreement when the block is configured as split-by-key

Other workflow blocks can still be tested with workflow tests, but they do not currently produce experiment metrics.

How Experiments Work

An experiment is attached to one block in one workflow. It stores:

The block under test - the workflow block id and block kind.
A fixed document set - materialized block inputs captured from completed workflow runs or from files uploaded while creating the experiment.
A consensus count - 3, 5, or 7 independent passes.

When you run the experiment, Retab freezes the current block configuration and replays the selected block for each document. Each document execution becomes an experiment result under the parent experiment run. The result stores the canonical artifact produced by the block:

Block	Artifact
Extract	Extraction
Split	Split
Classifier	Classification
For Each split-by-key	Partition

Experiment-run metrics are normalized to the same three-axis shape for every supported block:

document × target × voter

The target depends on the block type:

Block	Target
Extract	Field
Split	Subdocument
Classifier	Category
For Each split-by-key	Key

This lets the dashboard show the same core views for different block types: overall summary, by-document scores, by-target scores, and voter-level disagreement.
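One way to picture the document × target × voter shape is as a mapping from (document, target) cells to the list of voter outputs. The sketch below uses made-up file names and values, and a simple majority-share score that is an assumption rather than Retab's formula. It shows how by-document and by-target views can fall out of the same cells.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical cells: (document, target) -> one output per voter.
votes = {
    ("invoice_a.pdf", "total"): ["120.00", "120.00", "120.00"],
    ("invoice_a.pdf", "vendor_name"): ["Acme", "Acme Inc", "Acme"],
    ("invoice_b.pdf", "total"): ["88.10", "88.10", "88.10"],
    ("invoice_b.pdf", "vendor_name"): ["Globex", "Globex LLC", "GLOBEX"],
}


def cell_score(outputs):
    """Share of voters that agree with the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


def average_by(votes, axis):
    """Average cell scores along one axis: 0 = document, 1 = target."""
    groups = defaultdict(list)
    for key, outputs in votes.items():
        groups[key[axis]].append(cell_score(outputs))
    return {name: mean(scores) for name, scores in groups.items()}


by_document = average_by(votes, 0)
by_target = average_by(votes, 1)
weakest_target = min(by_target, key=by_target.get)
print(weakest_target)  # vendor_name
```

Because every block type reduces to the same cells, only the meaning of "target" changes between an Extract field and a Classifier category.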

Creating an Experiment

Open a workflow in the dashboard.
Go to Console -> Experiments.
Click Experiment.
Name the experiment.
Choose the number of consensus passes: 3, 5, or 7.
Select the block to evaluate.
Select files from completed runs, or upload files for the experiment.
Create the experiment.

When you create an experiment from the dashboard, Retab creates the experiment definition and immediately starts an experiment run for it. Metrics are always scoped to that run, so the experiment page updates while the run is pending or running.

Reading Results

The experiment detail page has three sections:

Section	Purpose
Config	Review or edit the block configuration being evaluated.
Data	Inspect per-document outputs and the underlying artifacts.
Metrics	Analyze experiment-run scores, weak targets, document-level failures, and voter disagreements.

The metrics views help you move from broad signal to specific evidence:

Summary shows the overall experiment score, target averages, document averages, and previous-run delta when available.
By document shows which files are least stable.
By target shows which fields, categories, subdocuments, or keys are least stable across the document set.
Votes shows the individual candidate outputs from an experiment run for one document-target cell, including the agreed value and disagreements.

Split and classifier experiments also expose specialized visualizations, such as confusion-style views, to make routing and page-assignment ambiguity easier to inspect.

Staleness and Re-runs

Experiment metrics belong to a specific experiment run, block configuration, and document set, and are read from the experiment-run metrics API. If you edit the block or change the experiment documents, Retab marks the latest run’s metrics as stale. If the output schema changes, Retab can also report schema drift. When an experiment is stale, run it again to refresh the score against the current workflow draft. Retab keeps run history, so you can compare the latest score with earlier runs and see whether a configuration change improved or degraded the block.
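The staleness rule can be illustrated with a fingerprint over the inputs a run depends on. This is a sketch of the general idea, not Retab's mechanism: fingerprint and is_stale are hypothetical helpers, and the config fields are invented for the example.

```python
import hashlib
import json


def fingerprint(block_config, document_ids):
    """Hash the inputs that a run's metrics depend on."""
    payload = {"config": block_config, "documents": sorted(document_ids)}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_stale(run_fingerprint, block_config, document_ids):
    """True when the current inputs no longer match the ones the run used."""
    return run_fingerprint != fingerprint(block_config, document_ids)


config = {"schema": {"total": "number"}, "prompt": "Extract the invoice."}
docs = ["doc_1", "doc_2"]
at_run_time = fingerprint(config, docs)

# Editing the block changes the fingerprint, so the old metrics are stale.
edited = {**config, "prompt": "Extract the invoice total in EUR."}
print(is_stale(at_run_time, config, docs))  # False
print(is_stale(at_run_time, edited, docs))  # True
```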

Recommended Workflow

Run the workflow with representative documents.
Create an experiment for an Extract, Split, Classifier, or split-by-key For Each block.
Start with 3 consensus passes while iterating quickly.
Inspect the lowest-scoring documents and targets.
Adjust the schema, prompt, categories, or split definitions.
Re-run the experiment and compare the score with the previous run.
Add workflow tests for outputs that should now be protected with explicit assertions.

Experiments work best as a discovery and comparison tool. They tell you where a block is uncertain; tests then lock in the behaviors you decide are correct.
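The compare step in the list above can be sketched as a per-target delta between two runs. The score dictionaries here are invented, and score_deltas is a hypothetical helper; in practice the numbers would come from the by-target metrics of each run.

```python
def score_deltas(previous, latest):
    """Return {target: latest - previous} for targets present in both runs."""
    return {
        target: round(latest[target] - previous[target], 4)
        for target in latest
        if target in previous
    }


# Hypothetical by-target scores from two runs of the same experiment.
previous = {"total": 0.9, "vendor_name": 0.4, "due_date": 0.8}
latest = {"total": 0.9, "vendor_name": 0.7, "due_date": 0.6}

deltas = score_deltas(previous, latest)
regressions = sorted(t for t, d in deltas.items() if d < 0)
print(regressions)  # ['due_date']
```

Looking at deltas per target rather than only the overall score helps catch a change that improves one field while degrading another.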

Using the SDK

The dashboard flow above maps onto a small set of SDK calls. The same calls back the MCP tool surface, so anything you can do interactively or through an agent you can do programmatically.

Create an experiment

Pick a supported block, give the experiment a name and a document set, and choose how many consensus passes per document (3, 5, or 7). Documents come from prior workflow runs (via document_captures) or as explicit handle inputs.

from retab import Retab

client = Retab()

experiment = client.workflows.experiments.create(
    workflow_id="wf_abc123",
    block_id="extract-invoice",
    name="Q1 invoices",
    document_captures=[
        {"run_id": "wfrun_1"},
        {"run_id": "wfrun_2", "step_id": "for_each-0"},
    ],
    n_consensus=5,
)

Unlike the dashboard flow, creating an experiment through the SDK does not trigger a run. The document set is registered, but no metrics exist until you start a run.

Run the experiment

Trigger consensus runs against the current draft block config. This is async; the SDK returns a run resource that you inspect through the experiment runs API.

run = client.workflows.experiments.runs.create(
    workflow_id="wf_abc123",
    experiment_id=experiment.id,
)

print(run.id, run.lifecycle.status)

Runs use the experiment’s stored n_consensus and document set.

Inspect the run and results

Poll the run until lifecycle.status is terminal, then read the per-document results produced by that run.

# Fetch the current state of the run; repeat until lifecycle.status is terminal.
run = client.workflows.experiments.runs.get(run.id)

results = client.workflows.experiments.runs.results.list(run.id)
for result in results.data:
    print(result.document_id, result.lifecycle.status)

result = client.workflows.experiments.runs.results.get(
    run.id,
    document_id="expdoc_xyz",
)
print(result.artifact)
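The polling step can be wrapped in a small helper. The sketch below is an assumption-laden illustration: wait_for_run is a hypothetical name, and the terminal status strings are guesses that should be checked against the API reference. It takes any zero-argument fetch function, so the demo runs against a fake run object instead of the real SDK.

```python
import time
from types import SimpleNamespace

# Assumed status names; confirm the real values in the API reference.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_run(fetch_run, interval=2.0, timeout=600.0):
    """Call fetch_run() until the run's lifecycle.status is terminal."""
    deadline = time.monotonic() + timeout
    while True:
        run = fetch_run()
        if run.lifecycle.status in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("experiment run did not finish in time")
        time.sleep(interval)


# Demo with a fake run that becomes terminal on the third poll.
statuses = iter(["pending", "running", "completed"])


def fake_fetch():
    return SimpleNamespace(lifecycle=SimpleNamespace(status=next(statuses)))


finished = wait_for_run(fake_fetch, interval=0.0)
print(finished.lifecycle.status)  # completed
```

With the real client, the fetch function would be something like lambda: client.workflows.experiments.runs.get(run.id), using the call shown above.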

Read metrics

Metrics live under the experiment run and are exposed as four views: summary, by document, by target, and votes. Start at summary, drill into by_target for low-scoring fields, then open votes to see voter disagreement on a specific cell. These views are served from /v1/workflows/experiments/metrics?run_id={run_id}. Metric responses use kind as the shared discriminator across successful views and data-state payloads. Successful payloads also echo view for compatibility; stale or missing metrics return kind: "stale_metrics" or kind: "no_metrics".

summary = client.workflows.experiments.runs.metrics.get(
    run.id,
    view="summary",
)

# A weak field surfaces in summary.aggregate.likelihoods - drill in

target_view = client.workflows.experiments.runs.metrics.get(
    run.id,
    view="by_target",
    target_path="line_items.*.unit_price",
)

# To see what each voter said for one document/target cell

votes = client.workflows.experiments.runs.metrics.get(
    run.id,
    view="votes",
    document_id="expdoc_xyz",
    target_path="line_items.*.unit_price",
)

If a run is stale relative to the current block config or document set, runs.metrics.get(...) returns a kind: "stale_metrics" data-state envelope — call runs.create(...) to recompute and then inspect the new run’s metrics.
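Since kind is the shared discriminator, calling code can branch on it before touching any scores. The sketch below treats the payload as a plain dict, which is an assumption (the SDK may return typed objects), and metrics_state is a hypothetical helper. Only the "stale_metrics" and "no_metrics" values come from the text above.

```python
def metrics_state(payload):
    """Classify a metrics payload by its kind discriminator."""
    kind = payload.get("kind")
    if kind == "stale_metrics":
        return "stale"    # block config or document set changed since the run
    if kind == "no_metrics":
        return "missing"  # no metrics are available for this run
    return "ok"           # assumed: any other kind is a successful view


print(metrics_state({"kind": "stale_metrics"}))  # stale
print(metrics_state({"kind": "no_metrics"}))     # missing
print(metrics_state({"kind": "summary"}))        # ok
```

On "stale", the next step is a new run via runs.create(...), as described above; on "ok", the payload can be read as the requested view.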

Update / delete

# Change the document set or n_consensus — invalidates existing metrics
client.workflows.experiments.update(
    experiment_id=experiment.id,
    n_consensus=7,
)

client.workflows.experiments.delete(
    experiment_id=experiment.id,
)

For the full method reference (including async variants under AsyncRetab.workflows.experiments), see the API reference.
