What are Experiments?
Experiments are controlled, block-level evaluations for workflow blocks. They run the same block with multiple consensus passes over the same set of documents and use agreement between those passes as the quality signal. Use experiments when you want to answer questions like:

- Did this schema change make extraction fields more stable?
- Which invoice documents are causing low agreement?
- Which split category or classifier category is ambiguous?
- Did a prompt, category, or split-definition change improve the block?
Experiments vs Tests
Tests and experiments both help you keep workflow changes under control, but they answer different questions.

| Tool | Best for | Signal |
|---|---|---|
| Tests | Checking a specific expected output | Pass, fail, or error against an assertion |
| Experiments | Measuring output stability across documents | Consensus score and disagreement details |
Supported Blocks
Experiments are currently supported for:

| Block | What Retab measures |
|---|---|
| Extract | Field-level agreement for extracted JSON values |
| Split | Agreement on subdocument/page assignments |
| Classifier | Agreement on routing category decisions |
| For Each | Key-level agreement when the block is configured as split-by-key |
How Experiments Work
An experiment is attached to one block in one workflow. It stores:

- The block under test - the workflow block id and block kind.
- A fixed document set - materialized block inputs captured from completed workflow runs or from files uploaded while creating the experiment.
- A consensus count - 3, 5, or 7 independent passes.

Each block kind produces its own artifact type:

| Block | Artifact |
|---|---|
| Extract | Extraction |
| Split | Split |
| Classifier | Classification |
| For Each split-by-key | Partition |

Agreement is tracked per target, which also depends on the block kind:

| Block | Target |
|---|---|
| Extract | Field |
| Split | Subdocument |
| Classifier | Category |
| For Each split-by-key | Key |
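
To make the idea of agreement between passes concrete, here is a minimal sketch of a field-level majority vote for an Extract block. It is an illustration only: the function name `field_agreement` and the scoring rule (share of passes that match the most common value) are assumptions, not Retab's actual scoring method.

```python
import json
from collections import Counter


def field_agreement(passes):
    """Score each field by the share of passes that agree on its value.

    `passes` is a list of dicts, one per consensus pass (hypothetical shape).
    """
    fields = sorted({name for output in passes for name in output})
    scores = {}
    for name in fields:
        # Serialize values so lists and dicts can be counted as votes too.
        # A field missing from a pass counts as a vote for null.
        votes = Counter(
            json.dumps(output.get(name), sort_keys=True) for output in passes
        )
        winner, count = votes.most_common(1)[0]
        scores[name] = {
            "agreed": json.loads(winner),
            "agreement": count / len(passes),
        }
    return scores


# Three passes over the same invoice; one pass misreads the total.
passes = [
    {"invoice_number": "INV-001", "total": 120.0},
    {"invoice_number": "INV-001", "total": 120.0},
    {"invoice_number": "INV-001", "total": 12.0},
]
scores = field_agreement(passes)
```

In this sketch `invoice_number` is fully stable, while `total` is agreed by only two of the three passes, which is the kind of weak target an experiment is meant to surface.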
Creating an Experiment
- Open a workflow in the dashboard.
- Go to Console -> Experiments.
- Click Experiment.
- Name the experiment.
- Choose the number of consensus passes: 3, 5, or 7.
- Select the block to evaluate.
- Select files from completed runs, or upload files for the experiment.
- Create the experiment.
Reading Results
The experiment detail page has three sections:

| Section | Purpose |
|---|---|
| Config | Review or edit the block configuration being evaluated. |
| Data | Inspect per-document outputs and the underlying artifacts. |
| Metrics | Analyze experiment-run scores, weak targets, document-level failures, and voter disagreements. |

The Metrics section has four views:

- Summary shows the overall experiment score, target averages, document averages, and previous-run delta when available.
- By document shows which files are least stable.
- By target shows which fields, categories, subdocuments, or keys are least stable across the document set.
- Votes shows the individual candidate outputs from an experiment run for one document-target cell, including the agreed value and disagreements.
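
The by-document and by-target views can be thought of as two averages over the same grid of document-target cells. The sketch below assumes a hypothetical cell shape (`document`, `target`, `score`) and a plain mean; Retab's own aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def rank_lowest(cells, axis):
    """Average cell scores along `axis` ("document" or "target"), lowest first."""
    groups = defaultdict(list)
    for cell in cells:
        groups[cell[axis]].append(cell["score"])
    averages = [(name, mean(values)) for name, values in groups.items()]
    return sorted(averages, key=lambda item: item[1])


# Hypothetical per-cell agreement scores for two documents and two fields.
cells = [
    {"document": "invoice_a.pdf", "target": "total", "score": 1.0},
    {"document": "invoice_a.pdf", "target": "vendor", "score": 1.0},
    {"document": "invoice_b.pdf", "target": "total", "score": 0.5},
    {"document": "invoice_b.pdf", "target": "vendor", "score": 1.0},
]
by_document = rank_lowest(cells, "document")
by_target = rank_lowest(cells, "target")
```

Sorting lowest first puts the least stable document and target at the top, which is where inspection usually starts.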
Staleness and Re-runs
Experiment metrics belong to a specific experiment run, block configuration, and document set, and are read from the experiment-run metrics API. If you edit the block or change the experiment documents, Retab marks the latest run’s metrics as stale. If the output schema changes, Retab can also report schema drift. When an experiment is stale, run it again to refresh the score against the current workflow draft. Retab keeps run history, so you can compare the latest score with earlier runs and see whether a configuration change improved or degraded the block.

Recommended Workflow
- Run the workflow with representative documents.
- Create an experiment for an Extract, Split, Classifier, or split-by-key For Each block.
- Start with 3 consensus passes while iterating quickly.
- Inspect the lowest-scoring documents and targets.
- Adjust the schema, prompt, categories, or split definitions.
- Re-run the experiment and compare the score with the previous run.
- Add workflow tests for outputs that should now be protected with explicit assertions.
Using the SDK
The dashboard flow above maps onto a small set of SDK calls. The same calls back the MCP tool surface, so anything you can do interactively or through an agent you can do programmatically.

Create an experiment
Pick a supported block, give the experiment a name and a document set, and choose how many consensus passes per document (3, 5, or 7). Documents come from prior workflow runs (via document_captures) or as explicit handle inputs.
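
Before calling the SDK, it can help to check the inputs locally. The sketch below does not use the Retab SDK at all: `ExperimentSpec`, its field names, and the block kind identifiers are hypothetical and chosen for illustration. Only the rule that consensus passes must be 3, 5, or 7 comes from this page.

```python
from dataclasses import dataclass, field

# Assumed identifiers for the supported block kinds (not confirmed SDK values).
SUPPORTED_BLOCK_KINDS = {"extract", "split", "classifier", "for_each"}
ALLOWED_CONSENSUS = (3, 5, 7)


@dataclass
class ExperimentSpec:
    """Hypothetical local description of an experiment to create."""

    name: str
    block_id: str
    block_kind: str
    n_consensus: int = 3
    document_ids: list = field(default_factory=list)

    def validate(self):
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if not self.name.strip():
            errors.append("name must not be empty")
        if self.block_kind not in SUPPORTED_BLOCK_KINDS:
            errors.append(f"unsupported block kind: {self.block_kind}")
        if self.n_consensus not in ALLOWED_CONSENSUS:
            errors.append("n_consensus must be 3, 5, or 7")
        if not self.document_ids:
            errors.append("document set must not be empty")
        return errors


good = ExperimentSpec(
    name="invoice-stability",
    block_id="block_123",
    block_kind="extract",
    n_consensus=5,
    document_ids=["doc_1", "doc_2"],
)
bad = ExperimentSpec(name="", block_id="block_123", block_kind="parse", n_consensus=4)
good_errors = good.validate()
bad_errors = bad.validate()
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a single pass.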
Run the experiment
Trigger consensus runs against the current draft block config. This is async; the SDK returns a run resource that you inspect through the experiment runs API. The run uses the experiment's n_consensus and document set.
Inspect the run and results
Poll the run until lifecycle.status is terminal, then read the per-document
results produced by that run.
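
The polling step can be sketched as a small helper that is independent of the SDK. Everything here is an assumption for illustration: `fetch_run` stands in for whatever call returns the run resource, the run is modeled as a dict, and the terminal status names are guesses rather than documented values.

```python
import time

# Assumed terminal status names; check the API reference for the real ones.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_run(fetch_run, run_id, interval=2.0, timeout=600.0):
    """Call `fetch_run(run_id)` until lifecycle.status is terminal or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        run = fetch_run(run_id)
        if run["lifecycle"]["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout} seconds")
        time.sleep(interval)


# A fake fetcher that simulates a run moving through three states.
_statuses = iter(["pending", "running", "completed"])


def fake_fetch(run_id):
    return {"id": run_id, "lifecycle": {"status": next(_statuses)}}


final_run = wait_for_run(fake_fetch, "run_123", interval=0.0)
```

Passing the fetch function in as an argument keeps the loop testable without network access; in real use it would wrap the SDK call that retrieves the run.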
Read metrics
Metrics live under the experiment run with four views. Start at summary, drill
into by_target on low-scoring fields, then into votes to see voter
disagreement on a specific cell. These are experiment-run metric views under
/v1/workflows/experiments/metrics?run_id={run_id}.
Metric responses use kind as the shared discriminator across successful views
and data-state payloads. Successful payloads also echo view for compatibility;
stale or missing metrics return kind: "stale_metrics" or kind: "no_metrics".

When metrics are stale, runs.metrics.get(...) returns a kind: "stale_metrics"
data-state envelope. Call runs.create(...) to recompute, then inspect the new
run’s metrics.
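
Branching on the kind discriminator can be sketched as follows. The payloads are modeled as plain dicts, and the function name and return labels are hypothetical; only the kind values "stale_metrics" and "no_metrics" come from this page. Any other kind is treated as a successful view, which is an assumption.

```python
def next_action(payload):
    """Decide what to do with a metrics response based on its `kind`."""
    kind = payload.get("kind")
    if kind == "stale_metrics":
        # The block or document set changed since the run: recompute.
        return "rerun"
    if kind == "no_metrics":
        # Nothing has been computed for this run yet.
        return "wait_or_run"
    # Assumed: any other kind is a successful metric view.
    return "read"


# Hypothetical example payloads.
stale = {"kind": "stale_metrics"}
missing = {"kind": "no_metrics"}
summary = {"kind": "summary", "view": "summary"}
actions = [next_action(p) for p in (stale, missing, summary)]
```

Checking kind before reading any score fields avoids treating a data-state envelope as if it were a metric view.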
Update / delete
For the remaining methods, including the async client
(AsyncRetab.workflows.experiments), see the API reference.