This page covers target, source, assertion, the
run-record shape, and the seven values of the run-record status enum. For
the conceptual mental model and the dashboard workflow, see
Tests.
What’s in a test
A WorkflowTest is a Pydantic model with three meaningful sections:
target — what the test runs against
A discriminated union by type:
| type | Fields | Meaning |
|---|---|---|
| block | block_id | Run the test against a single block in the workflow. |
block is the only variant today. The shape is a discriminated union so
workflow-level targets (e.g. { type: "workflow" } running every block end-to-end)
can be added later without renaming the field at every callsite.
source — where the inputs come from
Also a discriminated union by type:
| type | Fields | Meaning |
|---|---|---|
| manual | handle_inputs: { [handle_id]: HandleInput } | Hand-written inputs. Use for synthetic test cases. |
| run_step | run_id: str, step_id?: str | Replay the inputs the block actually received during a previous workflow run. step_id is required for blocks executed inside a for_each (each iteration is its own step). |
For run_step, Retab snapshots the inputs at create
time — subsequent edits to the source workflow run don’t affect the test.
File handles in the snapshot are materialized as durable Retab file refs so
the test still runs months later even if the original upload session is gone.
assertion — required, one per test
assertion.target always names a declared output handle (output_handle_id)
and an optional dotted path inside that handle’s payload. See the operator
catalog below.
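As a rough illustration of how the three sections fit together, the sketch below assembles a test definition as a plain Python dict. The names target, source, assertion, block_id, handle_inputs, output_handle_id, condition.kind, and expected come from this page. The exact key names for the dotted path and the expected value, the HandleInput payload shape, and every ID shown are assumptions made for illustration, not the confirmed request schema.

```python
import json


def build_manual_test(block_id, handle_inputs, output_handle_id, path, expected):
    """Assemble a test definition: block target, manual source, equals assertion."""
    return {
        # target: discriminated by "type"; "block" is the only variant today.
        "target": {"type": "block", "block_id": block_id},
        # source: "manual" carries hand-written inputs keyed by handle ID.
        "source": {"type": "manual", "handle_inputs": handle_inputs},
        # assertion: one per test. The key names "path" and "expected" are assumed.
        "assertion": {
            "target": {"output_handle_id": output_handle_id, "path": path},
            "condition": {"kind": "equals", "expected": expected},
        },
    }


test_definition = build_manual_test(
    block_id="blk_extract_invoice",  # hypothetical block ID
    # The HandleInput shape below is an assumption, not taken from this page.
    handle_inputs={"in_document": {"type": "json", "data": {"text": "Total: 42.00"}}},
    output_handle_id="out_json",  # hypothetical output handle ID
    path="invoice.total",
    expected=42.0,
)
print(json.dumps(test_definition, indent=2))
```

Keeping target and source as nested objects with their own type field mirrors the discriminated-union design described above: a new variant adds a new type value instead of a new top-level field.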
Available condition.kind values
| Kind | Use | Notes |
|---|---|---|
| exists / not_exists | Path is/isn’t present | Treats missing keys, out-of-bounds indices, and traversals through null as “not there”; does NOT block. |
| equals | Strict deep equality | Does NOT conflate True == 1 or False == 0. Strips reasoning___* keys before comparing (LLM extractor sidecars). |
| compare | <, <=, >, >= for numbers | |
| contains | Substring on strings; element on lists | If expected is a dict, use array_contains instead. |
| array_contains | Subset matching for list-of-dicts | |
| object_contains | Subset matching for dicts | Same reasoning___* strip as equals. |
| length_compare | Length operator on strings / lists / dicts | |
| matches_regex | re.fullmatch | Use .*foo.* for substring search. |
| validates_json_schema | Validate a subtree against a JSON Schema | |
| all_items_match | Every list item matches the inner condition | Passes vacuously on empty arrays. |
| any_item_matches | Some list item matches | Fails on empty arrays. |
| similarity_gte | Embedding-similarity threshold | Async; can produce error status. |
| llm_judged_as / llm_not_judged_as | LLM rubric check | Async; can produce error status. |
| split_iou_gte | IOU threshold for split manifests | |
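Several notes in the table describe behavior that is easy to misread, so here is a minimal Python sketch of those semantics: strict equality that keeps booleans apart from numbers, the reasoning___* strip, full-match regexes, and the empty-array behavior of the item operators. This is not Retab's implementation. All function names are hypothetical, and stripping the sidecar keys from both sides of the comparison is an assumption.

```python
import re


def strip_reasoning(value):
    """Recursively drop dict keys that start with 'reasoning___'."""
    if isinstance(value, dict):
        return {k: strip_reasoning(v) for k, v in value.items()
                if not k.startswith("reasoning___")}
    if isinstance(value, list):
        return [strip_reasoning(v) for v in value]
    return value


def strict_equals(actual, expected):
    """Deep equality where a bool only ever equals another bool."""
    if isinstance(actual, bool) or isinstance(expected, bool):
        return (isinstance(actual, bool) and isinstance(expected, bool)
                and actual == expected)
    if isinstance(actual, dict) and isinstance(expected, dict):
        return actual.keys() == expected.keys() and all(
            strict_equals(actual[k], expected[k]) for k in actual)
    if isinstance(actual, list) and isinstance(expected, list):
        return len(actual) == len(expected) and all(
            strict_equals(a, e) for a, e in zip(actual, expected))
    # Plain Python equality for everything else (how 1 vs 1.0 is treated
    # by the real operator is not stated on this page).
    return actual == expected


def equals(actual, expected):
    # Assumption: sidecar keys are stripped on both sides before comparing.
    return strict_equals(strip_reasoning(actual), strip_reasoning(expected))


def matches_regex(pattern, value):
    """Full-string match, so a bare 'foo' does not match 'a foo b'."""
    return re.fullmatch(pattern, value) is not None


def all_items_match(items, predicate):
    return all(predicate(item) for item in items)  # True for an empty list


def any_item_matches(items, predicate):
    return any(predicate(item) for item in items)  # False for an empty list


print(equals({"total": 1, "reasoning___total": "summed lines"}, {"total": 1}))
print(equals({"paid": True}, {"paid": 1}))
print(matches_regex("foo", "a foo b"), matches_regex(".*foo.*", "a foo b"))
print(all_items_match([], bool), any_item_matches([], bool))
```

The practical consequence of the last line: an all_items_match assertion on a list that is unexpectedly empty still passes, so pairing it with a length_compare assertion in a second test may be worth considering.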
Two distinct status enums — don’t confuse them
A test surfaces TWO status fields. Most surprises with the API trace back to mixing them up.
assertion_result.status (4 values)
The outcome of evaluating ONE assertion against the block’s output.
| Status | Meaning |
|---|---|
| passed | The operator was evaluated and matched. |
| failed | The operator was evaluated and did not match. The result includes actual_value and a failure.code / failure.message. |
| blocked | The assertion couldn’t be evaluated. The block replay failed, the output handle isn’t declared, or the path hit a type error / bad selector. The failure includes details.partial_path / details.partial_value pointing at the deepest valid prefix. |
| error | An async operator (similarity_gte, llm_judged_as) couldn’t run for environmental reasons (e.g. embedding service unreachable). |
Run-record status (7 values)
The status of a TEST RUN: it aggregates the assertion result with execution-side state. This is what appears on WorkflowTestResult.status,
latest_run_summary.status, and per-test result rows returned from
/v1/workflows/tests/results?run_id={run_id}.
| Status | Meaning |
|---|---|
| queued | The run has been scheduled but the worker hasn’t picked it up yet. |
| running | The worker is mid-execution. |
| passed | Execution finished and the assertion passed. |
| failed | Execution finished and the assertion failed. |
| blocked | Execution finished but the assertion was blocked (see above). Counted distinctly from failed because the user typically needs to fix the test definition or the block’s outputs, not the assertion expectation. |
| error | Execution itself failed (block raised, replay timed out, etc.) before any assertion could be evaluated. |
| cancelled | The user or a downstream system cancelled the run. |
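One way to keep the two enums from being mixed up in client code is to model them as separate types. The sketch below does that with Python's enum module. The status values are copied from the two tables above. The class names, the is_terminal helper, and the split into terminal and non-terminal values (inferred from the table meanings) are assumptions.

```python
from enum import Enum


class AssertionStatus(str, Enum):
    """assertion_result.status: outcome of evaluating one assertion (4 values)."""
    PASSED = "passed"
    FAILED = "failed"
    BLOCKED = "blocked"
    ERROR = "error"


class RunRecordStatus(str, Enum):
    """Run-record status: assertion outcome plus execution-side state (7 values)."""
    QUEUED = "queued"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    BLOCKED = "blocked"
    ERROR = "error"
    CANCELLED = "cancelled"


# Assumption: queued and running are the only states that can still change.
NON_TERMINAL = {RunRecordStatus.QUEUED, RunRecordStatus.RUNNING}


def is_terminal(status):
    """Return True once a run record can no longer change state."""
    return RunRecordStatus(status) not in NON_TERMINAL


print(len(AssertionStatus), len(RunRecordStatus))
print(is_terminal("running"), is_terminal("blocked"))
```

Note that error appears in both enums with different meanings: on the assertion result it signals that an async operator could not run, while on the run record it signals that execution failed before any assertion was evaluated.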
id. Expect transient pending /
running parent-run lifecycle states before terminal counts and per-test
results are available.
Workflow-test execution uses the returned run id for polling, cancellation,
and result inspection.
Run records
A WorkflowTestResult is the immutable snapshot of one execution. The
fields most consumers care about:
outputs (renamed from handle_outputs)
Run records before the May 2026 API rewrite stored output: Any (the raw block
return blob) and handle_outputs: { [handle_id]: any } (per-handle outputs).
The new shape collapses these into a single field:
handle_outputs → outputs on first server
startup, so reads through this field always work. The legacy output and
handle_outputs fields are intentionally kept on legacy docs for forensic
debugging — drop them via a later cleanup migration once nothing reads them.
Fingerprints
Four deterministic hashes are pinned per run record: three component fingerprints and one combined hash.
| Fingerprint | Computed from | Used for |
|---|---|---|
| handle_inputs_fingerprint | The captured handle inputs | Detecting “we already ran this exact input”; drives the cache hit at runner start. |
| workflow_draft_fingerprint | The full workflow draft DAG | Telling you whether the run was against the current draft or a stale one. |
| block_config_fingerprint | The single block’s resolved config | Same as above, scoped to just the tested block; gives finer-grained staleness signals. |
| execution_fingerprint | A combined hash of the three fingerprints above | Cache key for “this exact (inputs, draft, block) combination”; re-runs that match all three return the cached record. |
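To make "deterministic hash" and "combined hash" concrete, here is a minimal sketch of one way such fingerprints could be computed. This page does not specify the hashing algorithm or canonicalization Retab uses, so the choice of SHA-256 over canonical JSON, the function names, and the sample payloads are all assumptions.

```python
import hashlib
import json


def fingerprint(value):
    """Hash a JSON-serializable value so that dict key order does not matter."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def execution_fingerprint(handle_inputs, workflow_draft, block_config):
    """Combine the three component fingerprints into one cache key (assumed scheme)."""
    parts = [
        fingerprint(handle_inputs),
        fingerprint(workflow_draft),
        fingerprint(block_config),
    ]
    return fingerprint(parts)


# Hypothetical payloads, for illustration only.
inputs = {"in_document": {"text": "Total: 42.00"}}
draft = {"blocks": ["blk_extract_invoice"], "edges": []}
config = {"model": "example-model", "temperature": 0}

first = execution_fingerprint(inputs, draft, config)
same = execution_fingerprint(dict(inputs), dict(draft), dict(config))
changed = execution_fingerprint(inputs, draft, {**config, "temperature": 1})
print(first == same, first == changed)
```

Under this scheme a change to any one of the three components changes the combined key, which matches the cache behavior described in the table: only re-runs that match all three return the cached record.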
Schema drift and staleness
When the workflow draft changes (schema edited, block config tweaked), tests captured against the old draft get a schema_drift status other than none:
| schema_drift | Meaning |
|---|---|
| none | The captured assertion target still resolves to the same subtree shape in the current draft. |
| partial | Some assertion paths still resolve, others don’t. The test will likely produce a blocked result. |
| drifted | The output handle or its schema is no longer compatible. Re-capture before running. |
| unknown | Drift couldn’t be determined (e.g. block missing, fingerprint absent on a legacy doc). |
GET response always reflects the current
draft, not the draft at create time.
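A client that runs many tests may want to triage them by drift status before scheduling a run. The sketch below groups tests using the four schema_drift values from the table above. It assumes each test object returned by the API exposes schema_drift as a top-level field. That field location, the action labels, and the sample test IDs are hypothetical.

```python
# Suggested handling per drift value, paraphrasing the table above.
DRIFT_ACTIONS = {
    "none": "run",
    "partial": "run_expect_blocked",
    "drifted": "recapture_first",
    "unknown": "inspect_manually",
}


def triage_by_drift(tests):
    """Group test IDs by the action their schema_drift value suggests."""
    groups = {action: [] for action in DRIFT_ACTIONS.values()}
    for test in tests:
        drift = test.get("schema_drift", "unknown")  # a missing field is treated as unknown
        if drift not in DRIFT_ACTIONS:
            raise ValueError(f"unexpected schema_drift value: {drift!r}")
        groups[DRIFT_ACTIONS[drift]].append(test["id"])
    return groups


tests = [
    {"id": "test_a", "schema_drift": "none"},
    {"id": "test_b", "schema_drift": "drifted"},
    {"id": "test_c", "schema_drift": "partial"},
    {"id": "test_d"},
]
print(triage_by_drift(tests))
```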
Async execution
Test execution is asynchronous. The flow:
- POST /v1/workflows/tests/runs with workflow_id in the request body returns a run object immediately.
- Poll GET /v1/workflows/tests/runs/{run_id} until lifecycle.status is completed, error, or cancelled, then fetch results from GET /v1/workflows/tests/results?run_id={run_id}.
- Workflow-test run results shape:
  - counts: one bucket per run-record status (7 fields)
  - data[]: one result per test, keyed by test_id within the parent run.
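The polling step in the flow above can be sketched as a small loop. To stay independent of any particular HTTP client, base URL, or auth scheme (none of which are specified here), the sketch takes a caller-supplied fetch_run callable that is assumed to wrap GET /v1/workflows/tests/runs/{run_id} and return the run object as a dict. The function name, the default interval and timeout, and the stand-in fetcher are hypothetical.

```python
import time

# Terminal parent-run lifecycle states named in the flow above.
TERMINAL_LIFECYCLE = {"completed", "error", "cancelled"}


def wait_for_run(fetch_run, run_id, interval_s=2.0, timeout_s=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until lifecycle.status is terminal, then return the run object."""
    deadline = clock() + timeout_s
    while True:
        run = fetch_run(run_id)
        status = run["lifecycle"]["status"]
        if status in TERMINAL_LIFECYCLE:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real HTTP call: walks through pending -> running -> completed.
_states = iter(["pending", "running", "completed"])


def fake_fetch_run(run_id):
    return {"id": run_id, "lifecycle": {"status": next(_states)}}


final_run = wait_for_run(fake_fetch_run, "run_123", interval_s=0)
print(final_run["lifecycle"]["status"])
```

Once this returns, the caller would fetch GET /v1/workflows/tests/results?run_id={run_id} for counts and data[]. Note that a completed parent run can still contain individual tests with failed, blocked, or error statuses, so the per-test results are what determine overall success.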
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /v1/workflows/tests | Create |
| GET | /v1/workflows/tests?workflow_id={workflow_id} | List |
| GET | /v1/workflows/tests/{test_id} | Get |
| PATCH | /v1/workflows/tests/{test_id} | Update |
| DELETE | /v1/workflows/tests/{test_id} | Delete |
| POST | /v1/workflows/tests/runs | Create Run |
| GET | /v1/workflows/tests/runs | List Runs |
| GET | /v1/workflows/tests/runs/{run_id} | Get Run |
| POST | /v1/workflows/tests/runs/{run_id}/cancel | Cancel Run |
| GET | /v1/workflows/tests/results?run_id={run_id} | List Results |
MCP
Every endpoint above is also exposed as an MCP tool (workflows_tests_create,
workflows_tests_list, workflows_tests_get, workflows_tests_update,
workflows_tests_delete, workflows_tests_runs_create,
workflows_tests_runs_get, workflows_tests_results_list). The tool
input schemas match the request bodies documented above. The MCP layer
additionally rejects
the pre-rewrite top-level block_id / run_id / step_id / handle_inputs
fields with a per-field migration hint pointing at the new shape (e.g.
block_id → use 'target.block_id').
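For callers migrating older payloads, a pre-flight check along the following lines could surface the same kind of hint before a request reaches the MCP layer. Only the block_id hint is quoted from this page. The other three mappings are inferred from the target and source shapes documented above and should be treated as assumptions, as should the function name and message format.

```python
# Legacy top-level field -> assumed location in the new shape.
LEGACY_FIELD_HINTS = {
    "block_id": "target.block_id",            # hint quoted on this page
    "run_id": "source.run_id",                # assumed from the run_step source
    "step_id": "source.step_id",              # assumed from the run_step source
    "handle_inputs": "source.handle_inputs",  # assumed from the manual source
}


def legacy_field_hints(payload):
    """Return one migration hint per pre-rewrite top-level field found."""
    return [
        f"{field} → use '{new_path}'"
        for field, new_path in LEGACY_FIELD_HINTS.items()
        if field in payload
    ]


old_payload = {"block_id": "blk_extract_invoice", "handle_inputs": {}}
new_payload = {"target": {"type": "block", "block_id": "blk_extract_invoice"}}

print(legacy_field_hints(old_payload))
print(legacy_field_hints(new_payload))
```

The check only looks at top-level keys, so a block_id nested under target, as in the new shape, does not trigger a hint.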
See the MCP server page for how to register the tools with a
Claude / OpenAI agent.