Datasets & experiments
Datasets scale evaluation from a single trace to a whole test set. Build a dataset by seeding items from real traces or adding them by hand, then run it against a model. Each item is replayed and, optionally, scored by the judge, and each run reports aggregate cost, latency, and score.
Manage everything under Datasets in the dashboard, or use the API.
Build a dataset
# Create a dataset
DS=$(curl -s -X POST localhost:3100/api/datasets \
  -d '{"name":"regressions"}' -H 'content-type: application/json' | jq -r .dataset.id)

# Seed an item from a real trace's captured prompt
curl -X POST localhost:3100/api/datasets/$DS/items/from-trace \
  -H 'content-type: application/json' -d '{"traceId":"<id>"}'

# ...or add items by hand
curl -X POST localhost:3100/api/datasets/$DS/items \
  -H 'content-type: application/json' \
  -d '{"input":{"messages":[{"role":"user","content":"2+2?"}]},"expected":"4"}'

Items are { input, expected? }; add many at once with { items: [...] }.
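The bulk form mentioned above might look like the sketch below. It assumes the { items: [...] } body is posted to the same /items endpoint as a single item, which the text implies but does not state outright. The add_items helper and the second example item are made up for illustration.

```shell
# Bulk body: two hand-written items wrapped in { items: [...] }.
# The second item is illustrative sample data, not from a real trace.
BULK_BODY='{"items":[
  {"input":{"messages":[{"role":"user","content":"2+2?"}]},"expected":"4"},
  {"input":{"messages":[{"role":"user","content":"Capital of France?"}]},"expected":"Paris"}
]}'

# Hypothetical helper: POST the bulk body to dataset $1.
# Assumption: bulk adds go to the same /items endpoint as single adds.
add_items() {
  curl -s -X POST "localhost:3100/api/datasets/$1/items" \
    -H 'content-type: application/json' \
    -d "$BULK_BODY"
}
```

Call it as add_items "$DS" once DS holds a dataset id.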
Run the set
# Run the whole set against a model, judging each output
curl -X POST localhost:3100/api/datasets/$DS/run \
  -H 'content-type: application/json' \
  -d '{"model":"gpt-4o-mini","judge":true,"metric":"correctness"}'

The run body is { model, judge?, metric? }. A run requires a provider key in memory (see Replay). Each item is replayed against the model and, if judge is true, scored by the LLM judge.
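Because judge and metric are optional, a small wrapper can keep the two body shapes straight. This is a rough sketch, not part of the tool: run_body and start_run are hypothetical names, and for simplicity the judge is switched on only when a metric is supplied. Values are interpolated without JSON escaping, so it suits plain model and metric names only.

```shell
# Build a run body.
# $1 = model (required); $2 = metric (optional; when given, judge is set to true).
run_body() {
  if [ -n "${2:-}" ]; then
    printf '{"model":"%s","judge":true,"metric":"%s"}' "$1" "$2"
  else
    printf '{"model":"%s"}' "$1"
  fi
}

# Hypothetical helper: run dataset $1 against model $2, optionally judged on metric $3.
start_run() {
  curl -s -X POST "localhost:3100/api/datasets/$1/run" \
    -H 'content-type: application/json' \
    -d "$(run_body "$2" "${3:-}")"
}
```

For example, start_run "$DS" gpt-4o-mini correctness sends the same body as the curl command above, while start_run "$DS" gpt-4o-mini replays the items without judging.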
Inspect results
# List datasets, or a dataset's items + runs
curl localhost:3100/api/datasets
curl localhost:3100/api/datasets/$DS

# A single run's summary + per-item results
curl localhost:3100/api/runs/<runId>

Each run aggregates cost, latency and score across items, so you can compare two models, or the same model over time, on the same test set. The dashboard renders runs as A/B comparisons.
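Outside the dashboard, a comparison between two runs can be reduced to a simple regression check on their aggregate scores. The sketch below assumes you have already read the two scores off the run summaries; the response field names are not documented here, so the helper takes plain numbers. compare_scores is a hypothetical name.

```shell
# compare_scores BASELINE CANDIDATE [TOLERANCE]
# Prints the score delta (candidate minus baseline) and returns non-zero
# when the candidate is worse than the baseline by more than TOLERANCE.
compare_scores() {
  awk -v a="$1" -v b="$2" -v tol="${3:-0}" 'BEGIN {
    d = b - a
    printf "delta=%+.3f\n", d
    exit ((d < -tol) ? 1 : 0)
  }'
}
```

With illustrative numbers, compare_scores 0.80 0.85 succeeds, and compare_scores 0.90 0.70 0.05 fails, which makes the helper usable as a gate in a script that runs the same dataset over time.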