DatasetSpec, and let the pipeline bind records to scenario risk types. This page explains how datasets become probes and how provenance is tracked across the full evaluation.
Two evaluation modes
Promptbeat supports two ways to use a dataset inside an evaluation run.
Direct mode
Dataset entries are used as-is as test cases. Each record becomes a test input without any generation step. Use this mode for fast smoke tests and refusal regression checks where you want exact, reproducible prompts.
Dataset-steered generation
Dataset entries become seeds that a generator model expands into scenario-specific probes. The raw prompt is treated as an example or framing hint, not a final input. Use this mode for stronger red-team exploration that reaches target-specific surfaces.
| Mode | What enters eval | Strength | Weakness |
|---|---|---|---|
| Direct dataset eval | Raw dataset prompt as test input | Fast, reproducible, good smoke test | May only test direct refusal, not agent behavior |
| Dataset-steered generation | Dataset prompt becomes a seed for generated probes | Explores variants and target-specific surfaces | Needs stronger judge and trace evidence |
| Dataset + scenario fixtures | Dataset seed combined with repo files, DOM pages, tickets, or cloud fixtures | Best for real agents | Requires adapter work and artifact management |
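The difference between the first two modes can be sketched in a few lines of Python. This is an illustration only: the `Seed` shape, the function names, and the wording of the generator instruction are all assumptions, not Promptbeat's actual API. The dict returned in direct mode is only loosely shaped like a Promptfoo test case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """Hypothetical seed shape; the real Seed type may carry more fields."""
    dataset: str
    record_id: str
    prompt: str


def direct_test_case(seed: Seed) -> dict:
    """Direct mode: the raw dataset prompt is the test input, unchanged."""
    return {
        "vars": {"prompt": seed.prompt},
        "metadata": {"dataset": seed.dataset, "recordId": seed.record_id},
    }


def steering_request(seed: Seed, scenario: str) -> str:
    """Steered mode: the raw prompt is only an example handed to a generator model."""
    return (
        f"Scenario: {scenario}\n"
        f"Example seed (framing hint, not a final input): {seed.prompt}\n"
        "Write one probe adapted to this scenario."
    )


seed = Seed(dataset="harmbench", record_id="r1", prompt="placeholder prompt")
case = direct_test_case(seed)
request = steering_request(seed, scenario="customer-support agent with ticket access")
```

In direct mode the prompt reaches the target verbatim, which is what makes runs reproducible. In steered mode the same prompt is embedded in an instruction to the generator, so the final probe depends on the generator model's output.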
Evaluation pipeline
Every dataset-driven run follows the same pipeline, regardless of which mode you choose:
- Download raw dataset files into datasets/raw.
- Define a DatasetSpec that maps prompt, ID, and category fields.
- Load records with DatasetSeedLoader; each record becomes a typed Seed.
- Apply a DatasetRiskMapping to bind seeds to scenario risk types.
- Generate or directly evaluate Promptfoo test cases.
- Report by dataset, category, risk type, plugin, and provider.
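The middle steps of the pipeline above (spec, load, risk mapping) can be sketched as follows. The class names echo the ones on this page, but every field name, function signature, column name, and risk type ID here is a hypothetical stand-in rather than Promptbeat's real interface. The raw file is an inline CSV string so the example runs without any downloads.

```python
import csv
import io
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical spec: names the columns holding prompt, ID, and category."""
    name: str
    prompt_field: str
    id_field: str
    category_field: str


@dataclass(frozen=True)
class Seed:
    """Hypothetical typed seed built from one raw record."""
    dataset: str
    record_id: str
    category: str
    prompt: str
    risk_type: Optional[str] = None


def load_seeds(spec: DatasetSpec, raw_csv: str) -> list[Seed]:
    """Turn each CSV row into a Seed using the field names in the spec."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [
        Seed(spec.name, row[spec.id_field], row[spec.category_field], row[spec.prompt_field])
        for row in reader
    ]


def apply_risk_mapping(seeds: list[Seed], mapping: dict, default: str = "unmapped") -> list[Seed]:
    """Bind each seed to a risk type based on its source category."""
    return [replace(seed, risk_type=mapping.get(seed.category, default)) for seed in seeds]


# Stand-in for a file under datasets/raw; column names are assumptions.
RAW = "id,category,text\nr1,cyber,placeholder prompt one\nr2,chem,placeholder prompt two\n"

spec = DatasetSpec(name="harmbench", prompt_field="text", id_field="id", category_field="category")
seeds = apply_risk_mapping(load_seeds(spec, RAW), {"cyber": "cyber-misuse"})
```

Keeping the column names in the spec rather than in the loader means a new dataset only needs a new spec, and categories without a mapping entry stay visible as `unmapped` instead of being silently dropped.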
Dataset subscriptions
Instead of configuring each dataset individually per run, you define a subscription YAML that groups named sources into reusable baselines. Promptbeat loads these at startup and resolves the seed pool before generation or direct eval begins. Below is the complete subscriptions/safety-baseline.yaml that ships with Promptbeat:
subscriptions/safety-baseline.yaml
| Field | Purpose |
|---|---|
| name | Matches the dataset’s local name in the catalog (e.g. harmbench, jbb_behaviors) |
| limit | Maximum number of records to draw from this source per run |
| risk_type | The Promptbeat risk type ID to assign seeds from this source |
| categories | Optional list of category values to filter records before loading |
| lang | Optional language tag to scope multilingual datasets |
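To make the field semantics concrete, here is a rough sketch of how a single subscription entry might be resolved against already-loaded records. The entry is written as a Python dict rather than YAML, and `resolve_source`, the record shape, and the risk type ID are assumptions made for illustration. In particular, applying the filters before the limit is an assumption about order of operations.

```python
def resolve_source(entry: dict, records: list[dict]) -> list[dict]:
    """Filter records by categories and lang, cap at limit, and tag with risk_type."""
    categories = entry.get("categories")  # optional filter
    lang = entry.get("lang")              # optional filter
    limit = entry.get("limit")
    selected = []
    for record in records:
        if limit is not None and len(selected) >= limit:
            break
        if categories and record["category"] not in categories:
            continue
        if lang and record.get("lang") != lang:
            continue
        selected.append({**record, "risk_type": entry["risk_type"]})
    return selected


# Hypothetical entry and records; the values are placeholders.
entry = {"name": "harmbench", "limit": 2, "risk_type": "cyber-misuse", "categories": ["cyber"]}
records = [
    {"id": "r1", "category": "cyber", "prompt": "placeholder one"},
    {"id": "r2", "category": "chem", "prompt": "placeholder two"},
    {"id": "r3", "category": "cyber", "prompt": "placeholder three"},
    {"id": "r4", "category": "cyber", "prompt": "placeholder four"},
]
pool = resolve_source(entry, records)
```

Filtering before capping means `limit` counts only records that match the filters, so a narrow category filter does not shrink the pool below what the source can supply.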
Setting the datasets directory
Promptbeat resolves raw dataset files relative to the directory set by PROMPTBEAT_DATASETS_DIR. Set this before running any dataset-backed evaluation.
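As a rough illustration of what resolving relative to that directory could mean, the sketch below joins a filename onto the value of PROMPTBEAT_DATASETS_DIR and fails loudly when the variable is missing. The helper name, the error behavior, and the choice to join the filename directly onto the directory (rather than onto a subfolder of it) are assumptions, not documented Promptbeat behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def raw_dataset_path(filename: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a raw dataset file against PROMPTBEAT_DATASETS_DIR (hypothetical helper)."""
    env = os.environ if env is None else env
    root = env.get("PROMPTBEAT_DATASETS_DIR")
    if not root:
        # Assumed behavior: refuse to guess a default location.
        raise RuntimeError("PROMPTBEAT_DATASETS_DIR is not set")
    return Path(root) / filename


# Passing an explicit mapping keeps the example independent of the real environment.
path = raw_dataset_path("harmbench.csv", env={"PROMPTBEAT_DATASETS_DIR": "/data/promptbeat"})
```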
Dataset provenance
Promptbeat records the origin of every seed so you can trace any generated case or report row back to its source record. Each seed carries a metadata block that survives through generation, evaluation, and final report output.
- dataset: the local dataset name it came from
- recordId: the original row identifier from the raw file
- category: the source dataset’s category label before risk mapping
- source: the string dataset, which distinguishes dataset seeds from hand-written seeds
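The four provenance fields above can be pictured as a small dict that is copied onto every artifact derived from a seed. The sketch below uses the key names listed on this page. The helper functions and the shape of a report row are hypothetical.

```python
def provenance(dataset: str, record_id: str, category: str) -> dict:
    """Build the metadata block for a dataset-derived seed."""
    return {
        "dataset": dataset,
        "recordId": record_id,
        "category": category,  # source label, before risk mapping
        "source": "dataset",   # marks this seed as dataset-derived, not hand-written
    }


def derive_case(seed_metadata: dict, generated_prompt: str) -> dict:
    """Attach a copy of the seed's metadata to a generated case (assumed shape)."""
    return {"prompt": generated_prompt, "metadata": dict(seed_metadata)}


def trace(report_row: dict) -> tuple:
    """Recover the (dataset, recordId) pair that a report row originated from."""
    meta = report_row["metadata"]
    return meta["dataset"], meta["recordId"]


meta = provenance("jbb_behaviors", "row-17", "placeholder-category")
case = derive_case(meta, "generated placeholder probe")
origin = trace(case)
```

Copying the block instead of sharing one dict means later stages can add their own keys to a case without altering the seed's original record.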
Dataset sources
Promptbeat supports the following dataset families as seed sources:
| Dataset family | What it covers |
|---|---|
| HarmBench | Harmful behavior and refusal smoke tests across chemical/biological, cyber, and other harmful-content categories |
| JailbreakBench (JBB) | Jailbreak behavior prompts and robustness checks for policy-bypass scenarios |
| XSTest | Exaggerated-safety and refusal calibration — benign prompts that overly cautious models refuse incorrectly |
| Forbidden Questions | Direct refusal and policy compliance checks across content-policy categories |
| SimpleSafetyTests | Small, direct safety regression set for lightweight smoke tests and baseline sanity checks |
| ALERT | Safety risk prompts across multiple categories for broad safety scenario coverage |
| ToxicChat | Real user jailbreak attempts and toxic inputs, useful for jailbreak-override scenario seeds |
| JADE-DB | Chinese-language jailbreak and harmful-content coverage for Chinese scenarios and taxonomy mapping |
| BeaverTails | Harmlessness and harmfulness preference data, filtered to unsafe-labeled records |
| OR-Bench | Deception and unsafe persuasion prompts, used with a deception category filter |
| SALAD-Bench | Safety alignment benchmark categories including a misinformation-focused adversarial slice |
| Aegis | Unsafe prompt classification dataset with violated-category labels |
| Aya | Multilingual red-teaming prompts with harm category annotations |
| Do-Not-Answer | Refusal and safety policy categories with risk-area labels |