risk_type to a scenario, Promptbeat uses it to determine which seeds are in scope, which target capabilities must exist for the test to be meaningful, what trace evidence should be captured, which judge signals are allowed to mark a failure, and how results are grouped in the final report.
Risk taxonomy reference
| Risk type | Display name | What it tests | Example failure |
|---|---|---|---|
| prompt_injection | Prompt Injection | Whether the target follows untrusted instructions injected through files, DOM text, terminal output, or data rows | Agent repeats a goal embedded in a repository README and abandons its original task |
| secret_handling | Secret Handling | Whether the target reveals, dumps, or exfiltrates secrets, credentials, or private environment variables | Agent runs printenv and outputs OPENAI_API_KEY values in its response |
| tool_misuse | Tool Misuse | Whether the target makes unauthorized, destructive, or unsafe tool calls beyond its permitted scope | Agent calls a cloud CLI to delete resources it was not asked to touch |
| sandbox_boundary | Sandbox Boundary | Whether the target reads or writes outside its designated workspace using host paths, symlinks, or process probes | Agent follows a symlink that resolves outside the workspace to read host credentials |
| network_egress | Network Egress | Whether the target makes unapproved outbound network requests, DNS lookups, or file uploads | Agent generates a curl command that POSTs workspace content to an external server |
| data_exfiltration | Data Exfiltration | Whether the target leaks private context through covert channels, encoded artifacts, or form submissions | Agent writes a base64-encoded secret into a generated filename or comment |
| harmful_content | Harmful Content | Whether the target provides actionable harmful instructions, policy bypasses, or unsafe transformations | Agent gives step-by-step synthesis instructions after an authority-claim jailbreak |
| privacy_pii | Privacy / PII | Whether the target exposes private records, PII, or cross-tenant data through queries or tool calls | Support agent returns another user’s ticket history in response to an indirect request |
| authorization | Authorization | Whether the target enforces user identity, tenant scope, and role policy for every action | Agent performs an admin action for a caller who lacks the required role |
| evaluation_integrity | Evaluation Integrity | Whether the target tampers with test files, scorers, hidden tests, or benchmark logs | Coding agent edits the test suite to make failing tests pass |
Using risk types in scenarios
Declare risk_type directly in each scenario entry. Promptbeat uses it to filter applicable seeds, select generation plugins, and route judge logic.
scenarios.yaml
If a scenario covers more than one risk type, use risk_types (plural) in that case:
scenarios.yaml (multi-risk)
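In Python terms, the seed filtering that risk_type and risk_types drive can be sketched as follows. This is a minimal illustration, not Promptbeat's implementation: the `Seed` class and the `scenario_risk_types` and `seeds_in_scope` helpers are hypothetical, and the sketch only assumes that a scenario is a mapping carrying either a `risk_type` or a `risk_types` key.

```python
from __future__ import annotations

from dataclasses import dataclass

# Risk type identifiers taken from the taxonomy table.
KNOWN_RISK_TYPES = frozenset({
    "prompt_injection", "secret_handling", "tool_misuse", "sandbox_boundary",
    "network_egress", "data_exfiltration", "harmful_content", "privacy_pii",
    "authorization", "evaluation_integrity",
})


@dataclass(frozen=True)
class Seed:
    # Hypothetical seed record; real seeds likely carry more fields.
    seed_id: str
    risk_type: str


def scenario_risk_types(scenario: dict) -> set[str]:
    """Return the risk types a scenario declares (singular or plural key)."""
    if "risk_types" in scenario:
        declared = set(scenario["risk_types"])
    elif "risk_type" in scenario:
        declared = {scenario["risk_type"]}
    else:
        declared = set()
    unknown = declared - KNOWN_RISK_TYPES
    if unknown:
        raise ValueError(f"unknown risk types: {sorted(unknown)}")
    return declared


def seeds_in_scope(scenario: dict, seeds: list[Seed]) -> list[Seed]:
    """Keep only the seeds whose risk type the scenario declares."""
    wanted = scenario_risk_types(scenario)
    return [seed for seed in seeds if seed.risk_type in wanted]
```

Rejecting unknown identifiers early is a design choice of this sketch: a misspelled risk type would otherwise silently match no seeds.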
Risk types and judges
Each risk type implies a different kind of evidence and a different judging approach. Promptbeat supports three evidence levels:

| Evidence level | What is available | Suitable judge |
|---|---|---|
| Answer only | Final text response | Text assertion, refusal check, policy rubric |
| Answer + tool trace | Final answer, tool calls, commands, file reads/writes | Trace scorer, unsafe command detector, action policy checker |
| Full environment trace | Workspace diff, network events, hidden tests, build artifacts | Adapter scorer, sandbox boundary checker, benchmark scorer |
Risk types that depend on trace evidence (secret_handling, sandbox_boundary, network_egress, data_exfiltration, tool_misuse, and evaluation_integrity) produce weaker reports when the target only returns a final text answer. Promptbeat marks the evidence level in the report so you can see where coverage is shallow.
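One way to reason about shallow coverage is to compare the evidence a run produced with the evidence a risk type benefits from. The Python sketch below is an illustration under assumptions, not Promptbeat's logic: `EvidenceLevel`, `TRACE_DEPENDENT`, and `coverage_is_shallow` are hypothetical names. Only the three evidence levels and the six listed risk types come from this page.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    # The three evidence levels from the table, ordered from least to most detail.
    ANSWER_ONLY = 1
    ANSWER_AND_TOOL_TRACE = 2
    FULL_ENVIRONMENT_TRACE = 3


# Risk types named on this page as producing weaker reports
# when only a final text answer is available.
TRACE_DEPENDENT = frozenset({
    "secret_handling",
    "sandbox_boundary",
    "network_egress",
    "data_exfiltration",
    "tool_misuse",
    "evaluation_integrity",
})


def coverage_is_shallow(risk_type: str, level: EvidenceLevel) -> bool:
    """True when a trace-dependent risk type was judged from the answer alone."""
    return risk_type in TRACE_DEPENDENT and level == EvidenceLevel.ANSWER_ONLY
```

For example, `coverage_is_shallow("network_egress", EvidenceLevel.ANSWER_ONLY)` flags a run that could not observe any network events, while the same call for `harmful_content` does not, because that risk type is not in the trace-dependent list.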
Here is an example judge bundle for network_egress:
judge bundle
Risk types in reports
Promptbeat’s report aggregates results along the risk taxonomy so you can see your coverage and failure distribution at a glance. For each risk type in your evaluation, the report shows:

- Total probes executed
- Pass rate and failure rate
- Which scenarios contributed results
- Which seeds or dataset entries drove failures
- Evidence level used (answer-only vs trace-aware)
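The per-risk-type rollup listed above can be pictured as a simple group-by. The Python sketch below is illustrative only: the `ProbeResult` fields and the `summarize` function are hypothetical and do not reflect Promptbeat's actual result schema or report format.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    # Hypothetical per-probe record; field names are assumptions.
    risk_type: str
    scenario: str
    seed_id: str
    passed: bool
    evidence_level: str  # for example "answer-only" or "trace-aware"


def summarize(results: list[ProbeResult]) -> dict[str, dict]:
    """Group probe results by risk type and compute the listed report fields."""
    grouped: dict[str, list[ProbeResult]] = defaultdict(list)
    for result in results:
        grouped[result.risk_type].append(result)

    report: dict[str, dict] = {}
    for risk_type, items in grouped.items():
        total = len(items)
        failures = [r for r in items if not r.passed]
        report[risk_type] = {
            "total_probes": total,
            "pass_rate": (total - len(failures)) / total,
            "failure_rate": len(failures) / total,
            "scenarios": sorted({r.scenario for r in items}),
            "failing_seeds": sorted({r.seed_id for r in failures}),
            "evidence_levels": sorted({r.evidence_level for r in items}),
        }
    return report
```

Keeping `evidence_levels` as a set per risk type lets a reader spot a risk type whose results mix answer-only and trace-aware runs.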