evaluation_result.json is the stable contract — the raw Promptfoo output in promptfoo_eval_result.json may change between backend versions.
Top-level result JSON
The top-level result object contains summary stats, timing, and the full cases array.
Stable identifier for this evaluation run. Use this ID when referencing the run in issues, pull requests, or comparisons.
Completion status. One of completed, partial, or error.
Number of cases evaluated in this run, including passed, failed, and errored cases.
Fraction of cases that passed, expressed as a decimal between 0 and 1.
Per-risk-type aggregation of pass/fail counts. Keys are risk type strings (e.g., coding-agent:secret-env-read). Values are objects with passed, failed, and pass_rate fields.
Per-provider aggregation of pass/fail counts. Useful when a single run targets multiple providers.
Array of case record objects. See Case record below.
Arbitrary run-level metadata. May include target name, sandbox mode, generator version, and seed pool info.
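The summary fields above can be cross-checked against the cases array. The following is a minimal sketch, not part of Promptbeat: the key names `total_cases`, `pass_rate`, `cases`, and `passed` are assumptions made for illustration, so substitute the names your result file actually uses.

```python
import math


def check_summary(result: dict, tolerance: float = 1e-6) -> list[str]:
    """Return human-readable inconsistencies between summary stats and cases.

    All key names used here are assumed, not confirmed by the schema.
    """
    cases = result.get("cases", [])
    problems = []

    # The case count is described as covering passed, failed, and errored cases.
    total = result.get("total_cases")
    if total is not None and total != len(cases):
        problems.append(f"total_cases={total} but {len(cases)} case records found")

    # The pass rate is described as a fraction between 0 and 1.
    rate = result.get("pass_rate")
    if rate is not None and cases:
        recomputed = sum(1 for c in cases if c.get("passed") is True) / len(cases)
        if not math.isclose(rate, recomputed, abs_tol=tolerance):
            problems.append(f"pass_rate={rate} but recomputed {recomputed:.4f}")

    return problems
```

An empty list means the summary agrees with the case records under these assumptions. Loosen `tolerance` if the stored pass rate is rounded.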
Case record
Each case record makes a single failure reviewable without opening the raw runner output.
UUID that uniquely identifies this case within the run. Use it to link failures to raw logs.
Human-readable scenario name (e.g., coding-agent-network-egress). Connects this case to the scenario definition.
Promptfoo plugin ID or Promptbeat risk type for this case (e.g., coding-agent:secret-env-read).
The probe text sent to the target. This is the exact input the agent received.
The target’s response text. May be truncated in the JSON; see raw logs for full output.
true if all assertions passed; false if any assertion failed.
Array of assertion record objects. See Assertion record below.
Case-level metadata. See Metadata fields below for important keys.
Assertion record
Assertions explain why a case passed or failed. A case can have multiple assertions; all must pass for the case to be marked passed.
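The all-must-pass rule can be sketched as follows. This is an illustration rather than Promptbeat's implementation, and the `passed` and `type` keys on each assertion are assumed names.

```python
def case_passed(assertions: list[dict]) -> bool:
    """A case passes only when every one of its assertions passed.

    Note: an empty assertion list counts as passed here (all() on an empty
    iterable is True). That is a choice made for this sketch.
    """
    return all(a.get("passed") is True for a in assertions)


def failing_assertions(assertions: list[dict]) -> list[dict]:
    """Return the assertions that explain why a case failed."""
    return [a for a in assertions if a.get("passed") is not True]
```

When a case is marked failed, `failing_assertions` narrows review to the assertions that need reading.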
Assertion type string. Namespaced by source (e.g., promptbeat:risk:network_egress, promptfoo:llm-rubric).
Whether this specific assertion passed.
Human-readable explanation from the judge. This is the primary field to read when diagnosing a failure.
Metric name used for aggregation and report display (e.g., CodingAgentNetworkEgressBypass).
Metadata fields
The metadata object on each case record carries provenance and judge context.
| Field | Why it matters |
|---|---|
| pluginId or riskPlugin | Connects a case to an attack family for aggregation |
| dataset / recordId | Preserves seed provenance for dataset-started runs |
| severity | Supports report filtering and prioritization |
| sessionId | Connects the result to runner and agent logs |
| codingAgentScenario | Human-readable scenario name |
| codingAgentSafeBehavior | Expected safe behavior used by the judge |
| codingAgentForbiddenOutcome | Concrete forbidden behavior the judge looks for |
| trace_evidence | Commands, files, network events, tool calls, or records touched during the run |
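Because the attack family may appear under either pluginId or riskPlugin, a small helper can normalize the lookup. This is a sketch: the metadata key names come from the table above, but preferring pluginId when both are present is an assumption, as are the output key names.

```python
from typing import Optional


def attack_family(metadata: dict) -> Optional[str]:
    """Return the attack family, checking pluginId first, then riskPlugin.

    The precedence order is an assumption made for this sketch.
    """
    return metadata.get("pluginId") or metadata.get("riskPlugin")


def provenance(metadata: dict) -> dict:
    """Collect the provenance keys from the table into one flat record.

    Missing keys become None rather than raising, since not every run
    is dataset-started or carries every field.
    """
    return {
        "attack_family": attack_family(metadata),
        "dataset": metadata.get("dataset"),
        "record_id": metadata.get("recordId"),
        "severity": metadata.get("severity"),
        "session_id": metadata.get("sessionId"),
    }
```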
Evidence levels
Promptbeat reports carry one of three evidence levels depending on what trace data is available. Higher evidence levels increase confidence in both pass and fail classifications.
| Level | Evidence available | Report confidence |
|---|---|---|
| Answer only | Final text response, assertion results | Useful smoke test; weaker for tool-using agents where the harmful action may not appear in text |
| Answer + trace | Final text, commands executed, files read/written, tool calls | Strong enough for most agent safety reports |
| Environment-grade | Full trace plus network events, workspace diff, hidden test outcomes, artifact inspection | Best for Target Lab and benchmark environments; required for evidence-grade certifications |
deep_tracing=false, which corresponds to the answer-only to answer+partial-trace range. Increase trace depth for final reports where pass confidence matters.
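As a rough illustration of how the three levels relate to available data, the sketch below maps two hypothetical flags to a level name from the table. The flags `has_trace` and `has_environment_data` are assumptions for this example and are not Promptbeat settings.

```python
def evidence_level(has_trace: bool, has_environment_data: bool) -> str:
    """Map available evidence to one of the three level names in the table.

    Environment-grade is described as the full trace plus environment data,
    so environment data without a trace is conservatively treated as
    answer-only in this sketch.
    """
    if has_trace and has_environment_data:
        return "Environment-grade"
    if has_trace:
        return "Answer + trace"
    return "Answer only"
```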
Aggregation dimensions
Aggregate your results along several dimensions to produce a report that surfaces weak areas clearly.
The direct_failure_vs_unsafe_workaround dimension is particularly important: it separates cases where the agent actually performed a forbidden action from cases where it only suggested how to perform it. Both are failures, but they have different remediation implications.
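Aggregation along an arbitrary dimension can be sketched with a generic grouping function. The output shape reuses the passed, failed, and pass_rate fields described for the per-risk-type aggregation. The `passed` key on each case and the `dimension` callback are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable


def aggregate(cases: list[dict], dimension: Callable[[dict], Hashable]) -> dict:
    """Group cases by dimension(case) and compute pass/fail counts per group."""
    buckets = defaultdict(lambda: {"passed": 0, "failed": 0})
    for case in cases:
        outcome = "passed" if case.get("passed") is True else "failed"
        buckets[dimension(case)][outcome] += 1

    summary = {}
    for key, counts in buckets.items():
        total = counts["passed"] + counts["failed"]  # at least 1 for any bucket
        summary[key] = {**counts, "pass_rate": counts["passed"] / total}
    return summary
```

For example, `aggregate(cases, lambda c: c["metadata"].get("severity"))` would group by the severity metadata key, assuming each case carries a `metadata` object.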
Unsafe workaround classification
Agent failures are not limited to successfully executed harmful actions. A response is unsafe if it provides concrete steps that would perform the forbidden action in a less restricted environment. Promptbeat records these as a distinct classification.
The risk family this workaround falls under.
A short identifier for the workaround pattern (e.g., env_dump_to_file, curl_upload, procfs_read).
The specific command or output that triggered the unsafe workaround classification.
Explanation of why this response is classified as a failure even if the sandbox blocked execution.
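The four fields above could be modeled as a small record type for downstream tooling. This is a sketch only: the attribute names `risk_family`, `pattern`, `evidence`, and `reason` are hypothetical stand-ins, since the actual JSON key names are not shown here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UnsafeWorkaround:
    """One unsafe workaround classification (attribute names are assumed)."""

    risk_family: str  # risk family the workaround falls under
    pattern: str      # short pattern identifier, e.g. "env_dump_to_file"
    evidence: str     # command or output that triggered the classification
    reason: str       # why this counts as a failure even if the sandbox blocked it


def count_patterns(records: list[UnsafeWorkaround]) -> Counter:
    """Count how often each workaround pattern appears across a run."""
    return Counter(r.pattern for r in records)
```

Counting by pattern shows which workaround families recur most often and so deserve remediation first.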