The adapter model
An Inspect adapter exposes six components. Every adapter you connect to Promptbeat, whether Terminal-Bench, Harbor, or a custom benchmark, must implement all six:

| Component | Responsibility |
|---|---|
| Task instruction | The natural-language prompt delivered to the agent. Describes what the agent should accomplish without revealing how success is measured. |
| Workspace | The controlled file or directory environment the agent runs in. Seeded with task-specific files, code, or data before each case. |
| Setup and reset | Hooks that prepare the workspace before a case runs and restore it to a clean baseline afterward. Guarantees reproducibility across cases. |
| Hidden tests | Verifier tests invisible to the agent. Run after the agent completes its task to check correctness, forbidden behavior, or workspace state. |
| Scorer | Reads hidden test output and produces a structured pass/fail judgment with supporting evidence. The scorer result feeds into Promptbeat’s report. |
| Artifact collection | Captures shell transcripts, file diffs, tool call logs, and scorer output. Promptbeat normalizes these into a common trace contract for reporting. |
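
To make the six responsibilities concrete, the table can be read as an interface with one method per component. The following is a minimal sketch and not Promptbeat's or Inspect's actual API: the names `Adapter`, `ScoreResult`, and `missing_components` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class ScoreResult:
    """Structured pass/fail judgment with supporting evidence."""
    passed: bool
    evidence: list[str] = field(default_factory=list)


class Adapter(Protocol):
    """Hypothetical interface mirroring the six components in the table."""

    def task_instruction(self) -> str: ...            # Task instruction
    def workspace(self) -> Path: ...                  # Workspace
    def setup(self) -> None: ...                      # Setup and reset
    def reset(self) -> None: ...                      # Setup and reset
    def run_hidden_tests(self) -> str: ...            # Hidden tests
    def score(self, test_output: str) -> ScoreResult: ...  # Scorer
    def collect_artifacts(self) -> dict[str, str]: ...     # Artifact collection


# Method names an adapter object is expected to provide in this sketch.
REQUIRED = (
    "task_instruction", "workspace", "setup", "reset",
    "run_hidden_tests", "score", "collect_artifacts",
)


def missing_components(adapter: object) -> list[str]:
    """Return the required method names the adapter does not provide."""
    return [name for name in REQUIRED if not callable(getattr(adapter, name, None))]
```

A check like `missing_components(my_adapter)` returning an empty list would indicate that every component has at least a callable placeholder; it says nothing about whether the implementations are correct.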
Terminal-Bench
Terminal-Bench is the first benchmark integrated as a Target Lab adapter. It provides a real terminal environment with programming-assistant tasks, making it a natural fit for evaluating coding agents against instruction-following and safety scenarios.
What Terminal-Bench supplies as an adapter:
- Real terminal environment: the agent runs actual shell commands in a sandboxed workspace.
- Task instructions: structured programming tasks drawn from the Terminal-Bench dataset.
- Hidden test scorers: test suites that verify correctness and detect unsafe behavior without being visible to the agent during execution.
Promptfoo bridge
Promptfoo continues to play a role in Inspect-based evaluations. After the generate stage produces adversarial probes, the Promptfoo bridge passes those cases to the Inspect runner and the Target Lab adapter for execution.
Specifically, Promptfoo handles:
- Generated red-team cases: the adversarial probes from `promptbeat generate` become the task instructions fed into the Inspect environment.
- Coding-agent assertions: Promptfoo assertion types apply to agent outputs before they reach the Inspect scorer.
- Report compatibility: results flow back through Promptfoo’s reporting layer so you get a unified report alongside any non-Inspect cases in the same evaluation run.
- Provider-based targets: for cases that do not need environment control, Promptfoo’s provider bridge handles direct LLM and HTTP targets in the same pipeline.
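
The split between environment-controlled cases and provider-based targets can be pictured as a simple routing step. This is an illustrative sketch only: the `GeneratedCase` shape, the `needs_environment` flag, and the `route` function are assumptions made for the example, not documented Promptbeat or Promptfoo structures.

```python
from dataclasses import dataclass


@dataclass
class GeneratedCase:
    """Hypothetical shape of one adversarial probe from the generate stage."""
    case_id: str
    prompt: str
    needs_environment: bool  # assumed flag, not a documented field


def route(cases: list[GeneratedCase]) -> dict[str, list[dict[str, str]]]:
    """Split cases between the Inspect runner and provider-based targets."""
    routed: dict[str, list[dict[str, str]]] = {"inspect": [], "provider": []}
    for case in cases:
        lane = "inspect" if case.needs_environment else "provider"
        # The probe text is carried over unchanged as the instruction or prompt.
        routed[lane].append({"id": case.case_id, "instruction": case.prompt})
    return routed
```

In this sketch both lanes keep the same record shape, which reflects the idea that results from either path end up in one unified report.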
Writing an adapter
To connect a custom benchmark environment to Promptbeat via Inspect, implement the six adapter components described in the adapter model section above. In practice this means providing:
- Setup script: a script or declarative config that seeds the workspace with the files and services your tasks require.
- Execution interface: the interface through which Inspect launches the agent and captures its interaction with the environment (shell session, tool call log, API trace).
- Hidden test suite: test files or a test runner that Inspect invokes after the agent finishes. These must not be visible to the agent at runtime.
- Scorer implementation: a function or script that reads hidden test results and returns a structured judgment Promptbeat can include in its report.
- Artifact capture config: declarations for which files, logs, and outputs Inspect should collect and forward to Promptbeat.
- Reset hook: a teardown step that returns the workspace to its baseline state before the next case runs.
Run `./bin/promptbeat validate` to confirm the configuration is complete.
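
As a rough illustration of two of the pieces listed above, the sketch below shows one possible reset hook and scorer in plain Python. It assumes the workspace baseline is kept as a directory on disk and that the hidden test runner writes JSON of the form `{"tests": [{"name": ..., "passed": ...}]}`. Both assumptions, along with the function names `reset_workspace` and `score`, are hypothetical and not part of any documented Promptbeat or Inspect contract.

```python
import json
import shutil
from pathlib import Path


def reset_workspace(workspace: Path, baseline: Path) -> None:
    """Restore the workspace to a clean copy of the baseline directory."""
    if workspace.exists():
        shutil.rmtree(workspace)          # drop anything the agent left behind
    shutil.copytree(baseline, workspace)  # re-seed from the known-good state


def score(results_json: str) -> dict:
    """Turn hidden test results into a structured pass/fail judgment.

    Assumed input shape: {"tests": [{"name": str, "passed": bool}, ...]}
    """
    tests = json.loads(results_json)["tests"]
    failed = [t["name"] for t in tests if not t["passed"]]
    return {
        # An empty test list counts as a failure rather than a vacuous pass.
        "passed": bool(tests) and not failed,
        "failed_tests": failed,
        "total": len(tests),
    }
```

Copying from a baseline directory on every reset is slower than undoing individual changes, but it avoids having to track what the agent modified, which is the property the reset hook exists to guarantee.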
Inspect integration is designed for advanced use cases that require real environment control. If you are evaluating an LLM or HTTP agent without filesystem or terminal requirements, start with direct Codex SDK or HTTP agent targets — those cover the majority of safety evaluation scenarios with far less setup.