An agent is not a pure function. The model changes. The prompt changes. Tools change. Memory changes. The environment changes. Even the error class changes.
What to evaluate
- Tool selection and refusal of dangerous actions.
- Privacy of logs and telemetry.
- Tenant isolation.
- Memory retrieval quality and poisoning resistance.
- Prompt-injection resistance.
- Cost, latency, rollback, and traceability.
Eval as living contract
hostile input
known context
available tools
expected action
forbidden action
pass criterion
Every serious bug should become an eval. The objective is not a perfect benchmark; it is to stop the agent from forgetting how it failed.
Security is not a screenshot from Tuesday. It is a film with regressions.