An agent is not a pure function. The model changes. The prompt changes. Tools change. Memory changes. The environment changes. Even the error class changes.

What to evaluate

  • Tool selection and refusal of dangerous actions.
  • Privacy of logs and telemetry.
  • Tenant isolation.
  • Memory retrieval quality and poisoning resistance.
  • Prompt-injection resistance.
  • Cost, latency, rollback, and traceability.

Eval as living contract

hostile input
known context
available tools
expected action
forbidden action
pass criterion

Every serious bug should become an eval. The objective is not a perfect benchmark; it is to stop the agent from forgetting how it failed.

Security is not a screenshot from Tuesday. It is a film with regressions.