Continuous Evals for Changing Agents

An agent is not a pure function. The model changes. The prompt changes. Tools change. Memory changes. The environment changes. Even the error class changes.

What to evaluate

Tool selection and refusal of dangerous actions.
Privacy of logs and telemetry.
Tenant isolation.
Memory retrieval quality and poisoning resistance.
Prompt-injection resistance.
Cost, latency, rollback, and traceability.

Eval as living contract

hostile input
known context
available tools
expected action
forbidden action
pass criterion

Every serious bug should become an eval. The objective is not a perfect benchmark; it is to stop the agent from forgetting how it failed.

Security is not a screenshot from Tuesday. It is a film with regressions.

Reference OpenAI Agents SDK tracing documentation OWASP Top 10 for LLM Applications