# Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using [promptfoo](https://www.promptfoo.dev/).

## Quick Start

```bash
cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval
```

## How It Works

The eval harness tests each specialist agent prompt by:

1. Loading the agent's markdown file as a system prompt
2. Sending it a representative task for its category
3. Using a separate LLM-as-judge to score the output on 5 criteria
4. Reporting pass/fail per agent

### Scoring Criteria

| Criterion | What It Measures |
|---|---|
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow and output format? |
| Identity Consistency | Did it stay in character per its personality and communication style? |
| Deliverable Quality | Is the output well-structured, actionable, and domain-appropriate? |
| Safety | No harmful, biased, or off-topic content |

Each criterion is scored **1-5**. An agent passes if its average score is **>= 3.5**.

### Judge Model

The agent-under-test uses Claude Sonnet. The judge uses Claude Haiku (a different model to avoid self-preference bias).

## Viewing Results

```bash
npx promptfoo view
```

Opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

## Project Structure

```
evals/
  promptfooconfig.yaml      # Main config: providers, test suites, assertions
  rubrics/
    universal.yaml          # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml        # Test tasks for engineering agents
    design.yaml             # Test tasks for design agents
    academic.yaml           # Test tasks for academic agents
  scripts/
    extract-metrics.ts      # Parses agent markdown → structured metrics JSON
```

## Adding Test Cases

Create or edit a file in `tasks/` following this format:

```yaml
- id: unique-task-id
  description: "Short description of what this tests"
  prompt: |
    The actual prompt/task to send to the agent.
    Be specific about what you want the agent to produce.
```

## Extract Metrics Script

Parse agent files to see their structured success metrics:

```bash
npx ts-node scripts/extract-metrics.ts "../engineering/*.md"
```

## Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

- **Agent calls:** ~6 (Claude Sonnet)
- **Judge calls:** ~30 (Claude Haiku)
- **Estimated cost:** < $1 per run
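
The arithmetic in this README (the call counts above and the pass rule under Scoring Criteria) can be sketched in a few lines of TypeScript. This is only an illustration: `estimateCalls`, `passes`, and the two constants are hypothetical names and are not part of the harness or of promptfoo. It assumes exactly one agent call per test case and one judge call per criterion, as described above, and leaves out pricing because per-call costs are not stated here.

```typescript
// Hypothetical helpers for illustration; not part of the eval harness.

const CRITERIA_COUNT = 5;   // one judge call per scoring criterion
const PASS_THRESHOLD = 3.5; // minimum average score for an agent to pass

interface CallEstimate {
  agentCalls: number;
  judgeCalls: number;
  totalCalls: number;
}

// Count model calls for one run: one agent call per test case,
// plus one judge call per criterion for each test case.
function estimateCalls(
  testCases: number,
  criteria: number = CRITERIA_COUNT
): CallEstimate {
  const agentCalls = testCases;
  const judgeCalls = testCases * criteria;
  return { agentCalls, judgeCalls, totalCalls: agentCalls + judgeCalls };
}

// Apply the pass rule: the mean of the 1-5 criterion scores must be >= 3.5.
function passes(scores: number[]): boolean {
  if (scores.length === 0) {
    throw new Error("at least one score is required");
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= PASS_THRESHOLD;
}

// The 6-test-case proof of concept: 6 agent calls, 30 judge calls, 36 total.
console.log(estimateCalls(6));

// Scores of 4, 4, 3, 3, 4 average 3.6, so this agent passes.
console.log(passes([4, 4, 3, 3, 4]));
```

Adding a test case therefore adds six model calls (one agent call plus five judge calls), which is worth keeping in mind as the task files grow.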