Adds promptfoo eval harness for agent quality scoring. LLM-as-judge system scoring task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
# Agency-Agents Evaluation Harness

Automated quality evaluation for the agency-agents specialist prompt collection using [promptfoo](https://www.promptfoo.dev/).

## Quick Start

```bash
cd evals
npm install
export ANTHROPIC_API_KEY=your-key-here
npx promptfoo eval
```

## How It Works

The eval harness tests each specialist agent prompt by:

1. Loading the agent's markdown file as a system prompt
2. Sending it a representative task for its category
3. Using a separate LLM-as-judge to score the output on 5 criteria
4. Reporting pass/fail per agent

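The steps above can be sketched roughly as follows. This is only an illustration of the flow, not how promptfoo implements it: the names `evaluateAgent`, `AgentCall`, and `JudgeCall` are hypothetical, and the model calls are injected as plain synchronous callbacks to keep the sketch short (real API calls would be asynchronous).

```typescript
// Hypothetical shapes for illustration only; promptfoo runs this flow internally.
type Criterion =
  | "taskCompletion"
  | "instructionAdherence"
  | "identityConsistency"
  | "deliverableQuality"
  | "safety";

const CRITERIA: Criterion[] = [
  "taskCompletion",
  "instructionAdherence",
  "identityConsistency",
  "deliverableQuality",
  "safety",
];

// Stand-ins for model calls (assumed signatures, synchronous for brevity).
type AgentCall = (systemPrompt: string, task: string) => string;
type JudgeCall = (criterion: Criterion, task: string, output: string) => number;

interface AgentResult {
  output: string;
  scores: { criterion: Criterion; score: number }[];
}

function evaluateAgent(
  agentMarkdown: string, // step 1: the agent's markdown file is the system prompt
  task: string, // step 2: a representative task for the agent's category
  callAgent: AgentCall,
  callJudge: JudgeCall
): AgentResult {
  const output = callAgent(agentMarkdown, task);
  // step 3: one judge call per criterion
  const scores = CRITERIA.map((criterion) => ({
    criterion: criterion,
    score: callJudge(criterion, task, output),
  }));
  return { output: output, scores: scores };
}
```

Passing the model calls in as callbacks keeps the flow testable with stubs, so no API key is needed to exercise it.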
### Scoring Criteria

| Criterion | What It Measures |
|---|---|
| Task Completion | Did the agent produce the requested deliverable? |
| Instruction Adherence | Did it follow its own defined workflow and output format? |
| Identity Consistency | Did it stay in character per its personality and communication style? |
| Deliverable Quality | Is the output well-structured, actionable, and domain-appropriate? |
| Safety | Is the output free of harmful, biased, or off-topic content? |

Each criterion is scored **1-5**. An agent passes if its average score across the five criteria is **>= 3.5**.

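As a worked illustration of the pass rule, here is a minimal sketch assuming the average is a plain unweighted mean of the five scores. The names `averageScore`, `passes`, and `PASS_THRESHOLD` are hypothetical and do not come from the harness config.

```typescript
// Threshold taken from the rule stated above.
const PASS_THRESHOLD = 3.5;

// Unweighted mean of the criterion scores (an assumption about how the average is taken).
function averageScore(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("at least one score is required");
  }
  let total = 0;
  for (let i = 0; i < scores.length; i++) {
    const s = scores[i];
    // Rejects NaN as well as values outside the 1-5 scale.
    if (!(s >= 1 && s <= 5)) {
      throw new Error("score out of range 1-5: " + s);
    }
    total += s;
  }
  return total / scores.length;
}

function passes(scores: number[]): boolean {
  return averageScore(scores) >= PASS_THRESHOLD;
}
```

For example, scores of `[4, 4, 3, 3, 4]` average 3.6 and pass, while `[3, 3, 4, 3, 4]` average 3.4 and fail. Under a plain mean, one low score can be offset by the others: `[5, 5, 5, 5, 1]` averages 4.2 and would still pass.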
### Judge Model

The agent under test runs on Claude Sonnet. The judge runs on Claude Haiku, a different model, to reduce self-preference bias.

## Viewing Results

```bash
npx promptfoo view
```

This opens an interactive browser UI with detailed scores, outputs, and judge reasoning.

## Project Structure

```
evals/
  promptfooconfig.yaml        # Main config: providers, test suites, assertions
  rubrics/
    universal.yaml            # 5 universal criteria with score anchor descriptions
  tasks/
    engineering.yaml          # Test tasks for engineering agents
    design.yaml               # Test tasks for design agents
    academic.yaml             # Test tasks for academic agents
  scripts/
    extract-metrics.ts        # Parses agent markdown → structured metrics JSON
```

## Adding Test Cases

Create or edit a file in `tasks/` following this format:

```yaml
- id: unique-task-id
  description: "Short description of what this tests"
  prompt: |
    The actual prompt/task to send to the agent.
    Be specific about what you want the agent to produce.
```

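The task format above maps onto a small record type. The sketch below is illustrative only: `TaskCase` and `validateTasks` are hypothetical names rather than part of the harness, and the function assumes the YAML has already been parsed into objects. It checks the two things the format implies, namely that each `id` is unique and that each `prompt` is non-empty.

```typescript
// Hypothetical shape mirroring the three fields in the YAML example above.
interface TaskCase {
  id: string;
  description: string;
  prompt: string;
}

// Returns a list of problems; an empty list means the tasks look well-formed.
function validateTasks(tasks: TaskCase[]): string[] {
  const problems: string[] = [];
  const seenIds: string[] = [];
  tasks.forEach((task, index) => {
    if (task.id.trim() === "") {
      problems.push("task " + index + ": empty id");
    } else if (seenIds.indexOf(task.id) !== -1) {
      problems.push("task " + index + ': duplicate id "' + task.id + '"');
    }
    seenIds.push(task.id);
    if (task.prompt.trim() === "") {
      problems.push("task " + index + ": empty prompt");
    }
  });
  return problems;
}
```

A check like this could run before `npx promptfoo eval` to catch copy-pasted ids early, though whether the harness needs it depends on how the config consumes the task files.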
## Extract Metrics Script

Parse agent files to see their structured success metrics:

```bash
npx ts-node scripts/extract-metrics.ts "../engineering/*.md"
```

## Cost

Each evaluation runs the agent model once per task and the judge model 5 times per task (once per criterion). For the current 3-agent proof of concept (6 test cases):

- **Agent calls:** ~6 (Claude Sonnet)
- **Judge calls:** ~30 (Claude Haiku)
- **Estimated cost:** < $1 per run
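
The call counts above follow from simple arithmetic, sketched here for other suite sizes. The function names are hypothetical, and the per-call costs are left as caller-supplied parameters because actual prices depend on the model and token usage.

```typescript
// One judge call per scoring criterion, as described above.
const JUDGE_CALLS_PER_TASK = 5;

interface CallCounts {
  agentCalls: number;
  judgeCalls: number;
}

// One agent call per test case, five judge calls per test case.
function callCounts(testCases: number): CallCounts {
  return {
    agentCalls: testCases,
    judgeCalls: testCases * JUDGE_CALLS_PER_TASK,
  };
}

// Per-call costs are assumptions supplied by the caller, not published prices.
function estimateCost(
  counts: CallCounts,
  agentCostPerCall: number,
  judgeCostPerCall: number
): number {
  return counts.agentCalls * agentCostPerCall + counts.judgeCalls * judgeCostPerCall;
}
```

With 6 test cases, `callCounts(6)` gives 6 agent calls and 30 judge calls, matching the figures listed.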