feat: add promptfoo eval harness for agent quality scoring (#371)

Adds a promptfoo eval harness for agent quality scoring: an LLM-as-judge system that scores task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
Russell Jones
2026-04-10 22:54:31 -04:00
committed by GitHub
parent 1e73b5be0d
commit b456845e85
11 changed files with 796 additions and 0 deletions
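
The commit description names five judge criteria. As a rough illustration of how per-criterion judge scores might be folded into one number, here is a minimal sketch. It assumes a 1-5 scale and equal weights, neither of which is stated in this commit; the snake_case keys and the `overall_score` helper are hypothetical names, not part of the harness.

```python
from statistics import fmean

# The five criteria named in the commit description.
# The snake_case keys are illustrative, not taken from the harness.
CRITERIA = (
    "task_completion",
    "instruction_adherence",
    "identity_consistency",
    "deliverable_quality",
    "safety",
)


def overall_score(scores: dict[str, float], lo: float = 1.0, hi: float = 5.0) -> float:
    """Return the unweighted mean of the per-criterion scores.

    Assumes every criterion uses the same lo..hi scale (an assumption).
    Raises ValueError if a criterion is missing, unknown, or out of range.
    """
    missing = set(CRITERIA) - set(scores)
    extra = set(scores) - set(CRITERIA)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for name, value in scores.items():
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")
    return fmean(scores[name] for name in CRITERIA)
```

Rejecting incomplete score sets up front keeps a judge response that silently skipped a criterion from inflating or deflating the mean.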

evals/tasks/academic.yaml Normal file

@@ -0,0 +1,29 @@
# Test tasks for academic category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
- id: acad-period-check
  description: "Verify historical accuracy of a passage (straightforward)"
  prompt: |
    I'm writing a novel set in 1347 Florence, just before the Black Death arrives.
    Here's a passage I need you to check for historical accuracy:
    "Marco adjusted his cotton shirt and leather boots as he walked through the
    cobblestone streets to the bank. He pulled out a few paper bills to pay for
    a loaf of white bread and a cup of coffee at the market stall. The church
    bells rang noon as horse-drawn carriages rattled past."
    Please identify any anachronisms and suggest corrections.
- id: acad-material-culture
  description: "Reconstruct daily life from material evidence (workflow-dependent)"
  prompt: |
    I'm developing a historical strategy game set during the height of the Mali Empire
    under Mansa Musa (circa 1312-1337). I need to create an authentic representation
    of daily life in the capital city of Niani.
    What would a typical market day look like? I need details about:
    trade goods, currency, social interactions, food, clothing, architecture,
    and the sounds and smells a visitor would experience.
    Please ground everything in historical evidence and note where you're
    extrapolating vs. working from documented sources.
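
Each task entry above carries an `id`, a `description`, and a `prompt`, and the header comment promises one straightforward and one workflow-dependent task per file. A minimal sketch of a structural check over already-parsed entries follows. `check_task_file` is a hypothetical helper, and relying on the trailing tag in `description` to tell the two kinds apart is an assumption drawn from the entries shown here, not a documented convention.

```python
import re

REQUIRED_FIELDS = ("id", "description", "prompt")
# Matches the trailing "(straightforward)" / "(workflow-dependent)" tag.
KIND_TAG = re.compile(r"\((straightforward|workflow-dependent)\)\s*$")


def check_task_file(tasks: list[dict]) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    kinds = []
    for i, task in enumerate(tasks):
        label = task.get("id", f"task #{i}")
        # Every field must be present and be a non-blank string.
        for field in REQUIRED_FIELDS:
            value = task.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing or empty '{field}'")
        description = task.get("description")
        match = KIND_TAG.search(description) if isinstance(description, str) else None
        if match:
            kinds.append(match.group(1))
        else:
            problems.append(f"{label}: description has no kind tag")
    # The file header promises exactly one task of each kind.
    if sorted(kinds) != ["straightforward", "workflow-dependent"]:
        problems.append(f"expected one task of each kind, found {kinds}")
    return problems
```

The function takes plain dictionaries so that it does not depend on any particular YAML loader.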

evals/tasks/design.yaml Normal file

@@ -0,0 +1,23 @@
# Test tasks for design category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
- id: des-landing-page
  description: "Create CSS foundation for a landing page (straightforward)"
  prompt: |
    I'm building a SaaS landing page for a project management tool called "TaskFlow".
    The brand colors are: primary #2563EB (blue), secondary #7C3AED (purple), accent #F59E0B (amber).
    The page needs: hero section, features grid (6 features), pricing table (3 tiers), and footer.
    Please create the CSS design system foundation and layout structure.
- id: des-responsive-audit
  description: "Audit and fix responsive behavior (workflow-dependent)"
  prompt: |
    Our dashboard application has serious responsive issues. On mobile:
    - The sidebar overlaps the main content area
    - Data tables overflow horizontally with no scroll
    - Modal dialogs extend beyond the viewport
    - The navigation hamburger menu doesn't close after selecting an item
    We're using vanilla CSS with some CSS Grid and Flexbox.
    Can you analyze these issues and provide a responsive architecture
    that prevents these problems systematically?


@@ -0,0 +1,21 @@
# Test tasks for engineering category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
- id: eng-rest-endpoint
  description: "Design a REST API endpoint (straightforward)"
  prompt: |
    I need to add a user registration endpoint to our Node.js Express API.
    It should accept email, password, and display name.
    We use PostgreSQL and need input validation.
    Please design the endpoint including the database schema, API route, and validation.
- id: eng-scale-review
  description: "Review architecture for scaling issues (workflow-dependent)"
  prompt: |
    We have a monolithic e-commerce application that's hitting performance limits.
    Current stack: Node.js, PostgreSQL, Redis for sessions, deployed on a single EC2 instance.
    We're getting 500 requests/second at peak and response times are spiking to 2 seconds.
    Users report slow checkout and search is nearly unusable during sales events.
    Can you analyze the architecture and recommend a scaling strategy?
    We have a 3-month timeline and a small team of 4 developers.
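
The task ids in these files appear to follow a per-category prefix (`acad-`, `des-`, `eng-`). Assuming that pattern is intentional, a small sketch of a cross-file check might look like the following. The `ID_PREFIX` mapping and `check_ids` helper are hypothetical and only cover the three categories visible in this part of the diff.

```python
from collections import Counter

# Prefixes observed in the task files shown in this diff. This mapping is an
# illustration, not harness code; other categories would need their own entry.
ID_PREFIX = {"academic": "acad-", "design": "des-", "engineering": "eng-"}


def check_ids(ids_by_category: dict[str, list[str]]) -> list[str]:
    """Report ids with the wrong category prefix and ids used more than once."""
    problems = []
    for category, ids in ids_by_category.items():
        prefix = ID_PREFIX.get(category)
        if prefix is None:
            problems.append(f"{category}: no known id prefix")
            continue
        for task_id in ids:
            if not task_id.startswith(prefix):
                problems.append(f"{task_id}: expected prefix '{prefix}'")
    # Ids must be unique across all categories, not only within one file.
    counts = Counter(t for ids in ids_by_category.values() for t in ids)
    problems += [f"{t}: duplicate id" for t, n in counts.items() if n > 1]
    return problems
```

Unique ids matter mainly for reporting: if two tasks shared an id, their scores would be hard to tell apart in aggregated results.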