feat: add promptfoo eval harness for agent quality scoring (#371)
Adds a promptfoo eval harness for agent quality scoring: an LLM-as-judge system that scores task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
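As a rough illustration of how the five judged dimensions might be combined into one number, here is a minimal sketch. The snake_case criterion keys, the 1-5 scale, and the unweighted mean are all assumptions made for the example; the commit message names only the criteria, not how the harness aggregates them.

```python
from statistics import mean

# The five criteria named in the commit message. The snake_case keys are
# hypothetical; the harness may label them differently.
CRITERIA = (
    "task_completion",
    "instruction_adherence",
    "identity_consistency",
    "deliverable_quality",
    "safety",
)


def overall_score(scores: dict[str, float], low: float = 1.0, high: float = 5.0) -> float:
    """Return the unweighted mean of the per-criterion judge scores.

    The 1-5 range is an assumed scale, not one taken from the commit.
    """
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not low <= scores[name] <= high:
            raise ValueError(f"{name} score {scores[name]} is outside [{low}, {high}]")
    return mean(scores[name] for name in CRITERIA)
```

An unweighted mean treats safety the same as any other dimension; a real harness might instead gate on safety or weight it more heavily.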
29
evals/tasks/academic.yaml
Normal file
@@ -0,0 +1,29 @@
# Test tasks for academic category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.

- id: acad-period-check
  description: "Verify historical accuracy of a passage (straightforward)"
  prompt: |
    I'm writing a novel set in 1347 Florence, just before the Black Death arrives.
    Here's a passage I need you to check for historical accuracy:

    "Marco adjusted his cotton shirt and leather boots as he walked through the
    cobblestone streets to the bank. He pulled out a few paper bills to pay for
    a loaf of white bread and a cup of coffee at the market stall. The church
    bells rang noon as horse-drawn carriages rattled past."

    Please identify any anachronisms and suggest corrections.

- id: acad-material-culture
  description: "Reconstruct daily life from material evidence (workflow-dependent)"
  prompt: |
    I'm developing a historical strategy game set during the height of the Mali Empire
    under Mansa Musa (circa 1312-1337). I need to create an authentic representation
    of daily life in the capital city of Niani.

    What would a typical market day look like? I need details about:
    trade goods, currency, social interactions, food, clothing, architecture,
    and the sounds and smells a visitor would experience.

    Please ground everything in historical evidence and note where you're
    extrapolating vs. working from documented sources.
23
evals/tasks/design.yaml
Normal file
@@ -0,0 +1,23 @@
# Test tasks for design category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.

- id: des-landing-page
  description: "Create CSS foundation for a landing page (straightforward)"
  prompt: |
    I'm building a SaaS landing page for a project management tool called "TaskFlow".
    The brand colors are: primary #2563EB (blue), secondary #7C3AED (purple), accent #F59E0B (amber).
    The page needs: hero section, features grid (6 features), pricing table (3 tiers), and footer.
    Please create the CSS design system foundation and layout structure.

- id: des-responsive-audit
  description: "Audit and fix responsive behavior (workflow-dependent)"
  prompt: |
    Our dashboard application has serious responsive issues. On mobile:
    - The sidebar overlaps the main content area
    - Data tables overflow horizontally with no scroll
    - Modal dialogs extend beyond the viewport
    - The navigation hamburger menu doesn't close after selecting an item

    We're using vanilla CSS with some CSS Grid and Flexbox.
    Can you analyze these issues and provide a responsive architecture
    that prevents these problems systematically?
21
evals/tasks/engineering.yaml
Normal file
@@ -0,0 +1,21 @@
# Test tasks for engineering category agents.
# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.

- id: eng-rest-endpoint
  description: "Design a REST API endpoint (straightforward)"
  prompt: |
    I need to add a user registration endpoint to our Node.js Express API.
    It should accept email, password, and display name.
    We use PostgreSQL and need input validation.
    Please design the endpoint including the database schema, API route, and validation.

- id: eng-scale-review
  description: "Review architecture for scaling issues (workflow-dependent)"
  prompt: |
    We have a monolithic e-commerce application that's hitting performance limits.
    Current stack: Node.js, PostgreSQL, Redis for sessions, deployed on a single EC2 instance.
    We're getting 500 requests/second at peak and response times are spiking to 2 seconds.
    Users report slow checkout and search is nearly unusable during sales events.

    Can you analyze the architecture and recommend a scaling strategy?
    We have a 3-month timeline and a small team of 4 developers.
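
The three task files above share one shape: a list of entries with `id`, `description`, and `prompt`, where each description ends in a `(straightforward)` or `(workflow-dependent)` tag. A minimal sketch of a sanity check for that shape follows. It operates on already-parsed data (a list of dicts) so that it needs only the standard library; `validate_tasks` is a hypothetical helper, not part of this commit or of promptfoo.

```python
import re

REQUIRED_KEYS = ("id", "description", "prompt")
# Descriptions in the task files end with one of these two difficulty tags.
DIFFICULTY_TAG = re.compile(r"\((straightforward|workflow-dependent)\)$")


def validate_tasks(tasks: list[dict]) -> list[str]:
    """Return human-readable problems found in parsed task entries (empty if none)."""
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        task_id = task.get("id")
        label = task_id if isinstance(task_id, str) else f"task #{index}"
        # Every entry needs non-empty string values for all required keys.
        for key in REQUIRED_KEYS:
            value = task.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing or empty '{key}'")
        # Task ids should be unique within the list.
        if isinstance(task_id, str):
            if task_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(task_id)
        # The description should carry one of the two difficulty tags.
        description = task.get("description")
        if isinstance(description, str) and not DIFFICULTY_TAG.search(description.strip()):
            problems.append(f"{label}: description lacks a difficulty tag")
    return problems
```

In practice the list of dicts would come from a YAML parser run over each file under `evals/tasks/`; which parser the harness uses is not shown in this commit.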
|
||||