feat: add promptfoo eval harness for agent quality scoring (#371)

Adds promptfoo eval harness for agent quality scoring. LLM-as-judge system scoring task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
2026-04-10 21:54:31 -05:00
parent 1e73b5be0d
commit b456845e85
11 changed files with 796 additions and 0 deletions
@@ -0,0 +1,29 @@
+# Test tasks for academic category agents.
+# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
+
+- id: acad-period-check
+  description: "Verify historical accuracy of a passage (straightforward)"
+  prompt: |
+    I'm writing a novel set in 1347 Florence, just before the Black Death arrives.
+    Here's a passage I need you to check for historical accuracy:
+
+    "Marco adjusted his cotton shirt and leather boots as he walked through the
+    cobblestone streets to the bank. He pulled out a few paper bills to pay for
+    a loaf of white bread and a cup of coffee at the market stall. The church
+    bells rang noon as horse-drawn carriages rattled past."
+
+    Please identify any anachronisms and suggest corrections.
+
+- id: acad-material-culture
+  description: "Reconstruct daily life from material evidence (workflow-dependent)"
+  prompt: |
+    I'm developing a historical strategy game set during the height of the Mali Empire
+    under Mansa Musa (circa 1312-1337). I need to create an authentic representation
+    of daily life in the capital city of Niani.
+
+    What would a typical market day look like? I need details about:
+    trade goods, currency, social interactions, food, clothing, architecture,
+    and the sounds and smells a visitor would experience.
+
+    Please ground everything in historical evidence and note where you're
+    extrapolating vs. working from documented sources.
@@ -0,0 +1,23 @@
+# Test tasks for design category agents.
+# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
+
+- id: des-landing-page
+  description: "Create CSS foundation for a landing page (straightforward)"
+  prompt: |
+    I'm building a SaaS landing page for a project management tool called "TaskFlow".
+    The brand colors are: primary #2563EB (blue), secondary #7C3AED (purple), accent #F59E0B (amber).
+    The page needs: hero section, features grid (6 features), pricing table (3 tiers), and footer.
+    Please create the CSS design system foundation and layout structure.
+
+- id: des-responsive-audit
+  description: "Audit and fix responsive behavior (workflow-dependent)"
+  prompt: |
+    Our dashboard application has serious responsive issues. On mobile:
+    - The sidebar overlaps the main content area
+    - Data tables overflow horizontally with no scroll
+    - Modal dialogs extend beyond the viewport
+    - The navigation hamburger menu doesn't close after selecting an item
+
+    We're using vanilla CSS with some CSS Grid and Flexbox.
+    Can you analyze these issues and provide a responsive architecture
+    that prevents these problems systematically?
@@ -0,0 +1,21 @@
+# Test tasks for engineering category agents.
+# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
+
+- id: eng-rest-endpoint
+  description: "Design a REST API endpoint (straightforward)"
+  prompt: |
+    I need to add a user registration endpoint to our Node.js Express API.
+    It should accept email, password, and display name.
+    We use PostgreSQL and need input validation.
+    Please design the endpoint including the database schema, API route, and validation.
+
+- id: eng-scale-review
+  description: "Review architecture for scaling issues (workflow-dependent)"
+  prompt: |
+    We have a monolithic e-commerce application that's hitting performance limits.
+    Current stack: Node.js, PostgreSQL, Redis for sessions, deployed on a single EC2 instance.
+    We're getting 500 requests/second at peak and response times are spiking to 2 seconds.
+    Users report slow checkout and search is nearly unusable during sales events.
+
+    Can you analyze the architecture and recommend a scaling strategy?
+    We have a 3-month timeline and a small team of 4 developers.