feat: add promptfoo eval harness for agent quality scoring (#371)

Adds a promptfoo eval harness for agent quality scoring. An LLM-as-judge system scores task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
2026-04-10 22:54:31 -04:00
parent 1e73b5be0d
commit b456845e85
11 changed files with 796 additions and 0 deletions
--- a/evals/tasks/engineering.yaml
+++ b/evals/tasks/engineering.yaml
@@ -0,0 +1,21 @@
+# Test tasks for engineering category agents.
+# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
+
+- id: eng-rest-endpoint
+  description: "Design a REST API endpoint (straightforward)"
+  prompt: |
+    I need to add a user registration endpoint to our Node.js Express API.
+    It should accept email, password, and display name.
+    We use PostgreSQL and need input validation.
+    Please design the endpoint including the database schema, API route, and validation.
+
+- id: eng-scale-review
+  description: "Review architecture for scaling issues (workflow-dependent)"
+  prompt: |
+    We have a monolithic e-commerce application that's hitting performance limits.
+    Current stack: Node.js, PostgreSQL, Redis for sessions, deployed on a single EC2 instance.
+    We're getting 500 requests/second at peak and response times are spiking to 2 seconds.
+    Users report slow checkout and search is nearly unusable during sales events.
+
+    Can you analyze the architecture and recommend a scaling strategy?
+    We have a 3-month timeline and a small team of 4 developers.