feat: add promptfoo eval harness for agent quality scoring (#371)

Adds a promptfoo eval harness for agent quality scoring: an LLM-as-judge system that scores task completion, instruction adherence, identity consistency, deliverable quality, and safety. Includes tests.
2026-04-10 22:54:31 -04:00
parent 1e73b5be0d
commit b456845e85
11 changed files with 796 additions and 0 deletions
--- a/evals/tasks/academic.yaml
+++ b/evals/tasks/academic.yaml
@@ -0,0 +1,29 @@
+# Test tasks for academic category agents.
+# 2 tasks: 1 straightforward, 1 requiring the agent's workflow.
+
+- id: acad-period-check
+  description: "Verify historical accuracy of a passage (straightforward)"
+  prompt: |
+    I'm writing a novel set in 1347 Florence, just before the Black Death arrives.
+    Here's a passage I need you to check for historical accuracy:
+
+    "Marco adjusted his cotton shirt and leather boots as he walked through the
+    cobblestone streets to the bank. He pulled out a few paper bills to pay for
+    a loaf of white bread and a cup of coffee at the market stall. The church
+    bells rang noon as horse-drawn carriages rattled past."
+
+    Please identify any anachronisms and suggest corrections.
+
+- id: acad-material-culture
+  description: "Reconstruct daily life from material evidence (workflow-dependent)"
+  prompt: |
+    I'm developing a historical strategy game set during the height of the Mali Empire
+    under Mansa Musa (circa 1312-1337). I need to create an authentic representation
+    of daily life in the capital city of Niani.
+
+    What would a typical market day look like? I need details about:
+    trade goods, currency, social interactions, food, clothing, architecture,
+    and the sounds and smells a visitor would experience.
+
+    Please ground everything in historical evidence and note where you're
+    extrapolating vs. working from documented sources.
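
Note on the task file shape: each entry above carries an `id`, a `description`, and a `prompt`. The harness's own loader is not shown in this hunk, so the following is only a minimal sketch of how such a list could be sanity-checked once parsed. The `validate_tasks` name and the `REQUIRED_KEYS` tuple are hypothetical, inferred from the two entries in this file, and the sketch assumes the YAML has already been parsed into a list of dicts (for example with PyYAML's `yaml.safe_load`).

```python
from collections import Counter

# Assumption: these three keys are required, inferred from academic.yaml only.
REQUIRED_KEYS = ("id", "description", "prompt")


def validate_tasks(tasks):
    """Return a list of human-readable problems; an empty list means no issues found."""
    if not isinstance(tasks, list):
        return ["top level must be a list of task mappings"]

    problems = []
    for index, task in enumerate(tasks):
        if not isinstance(task, dict):
            problems.append(f"task #{index}: expected a mapping")
            continue
        # Every required key must be present and hold a non-blank string.
        for key in REQUIRED_KEYS:
            value = task.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"task #{index}: missing or empty '{key}'")

    # Task ids should be unique so per-task scores can be attributed unambiguously.
    id_counts = Counter(
        task["id"]
        for task in tasks
        if isinstance(task, dict) and isinstance(task.get("id"), str)
    )
    for task_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id '{task_id}' appears {count} times")
    return problems


# Hypothetical sample mirroring the two entries in this file (prompts shortened).
sample = [
    {
        "id": "acad-period-check",
        "description": "Verify historical accuracy of a passage (straightforward)",
        "prompt": "Please identify any anachronisms and suggest corrections.",
    },
    {
        "id": "acad-material-culture",
        "description": "Reconstruct daily life from material evidence (workflow-dependent)",
        "prompt": "What would a typical market day look like?",
    },
]

print(validate_tasks(sample))  # prints []
```

A check like this could run before any model calls are made, so a malformed task file fails fast instead of producing confusing judge scores.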