docs: lock in visibility layer, resolve all 5 open design questions

- Resolve T3 mesh mechanics: blackboard-based draft/commit cycle
- Resolve T1 plan output schema: formal JSON structure with workstreams + parallelism groups
- Resolve T5 consensus: T3 aggregates joint verdict (pass/partial/fail), partial retries failed slices only
- Resolve path amendment mechanism: event-based, runner notifies higher tier, no approval gate
- Resolve failure handling: confirmed distributed ownership, runner owns T1 + terminal only

Add run visibility layer:
- Human-readable live log (normal + verbose modes)
- Configurable inspection gates (t1_plan always, t2_synthesis recommended, others optional)
- strict_mode flag for full gating on early runs
- cli/agency.py: run, watch, inspect, approve, reject, pause, resume
- gate_pending halt loop in team_runner, gate_approved/rejected resume
- Expanded blackboard event vocabulary (gate_*, path_amendment, log)
- t3_task_lists table for mesh coordination state
- Inspection gate flow added to buildspec Key Flows

Build order updated: 16 steps (added cli/ step, clarified runner gate responsibilities)
2026-03-30 13:43:19 -04:00
parent 882b769d21
commit a721db63f6
2 changed files with 424 additions and 29 deletions
--- a/docs/design.md
+++ b/docs/design.md
@@ -1,22 +1,22 @@
 # Tiered Agent Team System — Design Document

-_Started: 2026-03-14. Last updated: 2026-03-16 (evening)._
+_Started: 2026-03-14. Last updated: 2026-03-30._

 ---

-## Open Design Questions
+## Resolved Design Decisions (formerly Open Questions)

-The following areas are identified but not yet resolved. Work through these before implementing `core/team_runner.py`.
+All five open questions resolved 2026-03-30. Details in Decisions Log.

-1. **T3 mesh mechanics** — How do T3s within the same T2 domain coordinate? Via blackboard, direct message exchange, or a designated T3 lead? What does "negotiate task boundaries" look like concretely?
+1. **T3 mesh mechanics** → Blackboard-based. T3s write draft task lists, read peers', commit merged plan before T4 dispatch. See _T3 Mesh via Blackboard_.

-2. **T1 output schema** — What does T1's Plan phase output look like as structured data? Needs a formal schema: workstreams, tier paths, parallelism flags, retry budget, T2 specialist list. This is what the runner parses to bootstrap the pipeline.
+2. **T1 output schema** → Formal JSON schema defined. See _T1 Plan Output Schema_.

-3. **T5 consensus mechanics** — Individual T5s review their slice and produce results. Who aggregates? What does the joint verdict look like as structured data? What happens on split verdict (some T5s pass, some fail)?
+3. **T5 consensus mechanics** → T3 aggregates all T5 results into a joint verdict. Split verdict (`partial`) triggers retry of failed slices only. See _T5 Consensus & Verdict Schema_.

-4. **Path amendment mechanism** — When a mid-run tier proposes a path amendment, what's the concrete mechanism? Who writes to the blackboard, in what format, and how does the relevant higher tier get notified?
+4. **Path amendment mechanism** → Amending tier writes a `path_amendment` event to blackboard. Runner monitors events table and notifies the relevant higher tier via a system event. No agent callback plumbing required. See _Path Amendment Mechanism_.

-5. **Failure handling (distributed model)** — The current failure table assumes centralised runner handling. Needs to be rewritten to reflect distributed ownership: T3 handles T4 failures, T2 handles T3 failures, T1 handles T2 failures. Runner only handles T1 failure and terminal escalation to human.
+5. **Failure handling (distributed model)** → Distributed ownership confirmed. Runner only owns T1 failure + terminal human escalation. See updated _Failure Handling_ table.

 ---

@@ -167,6 +167,204 @@ T1 — Phase 1: Plan (self-critique → Andrew approval)

 ---

+## Use Case Flows
+
+T1 assesses complexity and prescribes the tier path per workstream. Three standard depth profiles:
+
+### Full Stack — T1→T2→T3→T4→T5
+*Complex feature, new product, cross-domain changes*
+
+```
+T1 Plan
+  → assess complexity (high)
+  → output T1 Plan Schema (workstreams, tier paths [T2,T3,T4,T5], parallelism, retry budgets)
+  → self-critique pass
+  → GATE: surface to Andrew ← approval required
+
+T2 Lead (spawned by runner after approval)
+  → receive: goal + full workplan
+  → publish: domain boundaries + shared assumptions doc → blackboard
+  → GATE (optional): review boundaries before specialists spawn
+
+T2 Specialists (parallel fan-out, wait on Lead)
+  → each receives: their domain boundary + shared assumptions
+  → produce: architecture proposal for their slice
+  → Lead synthesises, drives conflict resolution if needed
+  → Lead writes: canonical architecture → blackboard
+  → GATE (recommended): review architecture before implementation
+
+Each T2 Specialist → spawns its own T3s (with canonical architecture slice + interface contracts)
+
+T3s (light mesh within T2 domain)
+  → write draft task lists to blackboard
+  → read peers' lists, reconcile boundaries
+  → commit merged task plan before T4 dispatch
+  → GATE (optional): review task breakdown
+
+T4s
+  → swarm: independent tasks run in parallel
+  → pipeline: T4-A output feeds T4-B (T3 declares dependencies)
+  → commit to feature branches
+
+T5s (fan-out per T4 slice)
+  → each reviews its slice independently
+  → T3 collects results → joint verdict
+  → GATE (optional): review T5 verdict before T3 marks done
+  → partial: T3 retries only failed slices
+  → pass: T3 signals workstream done to T2
+
+T2 specialists → signal T2 Lead
+T2 Lead → writes integration summary → blackboard
+
+T1 Accept
+  → validate against goal anchor
+  → open PR, notify Andrew via Hans
+```
+
+### Medium Complexity — T1→T3→T4→T5
+*Config change, isolated bug fix — T1 determines no cross-domain design needed*
+
+```
+T1 Plan
+  → assess: contained scope, single domain, no T2 architecture needed
+  → workplan: tier paths [T3, T4, T5]
+  → GATE: Andrew approval
+
+T3s spawned directly by runner
+  → receive T1 brief with task context (no T2 architecture layer)
+  → T3 light mesh → T4 dispatch → T5 verify → signal done
+
+T1 Accept → PR
+```
+
+### Simple / Hotfix — T1→T4→T5
+*Single file, single function, trivial atomic task*
+
+```
+T1 Plan
+  → assess: trivial, single workstream
+  → tier path: [T4, T5]
+  → GATE: Andrew approval
+
+T4 (coding agent)
+  → single atomic task, commits
+
+T5 (single verifier, not full fan-out)
+  → code review + correctness check
+  → pass → T1 Accept → PR
+```
+
+---
+
+## Resolved Mechanics
+
+### T3 Mesh via Blackboard
+
+T3s coordinate task boundaries before dispatching T4s. All coordination goes through the blackboard — no direct agent-to-agent messaging.
+
+1. Each T3 writes its **draft task list** to the blackboard (one row per proposed T4 task, status `draft`)
+2. Each T3 reads all sibling T3 draft lists in its T2 domain
+3. T3s amend their lists to resolve overlaps (claim tasks, release duplicates)
+4. Once all T3s in the domain have committed their final task lists (status `committed`), T4 dispatch begins
+5. No T3 dispatches T4s until all peers in the domain are committed — this prevents duplicate work
+
+The runner monitors for `all_committed` state and can enforce a timeout (config: `t3_mesh_timeout_minutes`).
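
The commit barrier in the steps above can be sketched as a pure check over task rows. This is a minimal illustration, not the runner's actual code: `TaskRow`, its field names, and both helper functions are hypothetical, and the sketch assumes every T3 in the domain proposes at least one task.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    """One proposed T4 task (hypothetical shape of a t3_task_lists row)."""
    t3_id: str   # T3 that proposed the task
    task: str    # proposed T4 task name
    status: str  # "draft" or "committed"


def all_committed(rows: list[TaskRow], t3_ids: set[str]) -> bool:
    """True when every T3 in the domain has rows and none is still a draft."""
    for t3 in t3_ids:
        own = [r for r in rows if r.t3_id == t3]
        if not own or any(r.status != "committed" for r in own):
            return False
    return True


def overlapping_tasks(rows: list[TaskRow]) -> dict[str, set[str]]:
    """Tasks claimed by more than one T3, which peers still need to reconcile."""
    owners: dict[str, set[str]] = {}
    for r in rows:
        owners.setdefault(r.task, set()).add(r.t3_id)
    return {task: ids for task, ids in owners.items() if len(ids) > 1}
```

T4 dispatch would be allowed only once `all_committed` returns true and `overlapping_tasks` is empty; the timeout path is left out of this sketch.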
+
+---
+
+### T1 Plan Output Schema
+
+T1's Plan phase produces a structured JSON object written to the blackboard. The runner parses this to bootstrap the pipeline.
+
+```json
+{
+  "run_id": "uuid",
+  "goal_anchor": "Original goal — immutable, propagated to every downstream brief",
+  "complexity": "high | medium | low",
+  "retry_budget_multiplier": 2,
+  "workstreams": [
+    {
+      "id": "ws-backend-api",
+      "name": "Backend API",
+      "domain": "backend",
+      "tier_path": ["t2", "t3", "t4", "t5"],
+      "parallel_group": "A",
+      "t2_specialist": "agents/engineering/engineering-software-architect.md",
+      "notes": "Focus on webhook ingest and retry queue"
+    }
+  ],
+  "parallelism": {
+    "groups": {
+      "A": ["ws-backend-api", "ws-frontend"],
+      "B": ["ws-infra"]
+    },
+    "sequence": ["A", "B"]
+  },
+  "self_critique_summary": "Brief plain-text summary of what T1 identified and amended in its self-critique pass"
+}
+```
+
+`parallel_group` + `sequence` handles inter-workstream dependencies: group A runs in parallel, then B starts after A completes.
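
As a rough sketch of how a runner might turn `parallelism` into an execution order, the function below expands `sequence` into waves of workstream ids. The function name and the validation rule are assumptions for illustration; only the `groups` and `sequence` keys come from the schema above.

```python
def execution_waves(plan: dict) -> list[list[str]]:
    """Expand parallelism.sequence into ordered waves of workstream ids.

    Workstreams inside one wave run in parallel; each wave starts only
    after the previous wave has completed.
    """
    groups = plan["parallelism"]["groups"]
    sequence = plan["parallelism"]["sequence"]
    unknown = [g for g in sequence if g not in groups]
    if unknown:
        raise ValueError(f"sequence references undefined groups: {unknown}")
    return [list(groups[g]) for g in sequence]


# Values taken from the example plan above
plan = {
    "parallelism": {
        "groups": {"A": ["ws-backend-api", "ws-frontend"], "B": ["ws-infra"]},
        "sequence": ["A", "B"],
    }
}
waves = execution_waves(plan)
```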
+
+---
+
+### T5 Consensus & Verdict Schema
+
+T3 aggregates all T5 results into a joint verdict after fan-out completes.
+
+**Individual T5 result:**
+```json
+{
+  "verifier_id": "uuid",
+  "scope": "queue-client",
+  "verdict": "pass | fail",
+  "issues": ["issue description..."],
+  "notes": "human-readable summary"
+}
+```
+
+**T3 joint verdict (written to blackboard):**
+```json
+{
+  "t5_results": [...],
+  "joint_verdict": "pass | partial | fail",
+  "failed_scopes": ["queue-client"],
+  "summary": "Human-readable summary for gate surface and logs"
+}
+```
+
+**Split verdict handling:**
+- `pass` → T3 marks workstream done, signals T2
+- `partial` → T3 retries only the failed T4 slices (up to retry budget), re-runs T5 on those slices
+- `fail` → T3 escalates to T2 (or T1 if shallow path)
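
The aggregation could look roughly like the sketch below. The thresholds are an assumption: it treats "every slice failed" as `fail` and any mix of pass and fail as `partial`, which matches the split-verdict wording but is not spelled out in the schema. `joint_verdict` is a hypothetical helper name.

```python
def joint_verdict(t5_results: list[dict]) -> dict:
    """Fold individual T5 results into the T3 joint verdict structure."""
    if not t5_results:
        raise ValueError("no T5 results to aggregate")
    failed = [r["scope"] for r in t5_results if r["verdict"] == "fail"]
    if not failed:
        verdict = "pass"
    elif len(failed) == len(t5_results):
        verdict = "fail"      # assumption: fail means every slice failed
    else:
        verdict = "partial"   # split verdict: retry failed slices only
    passed = len(t5_results) - len(failed)
    return {
        "t5_results": t5_results,
        "joint_verdict": verdict,
        "failed_scopes": failed,
        "summary": f"{passed}/{len(t5_results)} slices passed",
    }
```

On `partial`, `failed_scopes` is the list T3 would use to pick which T4 slices to retry.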
+
+---
+
+### Path Amendment Mechanism
+
+When a mid-run tier discovers scope that warrants a different tier path than T1 prescribed:
+
+1. The discovering tier writes a `path_amendment` event to the blackboard:
+```json
+{
+  "kind": "path_amendment",
+  "proposed_by": "t3/ws-backend-api",
+  "reason": "Discovered auth dependency requires T2 architectural pass",
+  "amendment": {
+    "workstream": "ws-backend-api",
+    "add_tiers": ["t2"],
+    "insert_before": "t3"
+  }
+}
+```
+2. The runner monitors the events table, detects `path_amendment`, and sends a system event notification to the relevant higher tier
+3. The higher tier is **informed, not blocked** — it acknowledges and adjusts its understanding
+4. Amendment is logged on the blackboard for audit; no approval gate required (the next scheduled human gate will surface it)
+
+No agent needs callback plumbing. The runner is the notification bridge.
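
How the amended tier path is computed is not specified here. One possible reading of `add_tiers` + `insert_before`, sketched with a hypothetical `apply_amendment` helper:

```python
def apply_amendment(tier_path: list[str], amendment: dict) -> list[str]:
    """Return a new tier path with add_tiers inserted before insert_before.

    Tiers already present in the path are skipped so that applying the
    same amendment twice does not duplicate them.
    """
    anchor = amendment["insert_before"]
    if anchor not in tier_path:
        raise ValueError(f"insert_before tier {anchor!r} is not in the path")
    idx = tier_path.index(anchor)
    new_tiers = [t for t in amendment["add_tiers"] if t not in tier_path]
    return tier_path[:idx] + new_tiers + tier_path[idx:]


# Amendment body taken from the example event above
amendment = {"workstream": "ws-backend-api", "add_tiers": ["t2"], "insert_before": "t3"}
amended = apply_amendment(["t3", "t4", "t5"], amendment)
```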
+
+---
+
 ## Shared State

 For software pipelines, **the repo is the primary blackboard**:
@@ -185,14 +383,21 @@ Supplemented by a SQLite coordination store per run tracking:

 ## Failure Handling

-| Failure | Handler | Action |
-|---------|---------|--------|
-| T4 bad output | T3 | Retry T4 with corrected brief (up to retry_budget) |
-| T4 blocked | T3 | Escalate immediately — no retries |
-| T4 partial output | T3 | Salvage good parts, re-task remainder |
-| T3 workstream stuck | T2 | Re-scope or split the workstream |
-| T2 design wrong | T1 | Re-plan; may discard workstream and restart |
-| Repeated escalation | Surface to user | Block until human unblocks |
+Distributed ownership — each tier handles failures in the tier below it. The runner only handles T1 failure and terminal human escalation.
+
+| Failure | Owner | Handler | Action |
+|---------|-------|---------|--------|
+| T4 bad output | T3 | `escalation.py` called by T3's context | Retry T4 with corrected brief (up to retry_budget) |
+| T4 blocked | T3 | `escalation.py` | Escalate to T3 immediately — no retries |
+| T4 partial output | T3 | `escalation.py` | Salvage good parts, re-task remainder |
+| T5 partial verdict | T3 | T3 joint verdict logic | Retry failed T4 slices only |
+| T5 full fail | T3 | T3 joint verdict logic | Escalate to T2 |
+| T3 workstream stuck | T2 | T2 specialist prompt + blackboard | Re-scope or split the workstream |
+| T2 design wrong | T1 | T1 Accept phase + blackboard | Re-plan; may discard workstream and restart |
+| T1 failure / crash | Runner | `team_runner.py` | Surface to human, halt run |
+| Repeated escalation | Runner | `team_runner.py` | Gate: block until human unblocks |
+
+**Key distinction:** `escalation.py` is not called by the runner centrally. It is logic that tier agents execute (or the runner executes on their behalf when it detects a timeout or dead agent). The runner only owns the last two rows.

 Retry limits prevent infinite loops. Escalation path is always upward, never sideways.

@@ -264,6 +469,114 @@ T4 and T5 default to the **coding agent runtime** when available. Falls back to

 ---

+## Run Visibility Layer
+
+Designed for debugging, test runs, and quality evaluation at each tier. Three interlocking components.
+
+### 1. Human-Readable Live Log
+
+Structured events from the blackboard rendered as a timestamped, readable stream. `agency watch <run_id>` tails this live.
+
+```
+[abc123] 12:30:01  T1   PLAN_START    Assessing scope: "Build webhook ingestion system"
+[abc123] 12:30:14  T1   PLAN_DONE     3 workstreams — backend-api, infra, docs (2 parallel)
+[abc123] 12:30:14  GATE APPROVAL      ⏸  Waiting on approval before T2 spawns
+[abc123] 12:31:02  GATE APPROVED      ✓  Approved — continuing
+[abc123] 12:31:03  T2   LEAD_START    Lead Architect spawned
+[abc123] 12:31:41  T2   BOUNDS_READY  Domain boundaries + shared assumptions published
+[abc123] 12:31:42  T2   SPEC_START    3 specialists spawned (parallel): backend, infra, docs
+[abc123] 12:32:15  T2   SPEC_DONE     backend-api architecture draft ready
+[abc123] 12:32:58  T2   SYNTH_DONE    Canonical architecture written to blackboard
+[abc123] 12:32:58  GATE INSPECTION    ⏸  T2 synthesis ready for review
+[abc123] 12:33:44  T3   MESH_START    backend-api: 2 squad leads negotiating task boundaries
+[abc123] 12:34:01  T3   MESH_DONE     Task split committed — 7 T4 tasks (5 swarm, 2 pipeline)
+[abc123] 12:34:02  T4   SWARM_START   5 workers spawned in parallel
+[abc123] 12:35:10  T4   DONE          worker-3 auth-middleware ✓
+[abc123] 12:35:22  T4   FAIL          worker-4 queue-client ✗  (retry 1/3)
+[abc123] 12:36:04  T4   DONE          worker-4 queue-client ✓  (retry resolved)
+[abc123] 12:36:05  T5   VERIFY_START  4 verifiers spawned
+[abc123] 12:36:45  T5   VERDICT       partial — queue-client needs rework
+[abc123] 12:37:12  T5   VERDICT       ✓  all pass — workstream backend-api done
+```
+
+Log level `verbose` adds per-T4-start/done lines. Default is `normal` (tier-level events only).
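
For illustration, a log line in the layout shown above can be produced with fixed-width columns. The column widths are inferred from the sample output, and `format_log_line` is a hypothetical helper rather than part of the CLI.

```python
from datetime import time


def format_log_line(run_id: str, ts: time, tier: str, event: str, message: str) -> str:
    """Render one blackboard event in the fixed-width live-log layout."""
    # Tier column is 5 characters wide and event column 14, per the sample log
    return f"[{run_id}] {ts:%H:%M:%S}  {tier:<5}{event:<14}{message}"


line = format_log_line(
    "abc123",
    time(12, 30, 1),
    "T1",
    "PLAN_START",
    'Assessing scope: "Build webhook ingestion system"',
)
```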
+
+### 2. Inspection Gates
+
+Configurable pause points. When the runner hits a gate, it:
+1. Writes a `gate_pending` event to the blackboard
+2. Fires `notify_adapter.send()` with a tier summary to Andrew (via Hans)
+3. Halts — no next tier spawns until `gate_approved` or `gate_rejected` is written
+
+The tier summary surfaced at each gate includes:
+- **What was produced** (the tier artifact in readable form)
+- **What happens next** (which agents will spawn, doing what)
+- **Any anomalies** flagged by the tier itself
+
+Configurable in `team.yaml` under `visibility.inspection_gates`. A `strict_mode: true` flag enables all gates — recommended for first runs on a new codebase or new goal type.
+
+```yaml
+visibility:
+  strict_mode: false
+  log_level: normal           # normal | verbose
+  inspection_gates:
+    t1_plan: true             # always — required by design
+    t2_lead: false            # optional — review boundaries before specialists
+    t2_synthesis: true        # recommended — review architecture before implementation
+    t3_plan: false            # verbose — useful early on, disable once T3 is trusted
+    t5_verdict: false         # review T5 joint verdict before T3 marks workstream done
+  gate_timeout_minutes: 60    # auto-reject if no response within this window
+```
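
A minimal sketch of how the runner might resolve which gates are active, assuming the `visibility` block has already been parsed into a dict. `KNOWN_GATES` and `effective_gates` are illustrative names; the rules encoded are the two stated above: `strict_mode` turns every gate on, and `t1_plan` is always on.

```python
# Gate names as listed in the team.yaml example above
KNOWN_GATES = ("t1_plan", "t2_lead", "t2_synthesis", "t3_plan", "t5_verdict")


def effective_gates(visibility: dict) -> dict[str, bool]:
    """Resolve the active inspection gates from a parsed visibility block."""
    if visibility.get("strict_mode", False):
        return {name: True for name in KNOWN_GATES}
    configured = visibility.get("inspection_gates", {})
    gates = {name: bool(configured.get(name, False)) for name in KNOWN_GATES}
    gates["t1_plan"] = True  # required by design, cannot be switched off
    return gates
```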
+
+### 3. Inspection CLI — `cli/agency.py`
+
+```
+agency run <config.yaml>               # start a run, returns run_id
+agency watch <run_id>                  # tail live log (follows blackboard events)
+agency inspect <run_id>                # interactive tree view of run state
+agency inspect <run_id> --tier t2      # jump to T2 artifacts
+agency inspect <run_id> --brief <id>   # show full brief + result JSON
+
+agency approve <run_id>                # approve current gate → continue
+agency approve <run_id> --note "..."   # approve with a note written to blackboard
+agency reject <run_id> --reason "..."  # reject → tier re-invoked
+agency pause <run_id>                  # force-pause at next tier boundary
+agency resume <run_id>                 # release a manual pause
+```
+
+`agency inspect` (no flags) renders a live tree:
+```
+Run abc123 — "Build webhook ingestion system"
+├── T1 Plan ✓
+│   └── [view workplan]
+├── T2 Architecture ✓  [GATE: pending review]
+│   ├── [view domain boundaries]
+│   ├── [view shared assumptions]
+│   └── [view canonical architecture]
+├── T3 backend-api (active)
+│   ├── [view task breakdown]
+│   └── T4 workers: 3/7 done, 1 retrying, 3 pending
+└── T3 infra (pending)
+```
+
+### Blackboard Event Vocabulary (extended)
+
+```python
+from typing import Literal
+
+# EventKind is an illustrative name; the vocabulary itself is unchanged
+EventKind = Literal[
+    # existing
+    "spawned", "completed", "failed", "escalated", "retried",
+    # new: visibility layer
+    "gate_pending",     # runner hit a gate, waiting for human
+    "gate_approved",    # human approved, run continues
+    "gate_rejected",    # human rejected, tier re-invoked
+    "gate_paused",      # manual pause via CLI
+    "gate_resumed",     # manual resume via CLI
+    "path_amendment",   # mid-run tier proposed path change
+    "log",              # human-readable log line (level + message)
+]
+```
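
The gate events in this vocabulary are enough to decide whether a run is currently halted. The sketch below replays event kinds in order; `is_halted` is a hypothetical helper, and it assumes a rejected gate releases the halt because the tier is re-invoked.

```python
def is_halted(event_kinds: list[str]) -> bool:
    """Replay blackboard event kinds in order and report whether the run is halted."""
    gate_open = False  # a gate_pending has not yet been approved or rejected
    paused = False     # a manual pause has not yet been resumed
    for kind in event_kinds:
        if kind == "gate_pending":
            gate_open = True
        elif kind in ("gate_approved", "gate_rejected"):
            gate_open = False
        elif kind == "gate_paused":
            paused = True
        elif kind == "gate_resumed":
            paused = False
    return gate_open or paused
```

Events outside the gate vocabulary (`spawned`, `log`, and so on) are ignored, so the same event stream that feeds the live log can be passed in unfiltered.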
+
+---
+
 ## Decisions Log

 **T1 dynamic dispatch** — T1 assesses scope and prescribes tier path and workstream parallelism. It does not prescribe internal tier coordination patterns.
@@ -297,3 +610,15 @@ T4 and T5 default to the **coding agent runtime** when available. Falls back to
 **Coding agent runtime** — Claude Code is default T4/T5 runtime. Opt-in `native_teams` flag available for internal Claude Code parallelism — faster but less blackboard visibility. Default `false`.

 **Agency-agents integration** — Via git submodule at `agents/`. T1 selects specialists via `config/role_registry.yaml`. `agent_personality` field on task brief; runtime injects as system prompt at spawn time.
+
+**T3 mesh mechanics** — Blackboard-based coordination. T3s write draft task lists, read peers', reconcile overlaps, commit merged plan. No T4 dispatch until all T3s in the domain have committed. Runner enforces timeout (`t3_mesh_timeout_minutes` in config). Chosen over designated T3 lead or direct messaging — fits distributed ownership model, gives full audit trail for free.
+
+**T1 output schema** — Formal JSON schema defined (2026-03-30). Fields: `run_id`, `goal_anchor`, `complexity`, `retry_budget_multiplier`, `workstreams[]` (id, name, domain, tier_path, parallel_group, t2_specialist, notes), `parallelism` (groups + sequence), `self_critique_summary`. `parallel_group` + `sequence` handles inter-workstream dependencies.
+
+**T5 consensus** — T3 aggregates all T5 results into joint verdict: `pass | partial | fail`. Split verdict (`partial`) → T3 retries only failed slices, re-runs T5 on those slices. Full `fail` escalates to T2. T3 writes structured joint verdict to blackboard; this is what the optional T5 gate surfaces to Andrew.
+
+**Path amendment mechanism** — Amending tier writes `path_amendment` event to blackboard (structured JSON: proposed_by, reason, amendment). Runner monitors events table, sends system event notification to relevant higher tier. Higher tier is informed, not blocked. No agent callback plumbing. Amendments surface at next scheduled human gate.
+
+**Failure handling (distributed)** — Confirmed distributed ownership (2026-03-30). `escalation.py` is logic tiers execute (or runner executes on tier's behalf on timeout/crash), not a central runner concern. Runner only owns: T1 failure, terminal human escalation. See updated Failure Handling table.
+
+**Run visibility layer** — Added 2026-03-30. Human-readable live log, configurable inspection gates, and `cli/agency.py` inspection/control commands. Designed for debugging and quality evaluation at each tier during early runs. `strict_mode: true` enables all gates. Gates surface tier artifacts + "what happens next" summary to Andrew via Hans. Resolves Q3 (T5 consensus surfaces as gate event with human-readable summary). T5 gate (optional) lets Andrew review joint verdict before T3 marks workstream done.