From 882b769d21d5557fa78e0da78a0f1c379d4d22b5 Mon Sep 17 00:00:00 2001 From: Hans Heinemann Date: Mon, 30 Mar 2026 09:00:16 -0400 Subject: [PATCH 1/4] chore: sync agency-agents submodule with upstream --- agents | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/agents b/agents index aacfb86..5f1204a 160000 --- a/agents +++ b/agents @@ -1 +1 @@ -Subproject commit aacfb86196d2b7c6ecb58df835834ffef80198eb +Subproject commit 5f1204a02350b1644035537c9bb4f6e15cdbc9fd From a721db63f623c8b48732d0468dc0883387debd9a Mon Sep 17 00:00:00 2001 From: Hans Heinemann Date: Mon, 30 Mar 2026 13:43:19 -0400 Subject: [PATCH 2/4] docs: lock in visibility layer, resolve all 5 open design questions - Resolve T3 mesh mechanics: blackboard-based draft/commit cycle - Resolve T1 plan output schema: formal JSON structure with workstreams + parallelism groups - Resolve T5 consensus: T3 aggregates joint verdict (pass/partial/fail), partial retries failed slices only - Resolve path amendment mechanism: event-based, runner notifies higher tier, no approval gate - Resolve failure handling: confirmed distributed ownership, runner owns T1 + terminal only Add run visibility layer: - Human-readable live log (normal + verbose modes) - Configurable inspection gates (t1_plan always, t2_synthesis recommended, others optional) - strict_mode flag for full gating on early runs - cli/agency.py: run, watch, inspect, approve, reject, pause, resume - gate_pending halt loop in team_runner, gate_approved/rejected resume - Expanded blackboard event vocabulary (gate_*, path_amendment, log) - t3_task_lists table for mesh coordination state - Inspection gate flow added to buildspec Key Flows Build order updated: 16 steps (added cli/ step, clarified runner gate responsibilities) --- docs/buildspec.md | 96 +++++++++++-- docs/design.md | 357 +++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 424 insertions(+), 29 deletions(-) diff --git a/docs/buildspec.md b/docs/buildspec.md index 
dbabb73..2b3cac5 100644 --- a/docs/buildspec.md +++ b/docs/buildspec.md @@ -1,7 +1,7 @@ # Tiered Agent Team System — Build Spec -_Started: 2026-03-15. Status: Pre-build._ -_See agent-teams-design.md for the design doc and decisions log._ +_Started: 2026-03-15. Last updated: 2026-03-30._ +_See design.md for the design doc and decisions log._ --- @@ -68,6 +68,9 @@ agent-teams/ │ ├── team.yaml — example run configuration │ └── role_registry.yaml — maps (tier, domain) → agent personality file │ +├── cli/ +│ └── agency.py — run, watch, inspect, approve, reject, pause, resume +│ ├── runs/ — runtime state, one subdir per run_id │ └── .gitkeep │ @@ -131,12 +134,43 @@ CREATE TABLE events ( event_id TEXT PRIMARY KEY, run_id TEXT NOT NULL, brief_id TEXT, - kind TEXT NOT NULL, -- spawned | completed | failed | escalated | retried + kind TEXT NOT NULL, -- see event vocabulary below detail TEXT, -- JSON created_at TEXT NOT NULL ); ``` +**Event kind vocabulary:** +``` +-- lifecycle +spawned | completed | failed | escalated | retried + +-- visibility / gates +gate_pending -- runner hit an inspection gate, waiting for human +gate_approved -- human approved via CLI or notify +gate_rejected -- human rejected, tier re-invoked +gate_paused -- manual pause via CLI +gate_resumed -- manual resume via CLI + +-- amendments / informational +path_amendment -- mid-run tier proposed a tier path change +log -- human-readable log line (detail: {level, message}) +``` + +**t3_task_lists** *(T3 mesh coordination)* +```sql +CREATE TABLE t3_task_lists ( + entry_id TEXT PRIMARY KEY, + run_id TEXT NOT NULL, + workstream_id TEXT NOT NULL, + t3_agent_id TEXT NOT NULL, + status TEXT NOT NULL, -- draft | committed + tasks TEXT NOT NULL, -- JSON array of proposed T4 task descriptors + created_at TEXT NOT NULL, + updated_at TEXT NOT NULL +); +``` + --- ## Task Brief Schema @@ -283,6 +317,19 @@ retry_defaults: bad_output: 3 partial: 2 blocked: 0 # always escalate immediately + +visibility: + strict_mode: false 
# true = all gates on (recommended for first runs) + log_level: normal # normal | verbose (verbose = per-T4 start/done lines) + inspection_gates: + t1_plan: true # always — required by design + t2_lead: false # optional — review boundaries before specialists spawn + t2_synthesis: true # recommended — review architecture before implementation + t3_plan: false # verbose — useful early on, disable once T3 is trusted + t5_verdict: false # review T5 joint verdict before T3 marks workstream done + gate_timeout_minutes: 60 # auto-reject if no human response within this window + +t3_mesh_timeout_minutes: 10 # max time for T3s to commit task lists before runner escalates ``` --- @@ -388,7 +435,29 @@ spawn T4 with brief → notify T3 ``` -### 4. Review Gate +### 4. Inspection Gate Flow + +``` +runner reaches configured gate (e.g. t2_synthesis) + → write event(gate_pending, detail={tier, summary, what_happens_next}) + → notify_adapter.send(tier summary to Andrew via Hans) + → halt: poll blackboard for gate_approved or gate_rejected + + gate_approved: + → write event(gate_approved) + → continue run + + gate_rejected: + → write event(gate_rejected, detail={reason}) + → re-invoke tier with rejection reason in brief context + → loop back to gate_pending when tier completes again + + gate_timeout (gate_timeout_minutes elapsed): + → treat as gate_rejected + → notify Andrew: "Gate timed out, re-invoking tier" +``` + +### 5. Review Gate ``` T1 completes integration @@ -412,19 +481,20 @@ T1 completes integration 1. `git submodule add https://github.com/msitarzewski/agency-agents agents/` — pull the talent pool 2. `config/role_registry.yaml` — map tier+domain → agent personality files -3. `core/task_brief.py` — schema + validation (everything depends on this) -4. `core/blackboard.py` — SQLite store, all table definitions +3. `core/task_brief.py` — schema + validation (everything depends on this); include T1 Plan Output Schema +4. 
`core/blackboard.py` — SQLite store, all table definitions including `t3_task_lists`; full event kind vocabulary 5. `adapters/base/*` — all four abstract interfaces 6. `adapters/llm/anthropic.py` — first LLM implementation -7. `core/escalation.py` — retry + failure routing logic +7. `core/escalation.py` — retry + failure routing logic (called by tiers, not runner centrally) 8. `adapters/runtime/openclaw.py` — wire up sessions_spawn + personality injection 9. `adapters/runtime/claude_code.py` — coding agent runtime, personality via --system-prompt -10. `core/team_runner.py` — full run lifecycle, runtime + personality selection -11. `prompts/` — fallback tier prompts (used when no agent_personality set) -12. `adapters/vcs/github.py` — PR creation + branch management -13. `adapters/notify/openclaw.py` — Hans notification -14. `config/team.yaml` — example config -15. `README.md` — how to run, how to add adapters, how to extend the roster +10. `core/team_runner.py` — full run lifecycle: gate logic (gate_pending halt loop, gate_approved resume), path amendment monitor, T1 failure + terminal escalation only +11. `cli/agency.py` — run, watch, inspect, approve, reject, pause, resume; `watch` tails blackboard events and renders live log; `inspect` renders run tree +12. `prompts/` — fallback tier prompts (used when no agent_personality set) +13. `adapters/vcs/github.py` — PR creation + branch management +14. `adapters/notify/openclaw.py` — Hans notification; used for gate surfaces (tier summary to Andrew) +15. `config/team.yaml` — example config with full visibility block +16. `README.md` — how to run, how to add adapters, how to extend the roster; include `agency` CLI reference --- diff --git a/docs/design.md b/docs/design.md index 5679dc9..8486f51 100644 --- a/docs/design.md +++ b/docs/design.md @@ -1,22 +1,22 @@ # Tiered Agent Team System — Design Document -_Started: 2026-03-14. Last updated: 2026-03-16 (evening)._ +_Started: 2026-03-14. 
Last updated: 2026-03-30._ --- -## Open Design Questions +## Resolved Design Decisions (formerly Open Questions) -The following areas are identified but not yet resolved. Work through these before implementing `core/team_runner.py`. +All five open questions resolved 2026-03-30. Details in Decisions Log. -1. **T3 mesh mechanics** — How do T3s within the same T2 domain coordinate? Via blackboard, direct message exchange, or a designated T3 lead? What does "negotiate task boundaries" look like concretely? +1. **T3 mesh mechanics** → Blackboard-based. T3s write draft task lists, read peers', commit merged plan before T4 dispatch. See _T3 Mesh via Blackboard_. -2. **T1 output schema** — What does T1's Plan phase output look like as structured data? Needs a formal schema: workstreams, tier paths, parallelism flags, retry budget, T2 specialist list. This is what the runner parses to bootstrap the pipeline. +2. **T1 output schema** → Formal JSON schema defined. See _T1 Plan Output Schema_. -3. **T5 consensus mechanics** — Individual T5s review their slice and produce results. Who aggregates? What does the joint verdict look like as structured data? What happens on split verdict (some T5s pass, some fail)? +3. **T5 consensus mechanics** → T3 aggregates all T5 results into a joint verdict. Split verdict (`partial`) triggers retry of failed slices only. See _T5 Consensus & Verdict Schema_. -4. **Path amendment mechanism** — When a mid-run tier proposes a path amendment, what's the concrete mechanism? Who writes to the blackboard, in what format, and how does the relevant higher tier get notified? +4. **Path amendment mechanism** → Amending tier writes a `path_amendment` event to blackboard. Runner monitors events table and notifies the relevant higher tier via a system event. No agent callback plumbing required. See _Path Amendment Mechanism_. -5. **Failure handling (distributed model)** — The current failure table assumes centralised runner handling. 
Needs to be rewritten to reflect distributed ownership: T3 handles T4 failures, T2 handles T3 failures, T1 handles T2 failures. Runner only handles T1 failure and terminal escalation to human. +5. **Failure handling (distributed model)** → Distributed ownership confirmed. Runner only owns T1 failure + terminal human escalation. See updated _Failure Handling_ table. --- @@ -167,6 +167,204 @@ T1 — Phase 1: Plan (self-critique → Andrew approval) --- +## Use Case Flows + +T1 assesses complexity and prescribes the tier path per workstream. Three standard depth profiles: + +### Full Stack — T1→T2→T3→T4→T5 +*Complex feature, new product, cross-domain changes* + +``` +T1 Plan + → assess complexity (high) + → output T1 Plan Schema (workstreams, tier paths [T2,T3,T4,T5], parallelism, retry budgets) + → self-critique pass + → GATE: surface to Andrew ← approval required + +T2 Lead (spawned by runner after approval) + → receive: goal + full workplan + → publish: domain boundaries + shared assumptions doc → blackboard + → GATE (optional): review boundaries before specialists spawn + +T2 Specialists (parallel fan-out, wait on Lead) + → each receives: their domain boundary + shared assumptions + → produce: architecture proposal for their slice + → Lead synthesises, drives conflict resolution if needed + → Lead writes: canonical architecture → blackboard + → GATE (recommended): review architecture before implementation + +Each T2 Specialist → spawns its own T3s (with canonical architecture slice + interface contracts) + +T3s (light mesh within T2 domain) + → write draft task lists to blackboard + → read peers' lists, reconcile boundaries + → commit merged task plan before T4 dispatch + → GATE (optional): review task breakdown + +T4s + → swarm: independent tasks run in parallel + → pipeline: T4-A output feeds T4-B (T3 declares dependencies) + → commit to feature branches + +T5s (fan-out per T4 slice) + → each reviews its slice independently + → T3 collects results → joint verdict + 
→ GATE (optional): review T5 verdict before T3 marks done + → partial: T3 retries only failed slices + → pass: T3 signals workstream done to T2 + +T2 specialists → signal T2 Lead +T2 Lead → writes integration summary → blackboard + +T1 Accept + → validate against goal anchor + → open PR, notify Andrew via Hans +``` + +### Medium Complexity — T1→T3→T4→T5 +*Config change, isolated bug fix — T1 determines no cross-domain design needed* + +``` +T1 Plan + → assess: contained scope, single domain, no T2 architecture needed + → workplan: tier paths [T3, T4, T5] + → GATE: Andrew approval + +T3s spawned directly by runner + → receives T1 brief with task context (no T2 architecture layer) + → T3 light mesh → T4 dispatch → T5 verify → signal done + +T1 Accept → PR +``` + +### Simple / Hotfix — T1→T4→T5 +*Single file, single function, trivial atomic task* + +``` +T1 Plan + → assess: trivial, single workstream + → tier path: [T4, T5] + → GATE: Andrew approval + +T4 (coding agent) + → single atomic task, commits + +T5 (single verifier, not full fan-out) + → code review + correctness check + → pass → T1 Accept → PR +``` + +--- + +## Resolved Mechanics + +### T3 Mesh via Blackboard + +T3s coordinate task boundaries before dispatching T4s. All coordination goes through the blackboard — no direct agent-to-agent messaging. + +1. Each T3 writes its **draft task list** to the blackboard (one row per proposed T4 task, status `draft`) +2. Each T3 reads all sibling T3 draft lists in its T2 domain +3. T3s amend their lists to resolve overlaps (claim tasks, release duplicates) +4. Once all T3s in the domain have committed their final task lists (status `committed`), T4 dispatch begins +5. No T3 dispatches T4s until all peers in the domain are committed — this prevents duplicate work + +The runner monitors for `all_committed` state and can enforce a timeout (config: `t3_mesh_timeout_minutes`). 
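The coordination steps above could be sketched roughly as follows. This is only an illustration, not the project's `core/blackboard.py`: the table layout is taken from the buildspec's `t3_task_lists` definition, while the helper names (`write_draft`, `commit_list`, `all_committed`), the `entry_id` format, and the choice to treat T3s sharing a `workstream_id` as mesh peers are assumptions made for the example.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Table layout follows the buildspec's t3_task_lists definition.
SCHEMA = """
CREATE TABLE t3_task_lists (
    entry_id      TEXT PRIMARY KEY,
    run_id        TEXT NOT NULL,
    workstream_id TEXT NOT NULL,
    t3_agent_id   TEXT NOT NULL,
    status        TEXT NOT NULL,  -- draft | committed
    tasks         TEXT NOT NULL,  -- JSON array of proposed T4 task descriptors
    created_at    TEXT NOT NULL,
    updated_at    TEXT NOT NULL
);
"""


def _now():
    return datetime.now(timezone.utc).isoformat()


def write_draft(db, run_id, workstream_id, t3_agent_id, tasks):
    """Hypothetical helper: a T3 publishes or replaces its draft task list."""
    entry_id = f"{run_id}:{workstream_id}:{t3_agent_id}"  # assumed key format
    now = _now()
    db.execute(
        "INSERT OR REPLACE INTO t3_task_lists "
        "VALUES (?, ?, ?, ?, 'draft', ?, ?, ?)",
        (entry_id, run_id, workstream_id, t3_agent_id, json.dumps(tasks), now, now),
    )


def commit_list(db, run_id, workstream_id, t3_agent_id):
    """Hypothetical helper: a T3 marks its reconciled list as final."""
    db.execute(
        "UPDATE t3_task_lists SET status = 'committed', updated_at = ? "
        "WHERE run_id = ? AND workstream_id = ? AND t3_agent_id = ?",
        (_now(), run_id, workstream_id, t3_agent_id),
    )


def all_committed(db, run_id, workstream_id, expected_agents):
    """True only when every expected T3 peer has a committed list."""
    rows = db.execute(
        "SELECT t3_agent_id FROM t3_task_lists "
        "WHERE run_id = ? AND workstream_id = ? AND status = 'committed'",
        (run_id, workstream_id),
    ).fetchall()
    return set(expected_agents) <= {row[0] for row in rows}


# Two T3 peers draft, then commit one after the other.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
peers = ["t3-a", "t3-b"]
write_draft(db, "abc123", "ws-backend-api", "t3-a", [{"task": "auth-middleware"}])
write_draft(db, "abc123", "ws-backend-api", "t3-b", [{"task": "queue-client"}])

commit_list(db, "abc123", "ws-backend-api", "t3-a")
gate_before = all_committed(db, "abc123", "ws-backend-api", peers)  # one peer still in draft

commit_list(db, "abc123", "ws-backend-api", "t3-b")
gate_after = all_committed(db, "abc123", "ws-backend-api", peers)  # every peer committed
```

In this sketch T4 dispatch would be held while `all_committed` returns false. A runner could poll the same check and escalate once `t3_mesh_timeout_minutes` elapses.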
+ +--- + +### T1 Plan Output Schema + +T1's Plan phase produces a structured JSON object written to the blackboard. The runner parses this to bootstrap the pipeline. + +```json +{ + "run_id": "uuid", + "goal_anchor": "Original goal — immutable, propagated to every downstream brief", + "complexity": "high | medium | low", + "retry_budget_multiplier": 2, + "workstreams": [ + { + "id": "ws-backend-api", + "name": "Backend API", + "domain": "backend", + "tier_path": ["t2", "t3", "t4", "t5"], + "parallel_group": "A", + "t2_specialist": "agents/engineering/engineering-software-architect.md", + "notes": "Focus on webhook ingest and retry queue" + } + ], + "parallelism": { + "groups": { + "A": ["ws-backend-api", "ws-frontend"], + "B": ["ws-infra"] + }, + "sequence": ["A", "B"] + }, + "self_critique_summary": "Brief plain-text summary of what T1 identified and amended in its self-critique pass" +} +``` + +`parallel_group` + `sequence` handles inter-workstream dependencies: group A runs in parallel, then B starts after A completes. + +--- + +### T5 Consensus & Verdict Schema + +T3 aggregates all T5 results into a joint verdict after fan-out completes. + +**Individual T5 result:** +```json +{ + "verifier_id": "uuid", + "scope": "queue-client", + "verdict": "pass | fail", + "issues": ["issue description..."], + "notes": "human-readable summary" +} +``` + +**T3 joint verdict (written to blackboard):** +```json +{ + "t5_results": [...], + "joint_verdict": "pass | partial | fail", + "failed_scopes": ["queue-client"], + "summary": "Human-readable summary for gate surface and logs" +} +``` + +**Split verdict handling:** +- `pass` → T3 marks workstream done, signals T2 +- `partial` → T3 retries only the failed T4 slices (up to retry budget), re-runs T5 on those slices +- `fail` → T3 escalates to T2 (or T1 if shallow path) + +--- + +### Path Amendment Mechanism + +When a mid-run tier discovers scope that warrants a different tier path than T1 prescribed: + +1. 
The discovering tier writes a `path_amendment` event to the blackboard: +```json +{ + "kind": "path_amendment", + "proposed_by": "t3/ws-backend-api", + "reason": "Discovered auth dependency requires T2 architectural pass", + "amendment": { + "workstream": "ws-backend-api", + "add_tiers": ["t2"], + "insert_before": "t3" + } +} +``` +2. The runner monitors the events table, detects `path_amendment`, and sends a system event notification to the relevant higher tier +3. The higher tier is **informed, not blocked** — it acknowledges and adjusts its understanding +4. Amendment is logged on the blackboard for audit; no approval gate required (the next scheduled human gate will surface it) + +No agent needs callback plumbing. The runner is the notification bridge. + +--- + ## Shared State For software pipelines, **the repo is the primary blackboard**: @@ -185,14 +383,21 @@ Supplemented by a SQLite coordination store per run tracking: ## Failure Handling -| Failure | Handler | Action | -|---------|---------|--------| -| T4 bad output | T3 | Retry T4 with corrected brief (up to retry_budget) | -| T4 blocked | T3 | Escalate immediately — no retries | -| T4 partial output | T3 | Salvage good parts, re-task remainder | -| T3 workstream stuck | T2 | Re-scope or split the workstream | -| T2 design wrong | T1 | Re-plan; may discard workstream and restart | -| Repeated escalation | Surface to user | Block until human unblocks | +Distributed ownership — each tier handles failures in the tier below it. The runner only handles T1 failure and terminal human escalation. 
+ +| Failure | Owner | Handler | Action | +|---------|-------|---------|--------| +| T4 bad output | T3 | `escalation.py` called by T3's context | Retry T4 with corrected brief (up to retry_budget) | +| T4 blocked | T3 | `escalation.py` | Escalate to T3 immediately — no retries | +| T4 partial output | T3 | `escalation.py` | Salvage good parts, re-task remainder | +| T5 partial verdict | T3 | T3 joint verdict logic | Retry failed T4 slices only | +| T5 full fail | T3 | T3 joint verdict logic | Escalate to T2 | +| T3 workstream stuck | T2 | T2 specialist prompt + blackboard | Re-scope or split the workstream | +| T2 design wrong | T1 | T1 Accept phase + blackboard | Re-plan; may discard workstream and restart | +| T1 failure / crash | Runner | `team_runner.py` | Surface to human, halt run | +| Repeated escalation | Runner | `team_runner.py` | Gate: block until human unblocks | + +**Key distinction:** `escalation.py` is not called by the runner centrally. It is logic that tier agents execute (or the runner executes on their behalf when it detects a timeout or dead agent). The runner only owns the last two rows. Retry limits prevent infinite loops. Escalation path is always upward, never sideways. @@ -264,6 +469,114 @@ T4 and T5 default to the **coding agent runtime** when available. Falls back to --- +## Run Visibility Layer + +Designed for debugging, test runs, and quality evaluation at each tier. Three interlocking components. + +### 1. Human-Readable Live Log + +Structured events from the blackboard rendered as a timestamped, readable stream. `agency watch ` tails this live. 
+ +``` +[abc123] 12:30:01 T1 PLAN_START Assessing scope: "Build webhook ingestion system" +[abc123] 12:30:14 T1 PLAN_DONE 3 workstreams — backend-api, infra, docs (2 parallel) +[abc123] 12:30:14 GATE APPROVAL ⏸ Waiting on approval before T2 spawns +[abc123] 12:31:02 GATE APPROVED ✓ Approved — continuing +[abc123] 12:31:03 T2 LEAD_START Lead Architect spawned +[abc123] 12:31:41 T2 BOUNDS_READY Domain boundaries + shared assumptions published +[abc123] 12:31:42 T2 SPEC_START 3 specialists spawned (parallel): backend, infra, docs +[abc123] 12:32:15 T2 SPEC_DONE backend-api architecture draft ready +[abc123] 12:32:58 T2 SYNTH_DONE Canonical architecture written to blackboard +[abc123] 12:32:58 GATE INSPECTION ⏸ T2 synthesis ready for review +[abc123] 12:33:44 T3 MESH_START backend-api: 2 squad leads negotiating task boundaries +[abc123] 12:34:01 T3 MESH_DONE Task split committed — 7 T4 tasks (5 swarm, 2 pipeline) +[abc123] 12:34:02 T4 SWARM_START 5 workers spawned in parallel +[abc123] 12:35:10 T4 DONE worker-3 auth-middleware ✓ +[abc123] 12:35:22 T4 FAIL worker-4 queue-client ✗ (retry 1/3) +[abc123] 12:36:04 T4 DONE worker-4 queue-client ✓ (retry resolved) +[abc123] 12:36:05 T5 VERIFY_START 4 verifiers spawned +[abc123] 12:36:45 T5 VERDICT partial — queue-client needs rework +[abc123] 12:37:12 T5 VERDICT ✓ all pass — workstream backend-api done +``` + +Log level `verbose` adds per-T4-start/done lines. Default is `normal` (tier-level events only). + +### 2. Inspection Gates + +Configurable pause points. When the runner hits a gate, it: +1. Writes a `gate_pending` event to the blackboard +2. Fires `notify_adapter.send()` with a tier summary to Andrew (via Hans) +3. 
Halts — no next tier spawns until `gate_approved` or `gate_rejected` is written + +The tier summary surfaced at each gate includes: +- **What was produced** (the tier artifact in readable form) +- **What happens next** (which agents will spawn, doing what) +- **Any anomalies** flagged by the tier itself + +Configurable in `team.yaml` under `visibility.inspection_gates`. A `strict_mode: true` flag enables all gates — recommended for first runs on a new codebase or new goal type. + +```yaml +visibility: + strict_mode: false + log_level: normal # normal | verbose + inspection_gates: + t1_plan: true # always — required by design + t2_lead: false # optional — review boundaries before specialists + t2_synthesis: true # recommended — review architecture before implementation + t3_plan: false # verbose — useful early on, disable once T3 is trusted + t5_verdict: false # review T5 joint verdict before T3 marks workstream done + gate_timeout_minutes: 60 # auto-reject if no response within this window +``` + +### 3. Inspection CLI — `cli/agency.py` + +``` +agency run # start a run, returns run_id +agency watch # tail live log (follows blackboard events) +agency inspect # interactive tree view of run state +agency inspect --tier t2 # jump to T2 artifacts +agency inspect --brief # show full brief + result JSON + +agency approve # approve current gate → continue +agency approve --note "..." # approve with a note written to blackboard +agency reject --reason "..." 
# reject → tier re-invoked +agency pause # force-pause at next tier boundary +agency resume # release a manual pause +``` + +`agency inspect` (no flags) renders a live tree: +``` +Run abc123 — "Build webhook ingestion system" +├── T1 Plan ✓ +│ └── [view workplan] +├── T2 Architecture ✓ [GATE: pending review] +│ ├── [view domain boundaries] +│ ├── [view shared assumptions] +│ └── [view canonical architecture] +├── T3 backend-api (active) +│ ├── [view task breakdown] +│ └── T4 workers: 3/7 done, 1 retrying, 3 pending +└── T3 infra (pending) +``` + +### Blackboard Event Vocabulary (extended) + +```python +# existing +"spawned" | "completed" | "failed" | "escalated" | "retried" + +# new — visibility layer +"gate_pending" # runner hit a gate, waiting for human +"gate_approved" # human approved, run continues +"gate_rejected" # human rejected, tier re-invoked +"gate_paused" # manual pause via CLI +"gate_resumed" # manual resume via CLI +"path_amendment" # mid-run tier proposed path change +"log" # human-readable log line (level + message) +``` + +--- + ## Decisions Log **T1 dynamic dispatch** — T1 assesses scope and prescribes tier path and workstream parallelism. It does not prescribe internal tier coordination patterns. @@ -297,3 +610,15 @@ T4 and T5 default to the **coding agent runtime** when available. Falls back to **Coding agent runtime** — Claude Code is default T4/T5 runtime. Opt-in `native_teams` flag available for internal Claude Code parallelism — faster but less blackboard visibility. Default `false`. **Agency-agents integration** — Via git submodule at `agents/`. T1 selects specialists via `config/role_registry.yaml`. `agent_personality` field on task brief; runtime injects as system prompt at spawn time. + +**T3 mesh mechanics** — Blackboard-based coordination. T3s write draft task lists, read peers', reconcile overlaps, commit merged plan. No T4 dispatch until all T3s in the domain have committed. 
Runner enforces timeout (`t3_mesh_timeout_minutes` in config). Chosen over designated T3 lead or direct messaging — fits distributed ownership model, gives full audit trail for free. + +**T1 output schema** — Formal JSON schema defined (2026-03-30). Fields: `run_id`, `goal_anchor`, `complexity`, `retry_budget_multiplier`, `workstreams[]` (id, name, domain, tier_path, parallel_group, t2_specialist, notes), `parallelism` (groups + sequence), `self_critique_summary`. `parallel_group` + `sequence` handles inter-workstream dependencies. + +**T5 consensus** — T3 aggregates all T5 results into joint verdict: `pass | partial | fail`. Split verdict (`partial`) → T3 retries only failed slices, re-runs T5 on those slices. Full `fail` escalates to T2. T3 writes structured joint verdict to blackboard; this is what the optional T5 gate surfaces to Andrew. + +**Path amendment mechanism** — Amending tier writes `path_amendment` event to blackboard (structured JSON: proposed_by, reason, amendment). Runner monitors events table, sends system event notification to relevant higher tier. Higher tier is informed, not blocked. No agent callback plumbing. Amendments surface at next scheduled human gate. + +**Failure handling (distributed)** — Confirmed distributed ownership (2026-03-30). `escalation.py` is logic tiers execute (or runner executes on tier's behalf on timeout/crash), not a central runner concern. Runner only owns: T1 failure, terminal human escalation. See updated Failure Handling table. + +**Run visibility layer** — Added 2026-03-30. Human-readable live log, configurable inspection gates, and `cli/agency.py` inspection/control commands. Designed for debugging and quality evaluation at each tier during early runs. `strict_mode: true` enables all gates. Gates surface tier artifacts + "what happens next" summary to Andrew via Hans. Resolves Q3 (T5 consensus surfaces as gate event with human-readable summary). 
T5 gate (optional) lets Andrew review joint verdict before T3 marks workstream done. From 8f143e779dec6564cbd64e7e4a54cd5183a21488 Mon Sep 17 00:00:00 2001 From: Hans Heinemann Date: Mon, 30 Mar 2026 14:22:39 -0400 Subject: [PATCH 3/4] docs: resolve remaining 3 design questions (spawn ownership, gate UX, mesh timeout) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Spawn calls: runner owns all runtime_adapter.spawn() calls; tiers write status=pending briefs to blackboard, runner's spawn loop acts on them. Gate logic lives in the spawn loop — no gate plumbing needed in agents. - Gate approval UX: Signal reply via Hans + direct CLI both supported. Both write gate_approved to blackboard; runner doesn't care which path. Hans uses pending_gates.json for multi-run disambiguation. - T3 mesh timeout: escalate to T2 (domain boundary problem). If T2 also exhausts retry budget, normal escalation ladder handles it. No force-commit. Add pending_gates.json to directory structure and buildspec. Update runner step in build order with full spawn loop responsibilities. --- docs/buildspec.md | 4 ++- docs/design.md | 69 ++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 71 insertions(+), 2 deletions(-) diff --git a/docs/buildspec.md b/docs/buildspec.md index 2b3cac5..c462bcb 100644 --- a/docs/buildspec.md +++ b/docs/buildspec.md @@ -74,6 +74,8 @@ agent-teams/ ├── runs/ — runtime state, one subdir per run_id │ └── .gitkeep │ +├── pending_gates.json — live file: gates currently awaiting approval (written by runner, read by Hans) +│ └── README.md ``` @@ -488,7 +490,7 @@ T1 completes integration 7. `core/escalation.py` — retry + failure routing logic (called by tiers, not runner centrally) 8. `adapters/runtime/openclaw.py` — wire up sessions_spawn + personality injection 9. `adapters/runtime/claude_code.py` — coding agent runtime, personality via --system-prompt -10. 
`core/team_runner.py` — full run lifecycle: gate logic (gate_pending halt loop, gate_approved resume), path amendment monitor, T1 failure + terminal escalation only +10. `core/team_runner.py` — full run lifecycle: spawn loop (monitors briefs table for `status=pending`, calls runtime_adapter.spawn()), gate logic (gate_pending halt, writes pending_gates.json, gate_approved/rejected resume), path amendment monitor, T3 mesh timeout → T2 escalation, T1 failure + terminal escalation only 11. `cli/agency.py` — run, watch, inspect, approve, reject, pause, resume; `watch` tails blackboard events and renders live log; `inspect` renders run tree 12. `prompts/` — fallback tier prompts (used when no agent_personality set) 13. `adapters/vcs/github.py` — PR creation + branch management diff --git a/docs/design.md b/docs/design.md index 8486f51..9e28b9e 100644 --- a/docs/design.md +++ b/docs/design.md @@ -6,7 +6,7 @@ _Started: 2026-03-14. Last updated: 2026-03-30._ ## Resolved Design Decisions (formerly Open Questions) -All five open questions resolved 2026-03-30. Details in Decisions Log. +All eight open questions resolved 2026-03-30. Details in Decisions Log. 1. **T3 mesh mechanics** → Blackboard-based. T3s write draft task lists, read peers', commit merged plan before T4 dispatch. See _T3 Mesh via Blackboard_. @@ -18,6 +18,12 @@ All five open questions resolved 2026-03-30. Details in Decisions Log. 5. **Failure handling (distributed model)** → Distributed ownership confirmed. Runner only owns T1 failure + terminal human escalation. See updated _Failure Handling_ table. +6. **Who makes spawn calls for T3+ tiers** → Runner monitors briefs table for `status=pending` rows and makes all spawn calls. "Distributed ownership" means the tier's output determines brief content — runner is the mechanical arm. Gates (hold on `gate_pending`) live naturally in the runner's spawn loop. + +7. 
**Gate approval UX** → Both Signal reply (via Hans) and direct CLI are supported — both write to the same blackboard. Runner only cares that a `gate_approved` event exists, not who wrote it. The runner writes `pending_gates.json` in workspace on each `gate_pending` event, and Hans reads it for multi-run disambiguation. + +8. **T3 mesh timeout** → Escalate to T2 (domain boundary problem, T2 should re-scope). If T2 also exhausts its retry budget, the failure escalates up the normal ladder to T1 → Andrew gate. No force-commit fallback (would hide the problem and cause bad T4 dispatch). + --- --- @@ -340,6 +346,61 @@ T3 aggregates all T5 results into a joint verdict after fan-out completes. --- +### Spawn Call Ownership + +The runner is the single point of contact with the runtime adapter. Tiers do not call `sessions_spawn` directly — they write output to the blackboard and the runner acts on it. + +**Flow:** +1. A tier completes and writes child briefs to the `briefs` table with `status=pending` +2. Runner's spawn loop detects pending rows +3. If a gate is configured at this tier boundary → runner writes `gate_pending`, notifies Andrew, halts +4. On `gate_approved` → runner calls `runtime_adapter.spawn()` for each pending brief +5. Spawned agent runs, writes its own child briefs as pending when done → loop continues + +This keeps gate logic in one place (the runner's spawn loop), makes all spawn calls auditable from a single location, and means agents only need blackboard read/write access — no runtime adapter tool access required. + +--- + +### Gate Approval UX + +Two paths, both valid, same outcome — runner only cares that a `gate_approved` event exists in the blackboard: + +**Signal (via Hans):** +Andrew receives the tier summary from Hans in Signal. Replies "approve" or "reject: reason". Hans resolves which run + gate the reply refers to using `workspace/pending_gates.json` (maintained by runner on each `gate_pending` event), then runs `agency approve ` or `agency reject --reason "..."` on Andrew's behalf. 
Hans confirms back: "✅ Approved — T3 spawning now." + +**Direct CLI:** +Andrew runs `agency approve ` from his terminal. Zero-friction when already at a machine. + +**`pending_gates.json` format:** +```json +{ + "gates": [ + { + "run_id": "abc123", + "gate": "t2_synthesis", + "pending_since": "2026-03-30T14:00:00Z", + "summary": "T2 synthesis ready — canonical architecture written" + } + ] +} +``` + +If only one gate is pending, Hans can resolve "approve" without an explicit run_id. If multiple are pending, Hans asks Andrew to specify. + +--- + +### T3 Mesh Timeout + +If T3s in a domain fail to commit their task lists within `t3_mesh_timeout_minutes`: + +1. **Runner escalates to T2** — writes a `gate_pending` escalation event and notifies the T2 specialist that owns the domain. Context: which T3s timed out, what draft lists (if any) exist on the blackboard. T2 re-scopes or clarifies domain boundaries, spawns fresh T3 briefs. + +2. **If T2 also exhausts its retry budget** → normal escalation ladder: T2 failure → T1 handles → T1 failure → Andrew gate. + +Force-committing partial draft lists (optimistic fallback) is explicitly not done — it hides the boundary problem and produces conflicting or duplicate T4 tasks that fail later with less context. + +--- + ### Path Amendment Mechanism When a mid-run tier discovers scope that warrants a different tier path than T1 prescribed: @@ -611,6 +672,12 @@ Run abc123 — "Build webhook ingestion system" **Agency-agents integration** — Via git submodule at `agents/`. T1 selects specialists via `config/role_registry.yaml`. `agent_personality` field on task brief; runtime injects as system prompt at spawn time. +**Spawn call ownership** — Runner is the single point of contact with the runtime adapter. Tiers write `status=pending` child briefs to the blackboard; runner's spawn loop detects and spawns them. Gate logic (hold on `gate_pending`) lives in the spawn loop — no gate plumbing needed in agents. 
Agents only need blackboard read/write access. + +**Gate approval UX** — Both Signal reply (Hans as bridge) and direct `agency approve` CLI are supported. Same blackboard write either way; runner doesn't care which path was used. Hans maintains `pending_gates.json` in workspace to resolve ambiguous replies when multiple gates are pending. Single pending gate → "approve" is unambiguous. + +**T3 mesh timeout** — Escalate to T2 (the specialist that owns the domain). Timeout means T3s can't agree on task boundaries — a domain boundary problem T2 should fix by re-scoping. If T2 exhausts its retry budget, normal escalation ladder handles it (T1 → Andrew gate). No force-commit fallback. + **T3 mesh mechanics** — Blackboard-based coordination. T3s write draft task lists, read peers', reconcile overlaps, commit merged plan. No T4 dispatch until all T3s in the domain have committed. Runner enforces timeout (`t3_mesh_timeout_minutes` in config). Chosen over designated T3 lead or direct messaging — fits distributed ownership model, gives full audit trail for free. **T1 output schema** — Formal JSON schema defined (2026-03-30). Fields: `run_id`, `goal_anchor`, `complexity`, `retry_budget_multiplier`, `workstreams[]` (id, name, domain, tier_path, parallel_group, t2_specialist, notes), `parallelism` (groups + sequence), `self_critique_summary`. `parallel_group` + `sequence` handles inter-workstream dependencies. From 1c99e40f985b723598323adca75d050b693b6fb3 Mon Sep 17 00:00:00 2001 From: Hans Heinemann Date: Mon, 30 Mar 2026 14:31:55 -0400 Subject: [PATCH 4/4] docs: purge OpenClaw/Hans specifics from core design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Portability audit — all platform-specific concerns moved to adapter layer: - Gate Approval UX (Resolved Mechanics): rewritten as platform-agnostic. Core: runner writes gate_pending, calls notify_adapter.send(), polls blackboard for gate_approved. 
Universal path: agency CLI writes directly to blackboard. Adapter handles its own inbound response bridge internally. - pending_gates.json removed from core directory structure and runner responsibilities — adapter-internal state, not a core concern. - 'User → Hans → team_runner.start()' → 'User → team_runner.start()' Core has no dependency on a specific caller. - 'notify_adapter.send(...to Andrew via Hans)' → 'notify_adapter.send()' throughout design.md and buildspec.md. - anthropic.py description: 'via OpenClaw or direct API' → 'direct API' (anthropic adapter never goes via OpenClaw) - Output/review decision: 'Hans messages Andrew' → 'notify_adapter.send()' - Run visibility decision: 'Andrew via Hans' → 'via notify_adapter.send()' - Decisions log: gate approval and visibility entries rewritten accordingly Adapter layer correctly unchanged: adapters/notify/openclaw.py — OpenClaw-specific, owns its inbound bridge adapters/runtime/openclaw.py — OpenClaw sessions_spawn, correctly isolated team.yaml example config — adapter selection is config, not core --- docs/buildspec.md | 12 +++++------- docs/design.md | 40 +++++++++++++++------------------------- 2 files changed, 20 insertions(+), 32 deletions(-) diff --git a/docs/buildspec.md b/docs/buildspec.md index c462bcb..d4152aa 100644 --- a/docs/buildspec.md +++ b/docs/buildspec.md @@ -40,7 +40,7 @@ agent-teams/ │ │ ├── notify.py — abstract notification interface │ │ └── runtime.py — abstract agent runtime interface │ ├── llm/ -│ │ ├── anthropic.py — Claude via OpenClaw or direct API +│ │ ├── anthropic.py — Claude via direct Anthropic API │ │ ├── openai.py — GPT / o-series │ │ └── ollama.py — local models │ ├── vcs/ @@ -74,8 +74,6 @@ agent-teams/ ├── runs/ — runtime state, one subdir per run_id │ └── .gitkeep │ -├── pending_gates.json — live file: gates currently awaiting approval (written by runner, read by Hans) -│ └── README.md ``` @@ -387,7 +385,7 @@ t5: ### 1. 
Run Kickoff ``` -User → Hans → team_runner.start(goal, config) +User → team_runner.start(goal, config) # via CLI or any caller → generate run_id → init blackboard (create runs//blackboard.db) → build T1 brief (goal_anchor = goal, retry_budget from config) @@ -442,7 +440,7 @@ spawn T4 with brief ``` runner reaches configured gate (e.g. t2_synthesis) → write event(gate_pending, detail={tier, summary, what_happens_next}) - → notify_adapter.send(tier summary to Andrew via Hans) + → notify_adapter.send(tier summary + gate context) → halt: poll blackboard for gate_approved or gate_rejected gate_approved: @@ -490,11 +488,11 @@ T1 completes integration 7. `core/escalation.py` — retry + failure routing logic (called by tiers, not runner centrally) 8. `adapters/runtime/openclaw.py` — wire up sessions_spawn + personality injection 9. `adapters/runtime/claude_code.py` — coding agent runtime, personality via --system-prompt -10. `core/team_runner.py` — full run lifecycle: spawn loop (monitors briefs table for `status=pending`, calls runtime_adapter.spawn()), gate logic (gate_pending halt, writes pending_gates.json, gate_approved/rejected resume), path amendment monitor, T3 mesh timeout → T2 escalation, T1 failure + terminal escalation only +10. `core/team_runner.py` — full run lifecycle: spawn loop (monitors briefs table for `status=pending`, calls runtime_adapter.spawn()), gate logic (gate_pending halt, calls notify_adapter.send(), polls for gate_approved/rejected resume), path amendment monitor, T3 mesh timeout → T2 escalation, T1 failure + terminal escalation only 11. `cli/agency.py` — run, watch, inspect, approve, reject, pause, resume; `watch` tails blackboard events and renders live log; `inspect` renders run tree 12. `prompts/` — fallback tier prompts (used when no agent_personality set) 13. `adapters/vcs/github.py` — PR creation + branch management -14. `adapters/notify/openclaw.py` — Hans notification; used for gate surfaces (tier summary to Andrew) +14. 
`adapters/notify/openclaw.py` — OpenClaw notification adapter; bridges gate summaries and run events to the operator via OpenClaw; manages its own inbound response state for gate approval routing 15. `config/team.yaml` — example config with full visibility block 16. `README.md` — how to run, how to add adapters, how to extend the roster; include `agency` CLI reference diff --git a/docs/design.md b/docs/design.md index 9e28b9e..4b18256 100644 --- a/docs/design.md +++ b/docs/design.md @@ -20,7 +20,7 @@ All eight open questions resolved 2026-03-30. Details in Decisions Log. 6. **Who makes spawn calls for T3+ tiers** → Runner monitors briefs table for `status=pending` rows and makes all spawn calls. "Distributed ownership" means the tier's output determines brief content — runner is the mechanical arm. Gates (hold on `gate_pending`) live naturally in the runner's spawn loop. -7. **Gate approval UX** → Both Signal reply (via Hans) and direct CLI are supported — both write to the same blackboard. Runner only cares that a `gate_approved` event exists, not who wrote it. Hans maintains `pending_gates.json` in workspace for multi-run disambiguation. +7. **Gate approval UX** → `agency approve ` CLI writes `gate_approved` directly to the blackboard — the universal path, works on any platform. Runner only cares that the event exists, not how it got there. Notify adapter implementations handle their own inbound response routing (e.g. bridging a chat reply to a CLI call) as internal adapter state — not a core concern. 8. **T3 mesh timeout** → Escalate to T2 (domain boundary problem, T2 should re-scope). If T2 also exhausts its retry budget, escalates up the normal ladder to T1 → Andrew gate. No force-commit fallback (would hide the problem and cause bad T4 dispatch). 
@@ -224,7 +224,7 @@ T2 Lead → writes integration summary → blackboard T1 Accept → validate against goal anchor - → open PR, notify Andrew via Hans + → open PR, notify_adapter.send(pr summary + url) ``` ### Medium Complexity — T1→T3→T4→T5 @@ -363,29 +363,19 @@ This keeps gate logic in one place (the runner's spawn loop), makes all spawn ca ### Gate Approval UX -Two paths, both valid, same outcome — runner only cares that a `gate_approved` event exists in the blackboard: +**Core mechanic (platform-agnostic):** -**Signal (via Hans):** -Andrew receives the tier summary from Hans in Signal. Replies "approve" or "reject: reason". Hans resolves which run + gate the reply refers to using `workspace/pending_gates.json` (maintained by runner on each `gate_pending` event), then runs `agency approve ` or `agency reject --reason "..."` on Andrew's behalf. Hans confirms back: "✅ Approved — T3 spawning now." +1. Runner writes `gate_pending` to blackboard +2. Runner calls `notify_adapter.send()` with tier summary + gate context (`run_id`, `gate`, `summary`, `what_happens_next`) +3. Runner polls blackboard for `gate_approved` or `gate_rejected` +4. `agency approve ` / `agency reject --reason "..."` writes the event directly to the blackboard — the universal approval path, works on any platform with filesystem access -**Direct CLI:** -Andrew runs `agency approve ` from his terminal. Zero-friction when already at a machine. +Runner never reads from a state file, never talks to a notify adapter for inbound responses. It only polls the blackboard. -**`pending_gates.json` format:** -```json -{ - "gates": [ - { - "run_id": "abc123", - "gate": "t2_synthesis", - "pending_since": "2026-03-30T14:00:00Z", - "summary": "T2 synthesis ready — canonical architecture written" - } - ] -} -``` +**Adapter responsibility:** +Each notify adapter handles its own inbound response routing. 
How a human's approval gets translated into an `agency approve` CLI call is entirely the adapter's concern — not core. Example: an OpenClaw adapter bridges a chat reply to the CLI. A Slack adapter wires up a slash command. A webhook adapter listens on an endpoint. All produce the same result: `gate_approved` written to blackboard. -If only one gate is pending, Hans can resolve "approve" without an explicit run_id. If multiple are pending, Hans asks Andrew to specify. +Any internal state the adapter needs to resolve ambiguous responses (e.g. which run_id an approval refers to when multiple gates are pending) is managed by the adapter, not the core. --- @@ -566,7 +556,7 @@ Log level `verbose` adds per-T4-start/done lines. Default is `normal` (tier-leve Configurable pause points. When the runner hits a gate, it: 1. Writes a `gate_pending` event to the blackboard -2. Fires `notify_adapter.send()` with a tier summary to Andrew (via Hans) +2. Fires `notify_adapter.send()` with the tier summary + gate context 3. Halts — no next tier spawns until `gate_approved` or `gate_rejected` is written The tier summary surfaced at each gate includes: @@ -660,7 +650,7 @@ Run abc123 — "Build webhook ingestion system" **Orchestration patterns** — Baked into tier prompts and runner tier-handling logic, not prescribed by T1. T2: Lead + parallel specialists. T3: light mesh within T2 domain. T4: swarm+pipeline. T5: fan-out+consensus. -**Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew. Notification is dual: Hans messages Andrew directly + PR opened on VCS. Merge is gated on human sign-off. +**Output / review** — Nothing merges to main without explicit human approval. T1 opens a PR and fires `notify_adapter.send()` with the PR summary. Merge is gated on human sign-off. The notify adapter implementation determines how the notification is delivered. **Platform agnosticism** — Core is provider and platform agnostic. 
Capability levels (`reasoning-heavy`, `capable`, `fast-cheap`) map to models in config. Mixing providers across tiers is supported. @@ -674,7 +664,7 @@ Run abc123 — "Build webhook ingestion system" **Spawn call ownership** — Runner is the single point of contact with the runtime adapter. Tiers write `status=pending` child briefs to the blackboard; runner's spawn loop detects and spawns them. Gate logic (hold on `gate_pending`) lives in the spawn loop — no gate plumbing needed in agents. Agents only need blackboard read/write access. -**Gate approval UX** — Both Signal reply (Hans as bridge) and direct `agency approve` CLI are supported. Same blackboard write either way; runner doesn't care which path was used. Hans maintains `pending_gates.json` in workspace to resolve ambiguous replies when multiple gates are pending. Single pending gate → "approve" is unambiguous. +**Gate approval UX** — `agency approve ` CLI is the universal approval path — writes `gate_approved` directly to blackboard. Runner only polls blackboard; it does not depend on any specific notification platform. Each notify adapter handles its own inbound response bridge as internal adapter state. Core has no `pending_gates.json` or platform-specific approval logic. **T3 mesh timeout** — Escalate to T2 (the specialist that owns the domain). Timeout means T3s can't agree on task boundaries — a domain boundary problem T2 should fix by re-scoping. If T2 exhausts its retry budget, normal escalation ladder handles it (T1 → Andrew gate). No force-commit fallback. @@ -688,4 +678,4 @@ Run abc123 — "Build webhook ingestion system" **Failure handling (distributed)** — Confirmed distributed ownership (2026-03-30). `escalation.py` is logic tiers execute (or runner executes on tier's behalf on timeout/crash), not a central runner concern. Runner only owns: T1 failure, terminal human escalation. See updated Failure Handling table. -**Run visibility layer** — Added 2026-03-30. 
Human-readable live log, configurable inspection gates, and `cli/agency.py` inspection/control commands. Designed for debugging and quality evaluation at each tier during early runs. `strict_mode: true` enables all gates. Gates surface tier artifacts + "what happens next" summary to Andrew via Hans. Resolves Q3 (T5 consensus surfaces as gate event with human-readable summary). T5 gate (optional) lets Andrew review joint verdict before T3 marks workstream done. +**Run visibility layer** — Added 2026-03-30. Human-readable live log, configurable inspection gates, and `cli/agency.py` inspection/control commands. Designed for debugging and quality evaluation at each tier during early runs. `strict_mode: true` enables all gates. Gates surface tier artifacts + "what happens next" summary via `notify_adapter.send()` — platform-agnostic. Resolves Q3 (T5 consensus surfaces as gate event with human-readable summary). T5 gate (optional) lets the operator review joint verdict before T3 marks workstream done.
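
Reviewer sketch (not part of the patch): the platform-agnostic gate mechanic this series settles on (runner writes `gate_pending`, calls `notify_adapter.send()`, then polls the blackboard for `gate_approved` or `gate_rejected`) could look roughly like the following. It is a minimal sketch under stated assumptions, not the actual `core/team_runner.py`. The `events` columns follow the schema in `docs/buildspec.md`; the helper names `write_event` and `wait_for_gate`, the `notify` callable standing in for `notify_adapter.send()`, and the use of SQLite's implicit `rowid` as an ordering cursor are all hypothetical.

```python
import json
import sqlite3
import time
import uuid
from datetime import datetime, timezone

# Columns mirror the events table in docs/buildspec.md.
EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    event_id   TEXT PRIMARY KEY,
    run_id     TEXT NOT NULL,
    brief_id   TEXT,
    kind       TEXT NOT NULL,
    detail     TEXT,
    created_at TEXT NOT NULL
)
"""


def write_event(conn, run_id, kind, detail):
    """Append one event row; return its rowid (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO events (event_id, run_id, brief_id, kind, detail, created_at) "
        "VALUES (?, ?, NULL, ?, ?, ?)",
        (
            uuid.uuid4().hex,
            run_id,
            kind,
            json.dumps(detail),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return cur.lastrowid


def wait_for_gate(conn, run_id, gate, summary, notify,
                  poll_seconds=1.0, timeout_seconds=None):
    """Write gate_pending, notify once, then poll the blackboard for a verdict.

    `notify` is an assumed stand-in for notify_adapter.send(); the runner never
    reads a reply from it. Returns "gate_approved" or "gate_rejected".
    """
    context = {"gate": gate, "summary": summary}
    pending_rowid = write_event(conn, run_id, "gate_pending", context)
    notify({"run_id": run_id, **context})

    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        # Only consider verdicts written after this gate_pending event.
        rows = conn.execute(
            "SELECT kind, detail FROM events "
            "WHERE run_id = ? AND rowid > ? "
            "AND kind IN ('gate_approved', 'gate_rejected') "
            "ORDER BY rowid",
            (run_id, pending_rowid),
        ).fetchall()
        for kind, detail in rows:
            payload = json.loads(detail) if detail else {}
            if isinstance(payload, dict) and payload.get("gate") == gate:
                return kind
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"gate {gate!r} on run {run_id!r} still pending")
        time.sleep(poll_seconds)
```

In this sketch, an `agency approve` implementation would only need to call something like `write_event(conn, run_id, "gate_approved", {"gate": gate})`. Any chat, Slack, or webhook bridge can do the same from inside its adapter, which keeps the runner's only inbound channel the blackboard.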