diff --git a/docs/design.md b/docs/design.md
index f51061f..5679dc9 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -1,6 +1,24 @@
 # Tiered Agent Team System — Design Document

-_Started: 2026-03-14. Status: Pre-build, gathering requirements._
+_Started: 2026-03-14. Last updated: 2026-03-16 (evening)._
+
+---
+
+## Open Design Questions
+
+The following areas are identified but not yet resolved. Work through these before implementing `core/team_runner.py`.
+
+1. **T3 mesh mechanics** — How do T3s within the same T2 domain coordinate? Via blackboard, direct message exchange, or a designated T3 lead? What does "negotiate task boundaries" look like concretely?
+
+2. **T1 output schema** — What does T1's Plan phase output look like as structured data? Needs a formal schema: workstreams, tier paths, parallelism flags, retry budget, T2 specialist list. This is what the runner parses to bootstrap the pipeline.
+
+3. **T5 consensus mechanics** — Individual T5s review their slice and produce results. Who aggregates? What does the joint verdict look like as structured data? What happens on split verdict (some T5s pass, some fail)?
+
+4. **Path amendment mechanism** — When a mid-run tier proposes a path amendment, what's the concrete mechanism? Who writes to the blackboard, in what format, and how does the relevant higher tier get notified?
+
+5. **Failure handling (distributed model)** — The current failure table assumes centralised runner handling. Needs to be rewritten to reflect distributed ownership: T3 handles T4 failures, T2 handles T3 failures, T1 handles T2 failures. Runner only handles T1 failure and terminal escalation to human.
+
+---

 ---

@@ -16,7 +34,7 @@ A dynamic, hierarchical multi-agent system for software pipelines. Teams assembl
 Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.

 **2. Depth is proportional to complexity.**
-Not every task needs every tier. A config change might only need T3→T4. A new product needs the full stack.
+Not every task needs every tier. A config change might only need T3→T4. A new product needs the full stack. T1 assesses scope and prescribes the path — it is never pre-configured.

 **3. Goal anchoring at every level.**
 T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.
@@ -24,8 +42,8 @@ T1's original intent is embedded in every agent's context — not just passed to
 **4. Artifacts, not summaries.**
 Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.

-**5. Verification is bidirectional.**
-Lower tiers verify correctness. Upper tiers verify alignment with original intent. Both directions catch different failure modes.
+**5. Verification is mandatory.**
+T5 always runs. Nothing returns to T1 unverified. T5 is a quality gate, not optional — things should work and work well before they surface upward.

 **6. Provider agnostic.**
 The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.
@@ -39,52 +57,112 @@ Tiers define structure and responsibility. Agent personalities define domain exp
 | Tier | Role | Owns | Capability Level |
 |------|------|------|-----------------|
-| T1 | Visionary | Goal, constraints, final acceptance, architectural bets | reasoning-heavy |
+| T1 | Visionary | Goal, constraints, dispatch plan, final acceptance | reasoning-heavy |
 | T2 | Architect | System design, interface contracts, workstream boundaries | reasoning-heavy / capable |
-| T3 | Squad Lead | Workstream delivery, worker coordination, quality gate | capable |
+| T3 | Squad Lead | Workstream delivery, T4 management, quality gate | capable |
 | T4 | Implementer | Atomic task execution (one file, one function, one test) | fast-cheap |
 | T5 | Verifier | Validation of T4 output — correctness + intent alignment | capable |

-T5 runs **parallel to T4**, not above it. It's a quality gate, not a management layer.
+T5 runs **within T3's scope**, not above it. T3 commissions T5 verification of its T4 outputs. T5 is a quality gate, not a management layer.

 Capability levels map to actual models per provider in config — the core system never references a specific model name.

 ---

-## Variable Depth
+## Dispatch Model

-```
-Config change          T3 → T4
-New feature            T2 → T3 → T4
-Major refactor         T1 → T2 → T3 → T4 → T5
-New system / product   T1 → T2 → T3s (parallel) → T4s → T5s
-```
+### T1 Owns the Plan

-T3 assesses scope on receipt. If a task is simple enough, it handles it directly without spawning upward or waiting for T2 sign-off.
+T1 is not just a decomposer — it is the dispatch planner. Its output declares:
+
+- **Workstreams** — the decomposed units of work
+- **Tier path per workstream** — which tiers to engage (e.g. `[T2, T3, T4, T5]` or `[T4, T5]` for trivial tasks)
+- **Parallelism** — which workstreams are independent and can run concurrently
+
+T1 does not prescribe how each tier operates internally. That is the tier's own concern.
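One possible shape for the plan T1 emits, sketched in Python. The real schema is still listed as an open design question, so every name here (`Workstream`, `DispatchPlan`, `depends_on`, `parallel_batches`) is hypothetical. Only the three declared concepts (workstreams, tier path per workstream, parallelism) come from the text above.

```python
from dataclasses import dataclass, field

VALID_TIERS = ("T1", "T2", "T3", "T4", "T5")


@dataclass
class Workstream:
    """One decomposed unit of work (hypothetical shape, not the final schema)."""
    name: str
    tier_path: list[str]  # tiers to engage, e.g. ["T2", "T3", "T4", "T5"]
    depends_on: list[str] = field(default_factory=list)  # assumed: empty means independent

    def validate(self) -> None:
        unknown = [t for t in self.tier_path if t not in VALID_TIERS]
        if unknown:
            raise ValueError(f"{self.name}: unknown tiers {unknown}")
        # Principle 5 says T5 always runs, so every path must end in verification.
        if not self.tier_path or self.tier_path[-1] != "T5":
            raise ValueError(f"{self.name}: tier path must end with T5")


@dataclass
class DispatchPlan:
    """T1's Plan-phase output (hypothetical shape)."""
    goal: str
    workstreams: list[Workstream]

    def parallel_batches(self) -> list[list[str]]:
        """Group workstream names into batches; members of a batch can run concurrently."""
        done: set[str] = set()
        remaining = {w.name: w for w in self.workstreams}
        batches: list[list[str]] = []
        while remaining:
            ready = sorted(n for n, w in remaining.items() if set(w.depends_on) <= done)
            if not ready:
                raise ValueError("dependency cycle or unknown dependency in plan")
            batches.append(ready)
            done.update(ready)
            for name in ready:
                del remaining[name]
        return batches


plan = DispatchPlan(
    goal="Add rate limiting to the public API",  # illustrative goal only
    workstreams=[
        Workstream("api", ["T2", "T3", "T4", "T5"]),
        Workstream("config", ["T4", "T5"]),
        Workstream("docs", ["T4", "T5"], depends_on=["api"]),
    ],
)
for ws in plan.workstreams:
    ws.validate()
print(plan.parallel_batches())  # [['api', 'config'], ['docs']]
```

Expressing parallelism as explicit dependencies rather than a boolean flag is a design guess: it lets whoever reads the plan derive concurrent batches without T1 having to say how any tier operates internally.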
+
+### T1 Lifecycle — Two Explicit Phases
+
+T1 is invoked twice per run, each with a distinct prompt and purpose:
+
+**Phase 1 — Plan:**
+1. T1 produces initial dispatch plan (workstreams, tier paths, parallelism, retry budget)
+2. T1 self-critiques its own plan in a single follow-up pass ("what could go wrong, what did I miss?") and amends
+3. Amended plan surfaces to Andrew for approval — no T2s spawn until approval is given
+
+**Phase 2 — Accept:**
+After the full T2→T3→T4→T5 pipeline completes, T1 is re-invoked with the final output. It validates against the original goal anchor and either accepts (opens PR) or rejects (escalates back down).
+
+Both phases are named explicitly in the task brief schema and tracked on the blackboard.
+
+### Each Tier Owns the Layer Below
+
+Control flow is distributed, not centralised:
+
+- T1 manages its T2s
+- T2 Lead manages T2 specialists and their domain boundaries
+- T2 specialists each own their T3s
+- **T3 manages its T4s** — including dependency graph, parallelism, and T5 commissioning
+- The runner is thin: bootstrap T1, monitor the blackboard, handle final result and notifications
+
+This means orchestration logic lives in agent prompts and output schemas — not in Python runner code. Adding a new execution pattern means updating a prompt, not the runner.
+
+**Tradeoff:** Debugging is harder. When something fails mid-chain, you read blackboard logs rather than step through central runner code. This is a tooling problem to solve (good blackboard inspection), not a design flaw to avoid.
+
+### Dynamic Paths
+
+Tiers can propose path amendments mid-run (e.g. T3 discovers scope that warrants a T2 pass it didn't get). Amendments are logged to the blackboard. Higher tiers are notified but do not need to approve — it is informational. No tier silently changes the plan.
+
+---
+
+## Orchestration Patterns Per Tier
+
+Different tiers suit different internal coordination patterns. These are baked into the runner's tier-handling logic and the tier prompts — not prescribed by T1.
+
+| Tier | Pattern | Rationale |
+|------|---------|-----------|
+| T1 | Single agent, two phases | Must be authoritative; plan phase + accept phase |
+| T2 Lead | Coordinator | Spawned first; defines boundaries + shared assumptions; drives conflict resolution; produces canonical architecture |
+| T2 Specialists | Parallel fan-out | Each works independently within its domain; reads Lead's boundaries + shared assumptions doc before starting |
+| T3 | Light mesh | Peer coordination within same T2 domain to negotiate task boundaries before T4 dispatch |
+| T4 | Swarm + pipeline hybrid | Independent tasks run as swarm; dependent tasks pipeline (T4-A's output feeds T4-B). T3 declares which is which. |
+| T5 | Parallel fan-out + consensus | Each T5 reviews its slice independently, then compares notes for a joint verdict — catches both artifact bugs and integration issues |
+
+### T2 Flow in Detail
+
+1. T1 spawns **T2 Lead Architect** with goal + workstream context
+2. Lead defines explicit **domain boundaries** (who owns what, hard edges)
+3. Lead publishes **shared assumptions doc** — cross-cutting concerns, key conventions, architectural constraints (auth approach, data formats, API patterns, etc.)
+4. T1 spawns **T2 specialists** with boundaries + shared assumptions baked into their briefs
+5. Specialists work in parallel, each within their defined domain
+6. Lead reads all proposals, drives **conflict resolution** with relevant specialists if needed (cycle limit in config — fixed, not per-workstream)
+7. Lead produces **canonical architecture** → written to blackboard as distinct artifact
+8. T1 (Accept phase) validates canonical architecture against goal anchor
+9. Canonical architecture becomes T3 briefs — each T2 specialist hands off to its own T3s

 ---

 ## Horizontal Scaling Within Tiers

-Each tier can have multiple agents running in parallel:
-
 ```
-T1 (1–2 agents)
-├── T2: Backend Architect
-│   ├── T3: API Squad Lead
-│   │   ├── T4: Worker — endpoint A
-│   │   ├── T4: Worker — endpoint B
-│   │   └── T5: Verifier
-│   └── T3: DB Squad Lead
-│       ├── T4: Worker — migrations
-│       └── T5: Verifier
-├── T2: Frontend Architect
-│   └── T3: UI Squad Lead
-│       ├── T4: Worker — component X
-│       └── T4: Worker — component Y
-└── T2: Infra Architect
-    └── T3: Platform Squad Lead
-        └── T4: Worker — config / deploy
+T1 — Phase 1: Plan (self-critique → Andrew approval)
+│
+├── T2: Lead Architect (boundaries + shared assumptions first)
+│   ├── T2: Backend Architect  ─┐
+│   ├── T2: Frontend Architect  ├─ parallel, within defined domains
+│   └── T2: Infra Architect    ─┘
+│         │
+│         └── (Lead synthesises → conflict resolution if needed → canonical architecture)
+│
+├── T2 Backend Architect owns:
+│   ├── T3: API Squad Lead  ─┐
+│   └── T3: DB Squad Lead   ─┴─ light mesh within domain
+│         ├── T4: Worker A  ─┐
+│         ├── T4: Worker B  ─┼─ swarm / pipeline (T3 decides)
+│         └── T4: Worker C  ─┘
+│               └── T5: Verifier(s) — fan-out + consensus
+│
+└── T1 — Phase 2: Accept (validates against goal anchor → PR)
 ```

 ---

@@ -97,7 +175,11 @@ For software pipelines, **the repo is the primary blackboard**:
 - T2 architects own integration branches
 - T1 does final integration and acceptance

-Supplemented by a SQLite coordination store per run tracking in-flight workstreams, handoff artifacts, tier status, and retry counts.
+Supplemented by a SQLite coordination store per run tracking:
+- In-flight workstreams and their current execution plans
+- Handoff artifacts and tier status
+- Retry counts and escalation history
+- Path amendments (proposed, by whom, timestamp)

 ---

@@ -114,6 +196,8 @@ Supplemented by a SQLite coordination store per run tracking in-flight workstrea

 Retry limits prevent infinite loops. Escalation path is always upward, never sideways.

+T1 sets a retry budget multiplier during scope assessment (`1x` simple, `2x` complex). Retry budget is a field on the task brief — not hardcoded in the runner.
+
 ---

 ## Agent Talent Pool
@@ -150,7 +234,7 @@ T1 selects the right specialist from the roster when building workstream briefs.
 | T5 | Performance | testing-performance-benchmarker |
 | T5 | Security | security-engineer |

-The roster is not fixed — T1 can select any agent from the library based on workstream needs. Non-engineering agents (design, marketing, product) extend the system to non-software pipelines.
+The roster is not fixed — T1 can select any agent from the library based on workstream needs.

 ---

@@ -160,7 +244,7 @@ Everything external is a swappable adapter. Core logic never imports from adapte

 ```
 Core (platform-agnostic)
-├── team_runner — run lifecycle, agent spawning, runtime selection
+├── team_runner — thin bootstrap: spawn T1, monitor blackboard, handle result
 ├── blackboard — SQLite coordination state
 ├── task_brief — schema + validation
 └── escalation — retry logic, failure routing
@@ -176,33 +260,40 @@ Adapters (swappable)

 Swapping providers means writing a new adapter file — nothing in core changes.

-T4 and T5 default to the **coding agent runtime** when available. It provides direct file system access, git operations, and test execution — no need to shuttle file contents through message context. Falls back to standard runtime gracefully if not configured.
+T4 and T5 default to the **coding agent runtime** when available. Falls back to standard runtime gracefully if not configured.

 ---

-## Decisions
+## Decisions Log

-**Depth decision** — T1 assesses scope on receipt and determines how many tiers to engage. Not pre-configured per task type.
+**T1 dynamic dispatch** — T1 assesses scope and prescribes tier path and workstream parallelism. It does not prescribe internal tier coordination patterns.

-**Trigger mechanism** — User messages Hans → Hans spins up T1 with the goal. T1 takes it from there.
+**T1 two-phase lifecycle** — T1 has two explicit named phases: Plan and Accept. Plan phase includes self-critique (single pass) then human approval gate before T2s spawn. Accept phase validates final output against goal anchor. Both phases tracked on blackboard with distinct prompts.

-**Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew for review. Merge is gated on human sign-off. Notification is dual: Hans messages Andrew directly, and a PR is opened on the VCS platform so Andrew gets notified natively too. This keeps the review step platform-independent — whichever VCS is in use, Hans always notifies Andrew directly as a fallback.
+**T1 self-critique** — Single pass only. Diminishing returns on multiple self-critique iterations; the human review after is the real safety net. Self-critique catches obvious gaps; Andrew catches strategic ones.

-**Retry limits** — Three failure types, handled differently:
-- *Bad output* → retry T4 with a corrected brief (default: 3 retries)
-- *Blocked* → escalate immediately, no retries
-- *Partial output* → salvage good parts, re-task the remainder
+**Distributed ownership** — Each tier owns the layer below it. Runner is thin. Tradeoff: distributed control makes the system extensible but debugging requires good blackboard tooling, not central runner traces.

-T1 sets a retry budget multiplier during scope assessment (`1x` simple, `2x` complex). Retry budget is a field on the task brief — not hardcoded in the runner.
+**T5 always mandatory** — No skipping verification. Things should work and work well before surfacing to T1.

-**Platform agnosticism** — Core logic is provider and platform agnostic. LLMs, VCS, notifications, and agent runtimes are all adapters. Tiers reference capability levels (`reasoning-heavy`, `capable`, `fast-cheap`), not specific model names. Provider-to-model mapping lives in config.
+**T3 owns T4 and T5** — T3 manages its T4s (dependency graph, swarm vs pipeline, parallelism) and commissions T5 verification of T4 outputs. Runner does not orchestrate T4/T5 centrally.

-**LLM provider** — Anthropic first implementation. Config supports per-tier provider selection and mixing providers across tiers (e.g. T1 on OpenAI o3, T4 workers on local Ollama).
+**T2 Lead Architect** — Dedicated T2 role, not a new tier. Spawned first by T1. Owns: domain boundary definition, shared assumptions doc, conflict resolution between specialists, canonical architecture synthesis. Specialists spawn after Lead publishes boundaries + assumptions. Each T2 specialist owns its own T3s — no T3 spans T2 domains.

-**Gateway modification** — Decided against. Agent-teams stays standalone Python. OpenClaw is used as the runtime adapter via existing primitives (sessions_spawn, sessions_send, subagents) — called through a skill layer. No gateway fork. Keeps platform agnosticism intact and avoids Node/Python mismatch and fork maintenance burden.
+**T2 conflict resolution** — Lead sends targeted briefs back to conflicting specialists. Cycle limit is a fixed config value (not per-workstream). This mirrors T1's single self-critique pass: a fixed limit, not a variable one.

-**Coding agent runtime** — Claude Code is the default T4/T5 runtime for software pipelines. It is purpose-built for implementation and verification: direct file access, git ops, test execution. Enters as a runtime adapter — swappable for Codex, Aider, or any equivalent. T1/T2/T3 always use the standard runtime (they reason, they don't edit files).
+**T2 shared assumptions** — Lead publishes cross-cutting concerns (auth, data formats, API conventions, etc.) before specialists start. Specialists design with shared baseline; implicit dependencies pre-empted rather than caught in synthesis.

-**Claude Code native teams** — Claude Code has an experimental agent teams feature that fans out sub-agents internally within a session. Integrated as an opt-in flag (`native_teams: true`) in the coding_agent runtime adapter. When enabled, T3 hands a full workstream to Claude Code and it parallelises internally — faster, but less granular blackboard visibility. Default is `false` — explicit T4 spawning is the baseline; native teams is a speed optimisation to enable deliberately.
+**Orchestration patterns** — Baked into tier prompts and runner tier-handling logic, not prescribed by T1. T2: Lead + parallel specialists. T3: light mesh within T2 domain. T4: swarm+pipeline. T5: fan-out+consensus.

-**Agency-agents integration** — Agent personalities sourced from [msitarzewski/agency-agents](https://github.com/msitarzewski/agency-agents) via git submodule. Included as `agents/` in the repo. T1 selects specialists from the roster via `config/role_registry.yaml`. Each task brief carries an `agent_personality` field (path to the agent .md file) which the runtime adapter injects as the system prompt at spawn time. Adding new specialists means adding an entry to the registry — no core changes required.
+**Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew. Notification is dual: Hans messages Andrew directly + PR opened on VCS. Merge is gated on human sign-off.
+
+**Platform agnosticism** — Core is provider and platform agnostic. Capability levels (`reasoning-heavy`, `capable`, `fast-cheap`) map to models in config. Mixing providers across tiers is supported.
+
+**LLM provider** — Anthropic first implementation. Config supports per-tier provider selection.
+
+**Gateway modification** — Decided against. Agent-teams stays standalone Python. OpenClaw used via runtime adapter only.
+
+**Coding agent runtime** — Claude Code is default T4/T5 runtime. Opt-in `native_teams` flag available for internal Claude Code parallelism — faster but less blackboard visibility. Default `false`.
+
+**Agency-agents integration** — Via git submodule at `agents/`. T1 selects specialists via `config/role_registry.yaml`. `agent_personality` field on task brief; runtime injects as system prompt at spawn time.
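A minimal sketch of the injection step from the last decision, assuming the task brief is a plain dict and the runtime adapter accepts a dict as its spawn request. `build_spawn_request`, the `task` key, the returned shape, and the example file name are all hypothetical; only the `agent_personality` field and the `agents/` location come from the text above.

```python
import tempfile
from pathlib import Path


def build_spawn_request(brief: dict, repo_root: Path) -> dict:
    """Turn a task brief into a spawn request, using the personality file as system prompt."""
    agents_dir = (repo_root / "agents").resolve()
    personality_path = (repo_root / brief["agent_personality"]).resolve()
    # Assumed safety check: refuse paths that escape the agents/ submodule.
    if agents_dir not in personality_path.parents:
        raise ValueError(f"personality must live under agents/: {brief['agent_personality']}")
    return {
        "system_prompt": personality_path.read_text(encoding="utf-8"),
        "task": brief["task"],
    }


# Demo against a throwaway directory standing in for the repo checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "agents").mkdir()
    (root / "agents" / "security-engineer.md").write_text(
        "You are a security engineer.", encoding="utf-8"
    )
    brief = {
        "agent_personality": "agents/security-engineer.md",  # hypothetical path
        "task": "Review the auth module",
    }
    request = build_spawn_request(brief, root)
    print(request["system_prompt"])  # You are a security engineer.
```

Keeping the lookup as a pure function of the brief and the repo root means core code never needs to know which runtime adapter ends up consuming the request.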