docs: add design doc and buildspec (#5)

2026-03-16 15:51:14 -04:00
parent 084cfb0bb2
commit 72bd744664
2 changed files with 645 additions and 0 deletions
--- a/docs/design.md
+++ b/docs/design.md
@@ -0,0 +1,208 @@
+# Tiered Agent Team System — Design Document
+
+_Started: 2026-03-14. Status: Pre-build, gathering requirements._
+
+---
+
+## Overview
+
+A dynamic, hierarchical multi-agent system for software pipelines. Teams assemble on demand, execute, then disband. Inspired by a blend of Hollywood production (dynamic assembly), consulting firms (structured deliverables, hierarchical synthesis), and two-pizza teams (small autonomous squads, clear domain ownership).
+
+---
+
+## Core Principles
+
+**1. Tiers represent cognitive modes, not org chart levels.**
+Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.
+
+**2. Depth is proportional to complexity.**
+Not every task needs every tier. A config change might only need T3→T4. A new product needs the full stack.
+
+**3. Goal anchoring at every level.**
+T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.
+
+**4. Artifacts, not summaries.**
+Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.
+
+**5. Verification is bidirectional.**
+Lower tiers verify correctness. Upper tiers verify alignment with original intent. Both directions catch different failure modes.
+
+**6. Provider agnostic.**
+The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.
+
+**7. Specialist talent pool.**
+Tiers define structure and responsibility. Agent personalities define domain expertise. The two are separate — the same tier can be filled by different specialists depending on the workstream domain.
+
+---
+
+## Tier Definitions
+
+| Tier | Role | Owns | Capability Level |
+|------|------|------|-----------------|
+| T1 | Visionary | Goal, constraints, final acceptance, architectural bets | reasoning-heavy |
+| T2 | Architect | System design, interface contracts, workstream boundaries | reasoning-heavy / capable |
+| T3 | Squad Lead | Workstream delivery, worker coordination, quality gate | capable |
+| T4 | Implementer | Atomic task execution (one file, one function, one test) | fast-cheap |
+| T5 | Verifier | Validation of T4 output — correctness + intent alignment | capable |
+
+T5 runs **parallel to T4**, not above it. It's a quality gate, not a management layer.
+
+Capability levels map to actual models per provider in config — the core system never references a specific model name.
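+
+A minimal sketch of how that lookup might work, assuming the config is a plain dictionary keyed by provider and then by capability level. The function name `resolve_model`, the provider keys, and the model identifiers are hypothetical placeholders, not names taken from this project or from any real provider.
+
+```python
+# Hypothetical config: provider -> capability level -> model identifier.
+# Every name in this dictionary is a placeholder.
+MODEL_CONFIG = {
+    "provider_a": {
+        "reasoning-heavy": "provider-a-large",
+        "capable": "provider-a-medium",
+        "fast-cheap": "provider-a-small",
+    },
+    "provider_b": {
+        "reasoning-heavy": "provider-b-large",
+        "capable": "provider-b-medium",
+        "fast-cheap": "provider-b-small",
+    },
+}
+
+# The three capability levels named in the tier table.
+CAPABILITY_LEVELS = ("reasoning-heavy", "capable", "fast-cheap")
+
+
+def resolve_model(provider: str, capability: str, config: dict = MODEL_CONFIG) -> str:
+    """Translate a capability level into a concrete model name for one provider."""
+    if capability not in CAPABILITY_LEVELS:
+        raise ValueError(f"unknown capability level: {capability!r}")
+    try:
+        return config[provider][capability]
+    except KeyError as exc:
+        raise KeyError(f"no model configured for {provider!r} / {capability!r}") from exc
+```
+
+Core code would pass only a capability level and a provider key, so changing a model means editing config rather than core logic.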
+
+---
+
+## Variable Depth
+
+```
+Config change          T3 → T4
+New feature            T2 → T3 → T4
+Major refactor         T1 → T2 → T3 → T4 → T5
+New system / product   T1 → T2 → T3s (parallel) → T4s → T5s
+```
+
+T3 assesses scope on receipt. If a task is simple enough, T3 handles it directly, without escalating upward or waiting for T2 sign-off.
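+
+The table above can be read as a lookup from an assessed scope to the tiers that get engaged. The sketch below only illustrates the shape of that result: the scope labels and function names are hypothetical, and the real assessment is made by an agent at run time rather than by a fixed table.
+
+```python
+# Tier chains copied from the Variable Depth table; the keys are hypothetical labels.
+TIER_CHAINS = {
+    "config_change": ["T3", "T4"],
+    "new_feature": ["T2", "T3", "T4"],
+    "major_refactor": ["T1", "T2", "T3", "T4", "T5"],
+    "new_system": ["T1", "T2", "T3", "T4", "T5"],
+}
+
+
+def tiers_for(scope: str) -> list[str]:
+    """Return the tiers engaged for an assessed scope, highest tier first."""
+    if scope not in TIER_CHAINS:
+        raise ValueError(f"unknown scope: {scope!r}")
+    return list(TIER_CHAINS[scope])  # copy so callers cannot mutate the table
+
+
+def entry_tier(scope: str) -> str:
+    """Return the tier that receives the task first."""
+    return tiers_for(scope)[0]
+```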
+
+---
+
+## Horizontal Scaling Within Tiers
+
+Each tier can have multiple agents running in parallel:
+
+```
+T1 (1–2 agents)
+├── T2: Backend Architect
+│   ├── T3: API Squad Lead
+│   │   ├── T4: Worker — endpoint A
+│   │   ├── T4: Worker — endpoint B
+│   │   └── T5: Verifier
+│   └── T3: DB Squad Lead
+│       ├── T4: Worker — migrations
+│       └── T5: Verifier
+├── T2: Frontend Architect
+│   └── T3: UI Squad Lead
+│       ├── T4: Worker — component X
+│       └── T4: Worker — component Y
+└── T2: Infra Architect
+    └── T3: Platform Squad Lead
+        └── T4: Worker — config / deploy
+```
+
+---
+
+## Shared State
+
+For software pipelines, **the repo is the primary blackboard**:
+- T4 workers commit to feature branches
+- T3 leads review and merge to workstream branches
+- T2 architects own integration branches
+- T1 does final integration and acceptance
+
+The repo is supplemented by a per-run SQLite coordination store that tracks in-flight workstreams, handoff artifacts, tier status, and retry counts.
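+
+A rough sketch of what such a store could look like, using Python's standard `sqlite3` module. The table names, column names, and helper functions are assumptions made for illustration; the document does not define a schema.
+
+```python
+import sqlite3
+
+# Hypothetical schema: one row per workstream, one row per handoff artifact.
+SCHEMA = """
+CREATE TABLE IF NOT EXISTS workstreams (
+    id          TEXT PRIMARY KEY,
+    tier        TEXT NOT NULL,
+    status      TEXT NOT NULL DEFAULT 'in_flight',
+    retry_count INTEGER NOT NULL DEFAULT 0
+);
+CREATE TABLE IF NOT EXISTS artifacts (
+    id            INTEGER PRIMARY KEY AUTOINCREMENT,
+    workstream_id TEXT NOT NULL REFERENCES workstreams(id),
+    from_tier     TEXT NOT NULL,
+    to_tier       TEXT NOT NULL,
+    payload       TEXT NOT NULL
+);
+"""
+
+
+def open_blackboard(path: str = ":memory:") -> sqlite3.Connection:
+    """Open (or create) the coordination store for one run."""
+    conn = sqlite3.connect(path)
+    conn.executescript(SCHEMA)
+    return conn
+
+
+def record_retry(conn: sqlite3.Connection, workstream_id: str) -> int:
+    """Increment a workstream's retry count and return the new value."""
+    with conn:  # commits on success, rolls back on error
+        conn.execute(
+            "UPDATE workstreams SET retry_count = retry_count + 1 WHERE id = ?",
+            (workstream_id,),
+        )
+    row = conn.execute(
+        "SELECT retry_count FROM workstreams WHERE id = ?", (workstream_id,)
+    ).fetchone()
+    if row is None:
+        raise KeyError(workstream_id)
+    return row[0]
+```
+
+Using one database file per run keeps runs isolated and makes a finished run easy to archive or delete.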
+
+---
+
+## Failure Handling
+
+| Failure | Handler | Action |
+|---------|---------|--------|
+| T4 bad output | T3 | Retry T4 with corrected brief (up to retry_budget) |
+| T4 blocked | T3 | Escalate immediately — no retries |
+| T4 partial output | T3 | Salvage good parts, re-task remainder |
+| T3 workstream stuck | T2 | Re-scope or split the workstream |
+| T2 design wrong | T1 | Re-plan; may discard workstream and restart |
+| Repeated escalation | Surface to user | Block until human unblocks |
+
+Retry limits prevent infinite loops. Escalation path is always upward, never sideways.
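+
+The T4 rows of the table could be routed roughly as follows. This is a sketch under stated assumptions: the names `Decision` and `route_t4_failure` and the action strings are hypothetical, and escalating once the retry budget is used up is an assumption, since the table does not say what happens at that point.
+
+```python
+from dataclasses import dataclass
+
+# Upward-only escalation path; "user" stands for surfacing to a human.
+ESCALATES_TO = {"T4": "T3", "T3": "T2", "T2": "T1", "T1": "user"}
+
+
+@dataclass(frozen=True)
+class Decision:
+    handler: str  # tier that handles the failure
+    action: str   # what the handler does next
+
+
+def route_t4_failure(kind: str, retries_used: int, retry_budget: int) -> Decision:
+    """Decide how T3 reacts to a failed T4 task."""
+    if kind == "blocked":
+        # Blocked tasks are never retried.
+        return Decision("T3", "escalate")
+    if kind == "partial_output":
+        return Decision("T3", "salvage_and_retask")
+    if kind == "bad_output":
+        if retries_used < retry_budget:
+            return Decision("T3", "retry_with_corrected_brief")
+        # Assumption: an exhausted budget escalates rather than looping.
+        return Decision("T3", "escalate")
+    raise ValueError(f"unknown failure kind: {kind!r}")
+```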
+
+---
+
+## Agent Talent Pool
+
+The system builds on [agency-agents](https://github.com/msitarzewski/agency-agents) — a library of 50+ pre-built specialist personalities, each with deep domain expertise, quality standards, and specific deliverables.
+
+**Division of responsibility:**
+- Our system provides: orchestration, tier structure, task briefs, retries, verification gates, shared state
+- Agency-agents provides: the specialist knowledge each agent brings to its role
+
+T1 selects the right specialist from the roster when building workstream briefs. The specialist's personality is injected as the system prompt at spawn time.
+
+**Default tier-to-specialist mapping for software pipelines:**
+
+| Tier | Domain | Agent |
+|------|--------|-------|
+| T1 | Strategy | nexus-strategy |
+| T2 | Backend | software-architect |
+| T2 | Infra | devops-automator |
+| T2 | Data | data-engineer |
+| T3 | Backend | senior-developer |
+| T3 | Reliability | sre |
+| T4 | Frontend | frontend-developer |
+| T4 | Backend | backend-architect |
+| T4 | Database | database-optimizer |
+| T4 | DevOps | devops-automator |
+| T4 | Mobile | mobile-app-builder |
+| T4 | AI/ML | ai-engineer |
+| T4 | Security | security-engineer |
+| T4 | Docs | technical-writer |
+| T5 | Code review | code-reviewer |
+| T5 | Integration | testing-reality-checker |
+| T5 | API | testing-api-tester |
+| T5 | Performance | testing-performance-benchmarker |
+| T5 | Security | security-engineer |
+
+The roster is not fixed — T1 can select any agent from the library based on workstream needs. Non-engineering agents (design, marketing, product) extend the system to non-software pipelines.
+
+---
+
+## Adapter Layers
+
+Everything external is a swappable adapter. Core logic never imports from adapters directly — always through an interface.
+
+```
+Core (platform-agnostic)
+├── team_runner      — run lifecycle, agent spawning, runtime selection
+├── blackboard       — SQLite coordination state
+├── task_brief       — schema + validation
+└── escalation       — retry logic, failure routing
+
+Adapters (swappable)
+├── llm/             — anthropic (now), openai, ollama, any API
+├── notify/          — openclaw (now), slack, email, webhook...
+├── vcs/             — github (now), gitlab, gitea, bare git...
+└── runtime/
+    ├── standard     — openclaw sessions_spawn (T1/T2/T3)
+    └── coding_agent — claude_code (T4/T5 default), codex, aider...
+```
+
+Swapping providers means writing a new adapter file — nothing in core changes.
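+
+One way to express the interface boundary in Python is `typing.Protocol`. The sketch below uses the notify adapter as an example; `Notifier`, `InMemoryNotifier`, and `announce_pr` are hypothetical names, not part of the project.
+
+```python
+from typing import Protocol
+
+
+class Notifier(Protocol):
+    """Interface the core depends on; each adapter provides an implementation."""
+
+    def notify(self, recipient: str, message: str) -> None: ...
+
+
+class InMemoryNotifier:
+    """Stand-in adapter that records messages instead of sending them."""
+
+    def __init__(self) -> None:
+        self.sent: list[tuple[str, str]] = []
+
+    def notify(self, recipient: str, message: str) -> None:
+        self.sent.append((recipient, message))
+
+
+def announce_pr(notifier: Notifier, reviewer: str, pr_url: str) -> None:
+    """Core-side logic: it knows the interface only, never a concrete adapter."""
+    notifier.notify(reviewer, f"PR ready for review: {pr_url}")
+```
+
+Because `Protocol` uses structural typing, an adapter does not have to inherit from `Notifier`; it only has to provide a matching `notify` method.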
+
+T4 and T5 default to the **coding agent runtime** when available. It provides direct file system access, git operations, and test execution, so file contents do not have to be shuttled through message context. If no coding agent runtime is configured, T4 and T5 fall back to the standard runtime.
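+
+The selection rule is small enough to sketch directly. The function name and the runtime labels as string values are assumptions for illustration.
+
+```python
+# Tiers that edit files and run tests, per the adapter tree above.
+CODING_TIERS = {"T4", "T5"}
+
+
+def select_runtime(tier: str, coding_agent_available: bool) -> str:
+    """Pick the runtime adapter for a tier, falling back to the standard one."""
+    if tier in CODING_TIERS and coding_agent_available:
+        return "coding_agent"
+    return "standard"
+```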
+
+---
+
+## Decisions
+
+**Depth decision** — T1 assesses scope on receipt and determines how many tiers to engage. Not pre-configured per task type.
+
+**Trigger mechanism** — User messages Hans → Hans spins up T1 with the goal. T1 takes it from there.
+
+**Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew for review. Merge is gated on human sign-off. Notification is dual: Hans messages Andrew directly, and a PR is opened on the VCS platform so Andrew gets notified natively too. This keeps the review step platform-independent — whichever VCS is in use, Hans always notifies Andrew directly as a fallback.
+
+**Retry limits** — Three failure types, handled differently:
+- *Bad output* → retry T4 with a corrected brief (default: 3 retries)
+- *Blocked* → escalate immediately, no retries
+- *Partial output* → salvage good parts, re-task the remainder
+
+T1 sets a retry budget multiplier during scope assessment (`1x` simple, `2x` complex). Retry budget is a field on the task brief — not hardcoded in the runner.
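+
+One reading of this rule is that the effective budget is the default retry count times T1's multiplier. The sketch below assumes that reading; the constant and function names are hypothetical.
+
+```python
+DEFAULT_RETRIES = 3  # default for bad-output retries, from the list above
+
+# Multipliers T1 assigns during scope assessment.
+MULTIPLIERS = {"simple": 1, "complex": 2}
+
+
+def retry_budget(complexity: str) -> int:
+    """Compute the retry budget to store on a task brief."""
+    if complexity not in MULTIPLIERS:
+        raise ValueError(f"unknown complexity: {complexity!r}")
+    return DEFAULT_RETRIES * MULTIPLIERS[complexity]
+```
+
+Under this reading a simple task gets 3 retries and a complex one gets 6, and the runner reads the number from the brief rather than computing it.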
+
+**Platform agnosticism** — Core logic is provider and platform agnostic. LLMs, VCS, notifications, and agent runtimes are all adapters. Tiers reference capability levels (`reasoning-heavy`, `capable`, `fast-cheap`), not specific model names. Provider-to-model mapping lives in config.
+
+**LLM provider** — Anthropic first implementation. Config supports per-tier provider selection and mixing providers across tiers (e.g. T1 on OpenAI o3, T4 workers on local Ollama).
+
+**Gateway modification** — Decided against. Agent-teams stays standalone Python. OpenClaw is used as the runtime adapter via existing primitives (sessions_spawn, sessions_send, subagents) — called through a skill layer. No gateway fork. Keeps platform agnosticism intact and avoids Node/Python mismatch and fork maintenance burden.
+
+**Coding agent runtime** — Claude Code is the default T4/T5 runtime for software pipelines. It is purpose-built for implementation and verification: direct file access, git ops, test execution. Enters as a runtime adapter — swappable for Codex, Aider, or any equivalent. T1/T2/T3 always use the standard runtime (they reason, they don't edit files).
+
+**Claude Code native teams** — Claude Code has an experimental agent teams feature that fans out sub-agents internally within a session. Integrated as an opt-in flag (`native_teams: true`) in the coding_agent runtime adapter. When enabled, T3 hands a full workstream to Claude Code and it parallelises internally — faster, but less granular blackboard visibility. Default is `false` — explicit T4 spawning is the baseline; native teams is a speed optimisation to enable deliberately.
+
+**Agency-agents integration** — Agent personalities sourced from [msitarzewski/agency-agents](https://github.com/msitarzewski/agency-agents) via git submodule. Included as `agents/` in the repo. T1 selects specialists from the roster via `config/role_registry.yaml`. Each task brief carries an `agent_personality` field (path to the agent .md file) which the runtime adapter injects as the system prompt at spawn time. Adding new specialists means adding an entry to the registry — no core changes required.
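+
+A minimal sketch of a task brief carrying the `agent_personality` and retry budget fields, and of how a runtime adapter might load the personality file as the system prompt. Only those two fields come from this document; the `goal` and `task` fields, the function names, and the assumption that the path is relative to the repo root are hypothetical.
+
+```python
+import json
+from dataclasses import dataclass, fields
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class TaskBrief:
+    goal: str               # T1's original intent (hypothetical field)
+    task: str               # the slice this agent owns (hypothetical field)
+    retry_budget: int       # set by T1 during scope assessment
+    agent_personality: str  # path to the agent .md file
+
+
+def parse_brief(raw: str) -> TaskBrief:
+    """Parse a JSON task brief and reject it if a required field is missing."""
+    data = json.loads(raw)
+    names = [f.name for f in fields(TaskBrief)]
+    missing = [n for n in names if n not in data]
+    if missing:
+        raise ValueError(f"task brief is missing fields: {missing}")
+    return TaskBrief(**{n: data[n] for n in names})
+
+
+def system_prompt_for(brief: TaskBrief, repo_root: str) -> str:
+    """Read the personality file that gets injected at spawn time."""
+    # Assumption: agent_personality is a path relative to the repo root.
+    return (Path(repo_root) / brief.agent_personality).read_text(encoding="utf-8")
+```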