docs: add design doc and buildspec (#5)

2026-03-16 15:51:14 -04:00
parent 084cfb0bb2
commit 72bd744664
2 changed files with 645 additions and 0 deletions
@@ -0,0 +1,437 @@
 # Tiered Agent Team System — Build Spec
 _Started: 2026-03-15. Status: Pre-build._
 _See agent-teams-design.md for the design doc and decisions log._
 ---
 ## Language & Runtime
 **Python 3.11+.** Reasons:
 - Agent/AI tooling is Python-first
 - Clean type hints + dataclasses for schemas
 - Agents can read and modify their own orchestration code
 - Runs anywhere — no Node, no OpenClaw dependency
 ---
 ## Repository
 Standalone repo: `[email protected]:coding-with-hans-heinemann/the-agency.git`
 Separate from the OpenClaw workspace. The OpenClaw workspace gets a thin integration layer that calls into this repo. The core is portable and runs without OpenClaw.
 ---
 ## Directory Structure
 ```
 agent-teams/
 ├── core/
 │   ├── team_runner.py       — run lifecycle, agent spawning
 │   ├── blackboard.py        — SQLite coordination state
 │   ├── task_brief.py        — schema + validation
 │   └── escalation.py        — retry logic, failure routing
 │
 ├── adapters/
 │   ├── base/
 │   │   ├── llm.py           — abstract LLM interface
 │   │   ├── vcs.py           — abstract VCS interface
 │   │   ├── notify.py        — abstract notification interface
 │   │   └── runtime.py       — abstract agent runtime interface
 │   ├── llm/
 │   │   ├── anthropic.py     — Claude via OpenClaw or direct API
 │   │   ├── openai.py        — GPT / o-series
 │   │   └── ollama.py        — local models
 │   ├── vcs/
 │   │   └── github.py
 │   ├── notify/
 │   │   └── openclaw.py      — messages Hans who notifies Andrew
 │   └── runtime/
 │       ├── openclaw.py      — sessions_spawn (general purpose)
 │       └── claude_code.py   — coding agent runtime (file/git/exec tools)
 │
 ├── agents/                  — git submodule: msitarzewski/agency-agents
 │   ├── engineering/
 │   ├── testing/
 │   ├── strategy/
 │   └── ...                  — full agency-agents roster
 │
 ├── prompts/
 │   ├── t1_visionary.md      — fallback if no agent_personality set
 │   ├── t2_architect.md
 │   ├── t3_squad_lead.md
 │   ├── t4_implementer.md
 │   └── t5_verifier.md
 │
 ├── config/
 │   ├── team.yaml            — example run configuration
 │   └── role_registry.yaml   — maps (tier, domain) → agent personality file
 │
 ├── runs/                    — runtime state, one subdir per run_id
 │   └── .gitkeep
 │
 └── README.md
 ```
 ---
 ## Blackboard
 SQLite. One file per run at `runs/<run_id>/blackboard.db`.
 ### Tables
 **runs**
 ```sql
 CREATE TABLE runs (
    run_id      TEXT PRIMARY KEY,
    goal        TEXT NOT NULL,
    status      TEXT NOT NULL,  -- pending | active | review | done | failed
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL
 );
 ```
 **workstreams**
 ```sql
 CREATE TABLE workstreams (
    workstream_id   TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    name            TEXT NOT NULL,
    tier            INTEGER NOT NULL,
    status          TEXT NOT NULL,  -- pending | active | blocked | done | failed
    owner_agent_id  TEXT,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
 );
 ```
 **briefs**
 ```sql
 CREATE TABLE briefs (
    brief_id        TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    parent_brief_id TEXT,
    workstream_id   TEXT,
    tier            INTEGER NOT NULL,
    role            TEXT NOT NULL,
    status          TEXT NOT NULL,  -- pending | active | done | failed
    payload         TEXT NOT NULL,  -- full JSON brief
    result          TEXT,           -- JSON result when done
    retry_count     INTEGER DEFAULT 0,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
 );
 ```
 **events**
 ```sql
 CREATE TABLE events (
    event_id    TEXT PRIMARY KEY,
    run_id      TEXT NOT NULL,
    brief_id    TEXT,
    kind        TEXT NOT NULL,  -- spawned | completed | failed | escalated | retried
    detail      TEXT,           -- JSON
    created_at  TEXT NOT NULL
 );
 ```
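As a rough sketch of how `core/blackboard.py` might create the per-run database and append to the `events` table: the helper names `init_blackboard` and `log_event`, the use of UUIDs for `event_id`, and UTC ISO-8601 timestamps are assumptions made for illustration, not part of the spec. Only the `events` table is created here to keep the example short.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from pathlib import Path

# DDL copied from the events table definition above.
EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    event_id    TEXT PRIMARY KEY,
    run_id      TEXT NOT NULL,
    brief_id    TEXT,
    kind        TEXT NOT NULL,
    detail      TEXT,
    created_at  TEXT NOT NULL
);
"""

EVENT_KINDS = {"spawned", "completed", "failed", "escalated", "retried"}


def init_blackboard(runs_dir: Path, run_id: str) -> sqlite3.Connection:
    """Create runs/<run_id>/blackboard.db (hypothetical helper) and return a connection."""
    db_dir = runs_dir / run_id
    db_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_dir / "blackboard.db")
    conn.execute(EVENTS_DDL)
    conn.commit()
    return conn


def log_event(conn, run_id, kind, brief_id=None, detail=None):
    """Append one row to events and return its generated event_id."""
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown event kind: {kind!r}")
    event_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO events (event_id, run_id, brief_id, kind, detail, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            event_id,
            run_id,
            brief_id,
            kind,
            json.dumps(detail) if detail is not None else None,  # detail is stored as JSON text
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return event_id
```

Rejecting unknown `kind` values in Python rather than with a SQL `CHECK` constraint is one possible choice; it keeps the DDL identical to the table definition above.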
 ---
 ## Task Brief Schema
 Every brief passed between tiers is a validated JSON object. `goal_anchor` is immutable — set by T1, copied verbatim into every downstream brief.
 ```json
 {
  "brief_id": "uuid",
  "run_id": "uuid",
  "parent_brief_id": "uuid | null",
  "tier": 4,
  "role": "implementer",
  "goal_anchor": "Original T1 intent — always propagated unchanged",
  "workstream": "backend-api",
  "task": "Implement POST /webhooks/ingest endpoint",
  "acceptance_criteria": [
    "Accepts JSON payload",
    "Returns 202 on success",
    "Writes to queue"
  ],
  "constraints": [
    "Use existing queue client in src/queue.py",
    "No new dependencies"
  ],
  "context": {
    "relevant_files": ["src/routes/webhooks.py", "src/queue.py"],
    "interface_contract": "..."
  },
  "retry_budget": 3,
  "retry_count": 0,
  "preferred_runtime": "coding_agent",
  "agent_personality": "agents/engineering/engineering-backend-architect.md",
  "created_at": "ISO-8601"
 }
 ```
 `preferred_runtime` is optional. T3 sets it to `"coding_agent"` when spawning T4/T5 for implementation or verification tasks. The runner falls back to `"standard"` if the coding agent runtime is not configured.
 `agent_personality` is optional. When set, the runtime adapter reads the file and injects its contents as the system prompt at spawn time. When it is not set, the adapter falls back to the generic tier prompt in `prompts/`.
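One way `core/task_brief.py` could enforce these rules is a frozen dataclass that validates on construction and derives child briefs from a parent. This is only a sketch: the class name `TaskBrief`, the `child` method, the default values, and the specific validation checks are assumptions, and the `context` field is omitted for brevity.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

RUNTIMES = {"standard", "coding_agent"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TaskBrief:
    run_id: str
    tier: int
    role: str
    goal_anchor: str
    task: str
    workstream: str | None = None
    parent_brief_id: str | None = None
    acceptance_criteria: tuple[str, ...] = ()
    constraints: tuple[str, ...] = ()
    retry_budget: int = 3
    retry_count: int = 0
    preferred_runtime: str = "standard"
    agent_personality: str | None = None
    brief_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if not 1 <= self.tier <= 5:
            raise ValueError(f"tier must be 1..5, got {self.tier}")
        if not self.goal_anchor.strip():
            raise ValueError("goal_anchor must not be empty")
        if self.preferred_runtime not in RUNTIMES:
            raise ValueError(f"unknown runtime: {self.preferred_runtime!r}")
        if self.retry_count > self.retry_budget:
            raise ValueError("retry_count exceeds retry_budget")

    def child(self, *, tier: int, role: str, task: str, **overrides) -> TaskBrief:
        """Derive a downstream brief; goal_anchor is copied verbatim from the parent."""
        if "goal_anchor" in overrides:
            raise ValueError("goal_anchor is immutable and cannot be overridden")
        return TaskBrief(
            run_id=self.run_id,
            tier=tier,
            role=role,
            goal_anchor=self.goal_anchor,
            task=task,
            parent_brief_id=self.brief_id,
            **overrides,
        )
```

Routing every downstream brief through `child` means the anchor can only ever be copied, never retyped, which is the property the schema asks for.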
 ---
 ## Adapter Interfaces
 ### LLM (`adapters/base/llm.py`)
 ```python
 from abc import ABC, abstractmethod
 class LLMAdapter(ABC):
     # capability: "reasoning-heavy" | "capable" | "fast-cheap"
     @abstractmethod
     def complete(self, prompt: str, capability: str, context: dict) -> str: ...
     @abstractmethod
     def resolve_model(self, capability: str) -> str: ...
 ```
 ### VCS (`adapters/base/vcs.py`)
 ```python
 from abc import ABC, abstractmethod
 class VCSAdapter(ABC):
     @abstractmethod
     def create_branch(self, name: str) -> None: ...
     @abstractmethod
     def commit(self, files: list[str], message: str) -> str: ...       # returns commit sha
     @abstractmethod
     def create_pr(self, title: str, body: str, head: str, base: str) -> str: ...  # returns pr url
     @abstractmethod
     def get_pr_status(self, pr_id: str) -> str: ...                    # open | merged | closed
 ```
 ### Notify (`adapters/base/notify.py`)
 ```python
 from abc import ABC, abstractmethod
 class NotifyAdapter(ABC):
     @abstractmethod
     def send(self, message: str, context: dict) -> None: ...
 ```
 ### Runtime (`adapters/base/runtime.py`)
 ```python
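 from abc import ABC, abstractmethod
 class RuntimeAdapter(ABC):
     @abstractmethod
     def spawn(self, task: str, capability: str, context: dict) -> str: ...  # returns agent_id
     @abstractmethod
     def get_result(self, agent_id: str, timeout_s: int) -> dict: ...
     @abstractmethod
     def kill(self, agent_id: str) -> None: ...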
 # Two implementations:
 #   openclaw.py    — general purpose, uses sessions_spawn, suits T1/T2/T3
 #   claude_code.py — coding-specialized, has file/git/exec tools, suits T4/T5
 #
 # The runner selects runtime based on brief.preferred_runtime:
 #   "standard"      → openclaw.py (default)
 #   "coding_agent"  → claude_code.py (falls back to standard if unavailable)
 #
 # Both implementations inject brief.agent_personality as the system prompt
 # when spawning, if present. Falls back to generic tier prompt otherwise.
 # claude_code.py passes the agent file via --system-prompt flag natively
 # (agency-agents was designed for Claude Code's agents/ directory).
 ```
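The selection and fallback rules in the comments above might look like this in the runner. The function names `select_runtime` and `system_prompt_path` and the shape of the `runtime_config` dict (mirroring a `runtime:` section with `default` and `coding_agent` keys) are assumptions for this sketch.

```python
from __future__ import annotations

# Fallback tier prompts, matching the files listed under prompts/.
TIER_PROMPTS = {
    1: "prompts/t1_visionary.md",
    2: "prompts/t2_architect.md",
    3: "prompts/t3_squad_lead.md",
    4: "prompts/t4_implementer.md",
    5: "prompts/t5_verifier.md",
}


def select_runtime(preferred_runtime: str | None, runtime_config: dict) -> str:
    """Return the adapter name to use for a brief.

    "coding_agent" resolves to the configured coding runtime and falls back
    to the default when none is configured; anything else uses the default.
    """
    default = runtime_config["default"]
    if preferred_runtime == "coding_agent":
        return runtime_config.get("coding_agent") or default
    return default


def system_prompt_path(agent_personality: str | None, tier: int) -> str:
    """Prefer the brief's personality file; otherwise use the generic tier prompt."""
    return agent_personality or TIER_PROMPTS[tier]
```

For example, `select_runtime("coding_agent", {"default": "openclaw"})` resolves to `"openclaw"` because no coding runtime is configured.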
 ---
 ## Run Config (`config/team.yaml`)
 ```yaml
 run:
  goal: "Build webhook ingestion system with retry logic and DLQ"
  repo: "[email protected]:org/repo.git"
  base_branch: "main"
 adapters:
  llm: anthropic
  vcs: github
  notify: openclaw
  runtime: openclaw
 models:
  provider: anthropic          # default provider
  capability_map:
    reasoning-heavy:
      anthropic: claude-opus-4-6
      openai: o3
    capable:
      anthropic: claude-sonnet-4-6
      openai: gpt-4o
      ollama: llama3.1:70b
    fast-cheap:
      anthropic: claude-haiku-3-5
      openai: gpt-4o-mini
      ollama: llama3.2
  # optional: override provider per tier
  tier_overrides:
    t1: { provider: openai, capability: reasoning-heavy }
    t4: { provider: ollama, capability: fast-cheap }
 runtime:
  default: openclaw
  coding_agent: claude_code     # used for T4/T5 when available; omit to disable
  native_teams: false           # Claude Code's experimental agent teams — opt-in only
                                # when true: T3 hands full workstream to Claude Code,
                                # which fans out internally. faster but less blackboard
                                # visibility. default: false (explicit T4 spawning)
  # tier_runtime_map (optional overrides):
  #   t1: standard
  #   t2: standard
  #   t3: standard
  #   t4: coding_agent
  #   t5: coding_agent
 retry_defaults:
  bad_output: 3
  partial: 2
  blocked: 0    # always escalate immediately
 ```
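To illustrate how `capability_map` and `tier_overrides` could combine, here is a small resolver over a dict shaped like the `models:` section above. The function name `resolve_model`, the precedence rule (a tier override replaces both provider and capability), and raising `LookupError` for missing entries are assumptions; the spec does not define them.

```python
from __future__ import annotations

# Mirrors the models: section of the example config (loaded YAML would look like this).
MODELS = {
    "provider": "anthropic",
    "capability_map": {
        "reasoning-heavy": {"anthropic": "claude-opus-4-6", "openai": "o3"},
        "capable": {
            "anthropic": "claude-sonnet-4-6",
            "openai": "gpt-4o",
            "ollama": "llama3.1:70b",
        },
        "fast-cheap": {
            "anthropic": "claude-haiku-3-5",
            "openai": "gpt-4o-mini",
            "ollama": "llama3.2",
        },
    },
    "tier_overrides": {
        "t1": {"provider": "openai", "capability": "reasoning-heavy"},
        "t4": {"provider": "ollama", "capability": "fast-cheap"},
    },
}


def resolve_model(models_cfg: dict, tier: int, capability: str) -> tuple[str, str]:
    """Return (provider, model) for a tier, applying any tier override first."""
    override = models_cfg.get("tier_overrides", {}).get(f"t{tier}", {})
    provider = override.get("provider", models_cfg["provider"])
    capability = override.get("capability", capability)
    try:
        model = models_cfg["capability_map"][capability][provider]
    except KeyError:
        raise LookupError(
            f"no model configured for capability {capability!r} on provider {provider!r}"
        ) from None
    return provider, model
```

Failing loudly on a missing pair matters here because the example map has no `ollama` entry under `reasoning-heavy`, so an override could easily point at a combination that does not exist.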
 ---
 ## Role Registry (`config/role_registry.yaml`)
 Maps `(tier, domain)` → agent personality file. T1 consults this during scope assessment when selecting specialists for each workstream brief. Adding a new specialist means adding one entry here — no core changes.
 ```yaml
 t1:
  default: agents/strategy/nexus-strategy.md
 t2:
  backend:  agents/engineering/engineering-software-architect.md
  frontend: agents/engineering/engineering-software-architect.md
  infra:    agents/engineering/engineering-devops-automator.md
  data:     agents/engineering/engineering-data-engineer.md
  default:  agents/engineering/engineering-software-architect.md
 t3:
  backend:  agents/engineering/engineering-senior-developer.md
  frontend: agents/engineering/engineering-senior-developer.md
  infra:    agents/engineering/engineering-sre.md
  default:  agents/engineering/engineering-senior-developer.md
 t4:
  frontend:  agents/engineering/engineering-frontend-developer.md
  backend:   agents/engineering/engineering-backend-architect.md
  database:  agents/engineering/engineering-database-optimizer.md
  devops:    agents/engineering/engineering-devops-automator.md
  mobile:    agents/engineering/engineering-mobile-app-builder.md
  ai:        agents/engineering/engineering-ai-engineer.md
  security:  agents/engineering/engineering-security-engineer.md
  docs:      agents/engineering/engineering-technical-writer.md
  default:   agents/engineering/engineering-senior-developer.md
 t5:
  code:        agents/engineering/engineering-code-reviewer.md
  integration: agents/testing/testing-reality-checker.md
  api:         agents/testing/testing-api-tester.md
  performance: agents/testing/testing-performance-benchmarker.md
  security:    agents/engineering/engineering-security-engineer.md
  default:     agents/engineering/engineering-code-reviewer.md
 ```
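A lookup against this registry can be very small. The sketch below uses a dict literal holding a subset of the entries above in place of the parsed YAML, and the function name `resolve_personality` is hypothetical.

```python
from __future__ import annotations

# Subset of config/role_registry.yaml, as it would look once parsed into a dict.
REGISTRY = {
    "t1": {"default": "agents/strategy/nexus-strategy.md"},
    "t4": {
        "frontend": "agents/engineering/engineering-frontend-developer.md",
        "backend": "agents/engineering/engineering-backend-architect.md",
        "default": "agents/engineering/engineering-senior-developer.md",
    },
    "t5": {
        "api": "agents/testing/testing-api-tester.md",
        "default": "agents/engineering/engineering-code-reviewer.md",
    },
}


def resolve_personality(registry: dict, tier: int, domain: str | None = None) -> str:
    """Map (tier, domain) to a personality file, falling back to the tier default."""
    tier_map = registry[f"t{tier}"]
    return tier_map.get(domain, tier_map["default"])
```

An unknown or missing domain resolves to the tier's `default` entry, so a new specialist only needs a new registry line.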
 ---
 ## Key Flows
 ### 1. Run Kickoff
 ```
 User → Hans → team_runner.start(goal, config)
  → generate run_id
  → init blackboard (create runs/<run_id>/blackboard.db)
  → build T1 brief (goal_anchor = goal, retry_budget from config)
  → spawn T1 via runtime adapter
  → await T1 workplan
 ```
 ### 2. T1 Scope Assessment
 ```
 T1 receives brief
  → assess complexity → decide depth
  → identify workstreams
  → set retry_budget multiplier per workstream (1x simple, 2x complex)
  → emit N workstream briefs for T2 (or T3 if shallow)
  → write workplan to blackboard
  → team_runner spawns T2s in parallel
 ```
 ### 3. T4 Retry Loop (escalation.py)
 ```
 spawn T4 with brief
  → receive result
  → classify: bad_output | blocked | partial | success
  blocked:
    → log event(escalated)
    → pass to T3 immediately
  bad_output, retries_remaining:
    → amend brief with failure context, increment retry_count
    → re-spawn T4
    → log event(retried)
  bad_output, retries_exhausted:
    → log event(escalated)
    → pass to T3
  partial:
    → write salvageable parts to blackboard
    → re-task remainder with new brief
  success:
    → write result to blackboard
    → log event(completed)
    → notify T3
 ```
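The classification branches above reduce to a pure routing function, which may be easier to test than the full loop. In this sketch the `Outcome` and `Action` enums and the `route` function are assumed names; the returned event kind is `None` for `partial` because the flow above does not name an event for that branch.

```python
from __future__ import annotations

from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    BAD_OUTPUT = "bad_output"
    BLOCKED = "blocked"
    PARTIAL = "partial"


class Action(Enum):
    COMPLETE = "complete"                  # write result, notify T3
    RETRY = "retry"                        # amend brief, re-spawn T4
    ESCALATE = "escalate"                  # pass to T3
    RETASK_REMAINDER = "retask_remainder"  # salvage, new brief for the rest


def route(outcome: Outcome, retry_count: int, retry_budget: int) -> tuple[Action, str | None]:
    """Return (next action, event kind to log) for a classified T4 result."""
    if outcome is Outcome.SUCCESS:
        return Action.COMPLETE, "completed"
    if outcome is Outcome.BLOCKED:
        return Action.ESCALATE, "escalated"  # never retried
    if outcome is Outcome.BAD_OUTPUT:
        if retry_count < retry_budget:
            return Action.RETRY, "retried"
        return Action.ESCALATE, "escalated"  # budget exhausted
    if outcome is Outcome.PARTIAL:
        return Action.RETASK_REMAINDER, None
    raise ValueError(f"unhandled outcome: {outcome!r}")
```

The caller would increment `retry_count` on the amended brief before re-spawning, so the budget check here always sees the number of retries already used.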
 ### 4. Review Gate
 ```
 T1 completes integration
  → vcs_adapter.create_pr(
      title="[agent-teams] <run_id>: <goal summary>",
      body="<workplan + workstream summaries>",
      head="integration/<run_id>",
      base="main"
    )
  → notify_adapter.send(
      "Run <run_id> complete. PR ready for review: <pr_url>",
      context={run_id, goal, workstreams, pr_url}
    )
  → blackboard: update run status → "review"
  → halt — no auto-merge
 ```
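The gate can be expressed as one function that talks only to the adapter interfaces. Everything here other than the `create_pr` and `send` signatures is assumed for illustration: the `review_gate` name, the two recording stand-ins, the placeholder PR URL, and the `statuses` dict that stands in for updating the `runs` table.

```python
from __future__ import annotations


class RecordingVCS:
    """Hypothetical stand-in for a VCSAdapter; returns a placeholder PR URL."""

    def __init__(self) -> None:
        self.prs: list[dict] = []

    def create_pr(self, title: str, body: str, head: str, base: str) -> str:
        self.prs.append({"title": title, "body": body, "head": head, "base": base})
        return f"https://example.invalid/pr/{len(self.prs)}"


class RecordingNotify:
    """Hypothetical stand-in for a NotifyAdapter; keeps messages in memory."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, dict]] = []

    def send(self, message: str, context: dict) -> None:
        self.messages.append((message, context))


def review_gate(run_id, goal, workstreams, vcs, notify, statuses) -> str:
    """Open the PR, notify the reviewer, mark the run as in review, then stop."""
    pr_url = vcs.create_pr(
        title=f"[agent-teams] {run_id}: {goal}",
        body="\n".join(f"- {name}" for name in workstreams),
        head=f"integration/{run_id}",
        base="main",
    )
    notify.send(
        f"Run {run_id} complete. PR ready for review: {pr_url}",
        context={"run_id": run_id, "goal": goal, "workstreams": workstreams, "pr_url": pr_url},
    )
    statuses[run_id] = "review"  # stand-in for the blackboard status update
    return pr_url  # no merge call anywhere: merging is left to the human reviewer
```

Because the function never calls a merge operation, the "no auto-merge" rule holds by construction rather than by configuration.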
 ---
 ## Build Order
 1. `git submodule add https://github.com/msitarzewski/agency-agents agents/` — pull the talent pool
 2. `config/role_registry.yaml` — map tier+domain → agent personality files
 3. `core/task_brief.py` — schema + validation (everything depends on this)
 4. `core/blackboard.py` — SQLite store, all table definitions
 5. `adapters/base/*` — all four abstract interfaces
 6. `adapters/llm/anthropic.py` — first LLM implementation
 7. `core/escalation.py` — retry + failure routing logic
 8. `adapters/runtime/openclaw.py` — wire up sessions_spawn + personality injection
 9. `adapters/runtime/claude_code.py` — coding agent runtime, personality via --system-prompt
 10. `core/team_runner.py` — full run lifecycle, runtime + personality selection
 11. `prompts/` — fallback tier prompts (used when no agent_personality set)
 12. `adapters/vcs/github.py` — PR creation + branch management
 13. `adapters/notify/openclaw.py` — Hans notification
 14. `config/team.yaml` — example config
 15. `README.md` — how to run, how to add adapters, how to extend the roster
 ---
 ## Out of Scope (Phase 2)
 - Cost accounting per tier + run rollup
 - Parallel workstream progress dashboard
 - Additional adapter implementations (GitLab, Slack, OpenAI, Ollama)
 - Persistent standing teams
 - Web UI for run monitoring
@@ -0,0 +1,208 @@
 # Tiered Agent Team System — Design Document
 _Started: 2026-03-14. Status: Pre-build, gathering requirements._
 ---
 ## Overview
 A dynamic, hierarchical multi-agent system for software pipelines. Teams assemble on demand, execute, then disband. Inspired by a blend of Hollywood production (dynamic assembly), consulting firms (structured deliverables, hierarchical synthesis), and two-pizza teams (small autonomous squads, clear domain ownership).
 ---
 ## Core Principles
 **1. Tiers represent cognitive modes, not org chart levels.**
 Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.
 **2. Depth is proportional to complexity.**
 Not every task needs every tier. A config change might only need T3→T4. A new product needs the full stack.
 **3. Goal anchoring at every level.**
 T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.
 **4. Artifacts, not summaries.**
 Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.
 **5. Verification is bidirectional.**
 Lower tiers verify correctness. Upper tiers verify alignment with original intent. Both directions catch different failure modes.
 **6. Provider agnostic.**
 The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.
 **7. Specialist talent pool.**
 Tiers define structure and responsibility. Agent personalities define domain expertise. The two are separate — the same tier can be filled by different specialists depending on the workstream domain.
 ---
 ## Tier Definitions
 | Tier | Role | Owns | Capability Level |
 |------|------|------|-----------------|
 | T1 | Visionary | Goal, constraints, final acceptance, architectural bets | reasoning-heavy |
 | T2 | Architect | System design, interface contracts, workstream boundaries | reasoning-heavy / capable |
 | T3 | Squad Lead | Workstream delivery, worker coordination, quality gate | capable |
 | T4 | Implementer | Atomic task execution (one file, one function, one test) | fast-cheap |
 | T5 | Verifier | Validation of T4 output — correctness + intent alignment | capable |
 T5 runs **parallel to T4**, not above it. It's a quality gate, not a management layer.
 Capability levels map to actual models per provider in config — the core system never references a specific model name.
 ---
 ## Variable Depth
 ```
 Config change          T3 → T4
 New feature            T2 → T3 → T4
 Major refactor         T1 → T2 → T3 → T4 → T5
 New system / product   T1 → T2 → T3s (parallel) → T4s → T5s
 ```
 T3 assesses scope on receipt. If a task is simple enough, T3 handles it directly, without escalating upward or waiting for T2 sign-off.
 ---
 ## Horizontal Scaling Within Tiers
 Each tier can have multiple agents running in parallel:
 ```
 T1 (1–2 agents)
 ├── T2: Backend Architect
 │   ├── T3: API Squad Lead
 │   │   ├── T4: Worker — endpoint A
 │   │   ├── T4: Worker — endpoint B
 │   │   └── T5: Verifier
 │   └── T3: DB Squad Lead
 │       ├── T4: Worker — migrations
 │       └── T5: Verifier
 ├── T2: Frontend Architect
 │   └── T3: UI Squad Lead
 │       ├── T4: Worker — component X
 │       └── T4: Worker — component Y
 └── T2: Infra Architect
    └── T3: Platform Squad Lead
        └── T4: Worker — config / deploy
 ```
 ---
 ## Shared State
 For software pipelines, **the repo is the primary blackboard**:
 - T4 workers commit to feature branches
 - T3 leads review and merge to workstream branches
 - T2 architects own integration branches
 - T1 does final integration and acceptance
 This is supplemented by a per-run SQLite coordination store that tracks in-flight workstreams, handoff artifacts, tier status, and retry counts.
 ---
 ## Failure Handling
 | Failure | Handler | Action |
 |---------|---------|--------|
 | T4 bad output | T3 | Retry T4 with corrected brief (up to retry_budget) |
 | T4 blocked | T3 | Escalate immediately — no retries |
 | T4 partial output | T3 | Salvage good parts, re-task remainder |
 | T3 workstream stuck | T2 | Re-scope or split the workstream |
 | T2 design wrong | T1 | Re-plan; may discard workstream and restart |
 | Repeated escalation | Surface to user | Block until human unblocks |
 Retry limits prevent infinite loops. Escalation path is always upward, never sideways.
 ---
 ## Agent Talent Pool
 The system builds on [agency-agents](https://github.com/msitarzewski/agency-agents) — a library of 50+ pre-built specialist personalities, each with deep domain expertise, quality standards, and specific deliverables.
 **Division of responsibility:**
 - Our system provides: orchestration, tier structure, task briefs, retries, verification gates, shared state
 - Agency-agents provides: the specialist knowledge each agent brings to its role
 T1 selects the right specialist from the roster when building workstream briefs. The specialist's personality is injected as the system prompt at spawn time.
 **Default tier-to-specialist mapping for software pipelines:**
 | Tier | Domain | Agent |
 |------|--------|-------|
 | T1 | Strategy | nexus-strategy |
 | T2 | Backend | software-architect |
 | T2 | Infra | devops-automator |
 | T2 | Data | data-engineer |
 | T3 | Backend | senior-developer |
 | T3 | Reliability | sre |
 | T4 | Frontend | frontend-developer |
 | T4 | Backend | backend-architect |
 | T4 | Database | database-optimizer |
 | T4 | DevOps | devops-automator |
 | T4 | Mobile | mobile-app-builder |
 | T4 | AI/ML | ai-engineer |
 | T4 | Security | security-engineer |
 | T4 | Docs | technical-writer |
 | T5 | Code review | code-reviewer |
 | T5 | Integration | testing-reality-checker |
 | T5 | API | testing-api-tester |
 | T5 | Performance | testing-performance-benchmarker |
 | T5 | Security | security-engineer |
 The roster is not fixed — T1 can select any agent from the library based on workstream needs. Non-engineering agents (design, marketing, product) extend the system to non-software pipelines.
 ---
 ## Adapter Layers
 Everything external is a swappable adapter. Core logic never imports from adapters directly — always through an interface.
 ```
 Core (platform-agnostic)
 ├── team_runner      — run lifecycle, agent spawning, runtime selection
 ├── blackboard       — SQLite coordination state
 ├── task_brief       — schema + validation
 └── escalation       — retry logic, failure routing
 Adapters (swappable)
 ├── llm/             — anthropic (now), openai, ollama, any API
 ├── notify/          — openclaw (now), slack, email, webhook...
 ├── vcs/             — github (now), gitlab, gitea, bare git...
 └── runtime/
    ├── standard     — openclaw sessions_spawn (T1/T2/T3)
    └── coding_agent — claude_code (T4/T5 default), codex, aider...
 ```
 Swapping providers means writing a new adapter file — nothing in core changes.
 T4 and T5 default to the **coding agent runtime** when available. It provides direct file system access, git operations, and test execution, so file contents do not have to be shuttled through message context. If the coding agent runtime is not configured, T4 and T5 fall back to the standard runtime.
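A small sketch of what "always through an interface" can mean in practice: core code is written against the abstract `LLMAdapter`, and a provider is swapped by passing a different subclass. The `CannedLLM` adapter and the `summarize_goal` function are hypothetical and exist only for this example; no real provider API is called.

```python
from abc import ABC, abstractmethod


class LLMAdapter(ABC):
    """Abstract interface; core code depends on this and nothing else."""

    @abstractmethod
    def complete(self, prompt: str, capability: str, context: dict) -> str: ...

    @abstractmethod
    def resolve_model(self, capability: str) -> str: ...


class CannedLLM(LLMAdapter):
    """Hypothetical adapter that echoes the prompt, tagged with the resolved model."""

    def __init__(self, models: dict) -> None:
        self._models = models

    def resolve_model(self, capability: str) -> str:
        return self._models[capability]

    def complete(self, prompt: str, capability: str, context: dict) -> str:
        return f"[{self.resolve_model(capability)}] {prompt}"


def summarize_goal(llm: LLMAdapter, goal: str) -> str:
    """Core-side code: names a capability level, never a provider or model."""
    return llm.complete(f"Summarize: {goal}", capability="fast-cheap", context={})
```

Calling `summarize_goal(CannedLLM({"fast-cheap": "model-a"}), "ship it")` exercises the core path without any provider-specific import in the core function.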
 ---
 ## Decisions
 **Depth decision** — T1 assesses scope on receipt and determines how many tiers to engage. Not pre-configured per task type.
 **Trigger mechanism** — User messages Hans → Hans spins up T1 with the goal. T1 takes it from there.
 **Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew for review. Merge is gated on human sign-off. Notification is dual: Hans messages Andrew directly, and a PR is opened on the VCS platform so Andrew gets notified natively too. This keeps the review step platform-independent — whichever VCS is in use, Hans always notifies Andrew directly as a fallback.
 **Retry limits** — Three failure types, handled differently:
 - *Bad output* → retry T4 with a corrected brief (default: 3 retries)
 - *Blocked* → escalate immediately, no retries
 - *Partial output* → salvage good parts, re-task the remainder
 T1 sets a retry budget multiplier during scope assessment (`1x` simple, `2x` complex). Retry budget is a field on the task brief — not hardcoded in the runner.
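As a worked example of the multiplier, the sketch below scales a set of per-failure-type defaults (3 for bad output, 2 for partial, 0 for blocked, taken from the example run config) by `1x` or `2x`. The function name `retry_budgets` and the choice to scale every failure type uniformly are assumptions; the design only states that a multiplier exists.

```python
# Per-failure-type defaults, as in the example run config's retry_defaults.
RETRY_DEFAULTS = {"bad_output": 3, "partial": 2, "blocked": 0}

# Multiplier chosen by T1 during scope assessment.
MULTIPLIERS = {"simple": 1, "complex": 2}


def retry_budgets(complexity: str, defaults: dict = RETRY_DEFAULTS) -> dict:
    """Scale each default retry budget by the workstream's complexity multiplier."""
    multiplier = MULTIPLIERS[complexity]
    return {kind: budget * multiplier for kind, budget in defaults.items()}
```

Under these assumptions a complex workstream gets 6 bad-output retries and 4 partial retries, while `blocked` stays at 0 for any multiplier, so blocked tasks still escalate immediately.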
 **Platform agnosticism** — Core logic is provider and platform agnostic. LLMs, VCS, notifications, and agent runtimes are all adapters. Tiers reference capability levels (`reasoning-heavy`, `capable`, `fast-cheap`), not specific model names. Provider-to-model mapping lives in config.
 **LLM provider** — Anthropic first implementation. Config supports per-tier provider selection and mixing providers across tiers (e.g. T1 on OpenAI o3, T4 workers on local Ollama).
 **Gateway modification** — Decided against. Agent-teams stays standalone Python. OpenClaw is used as the runtime adapter via existing primitives (sessions_spawn, sessions_send, subagents) — called through a skill layer. No gateway fork. Keeps platform agnosticism intact and avoids Node/Python mismatch and fork maintenance burden.
 **Coding agent runtime** — Claude Code is the default T4/T5 runtime for software pipelines. It is purpose-built for implementation and verification: direct file access, git ops, test execution. Enters as a runtime adapter — swappable for Codex, Aider, or any equivalent. T1/T2/T3 always use the standard runtime (they reason, they don't edit files).
 **Claude Code native teams** — Claude Code has an experimental agent teams feature that fans out sub-agents internally within a session. Integrated as an opt-in flag (`native_teams: true`) in the coding_agent runtime adapter. When enabled, T3 hands a full workstream to Claude Code and it parallelises internally — faster, but less granular blackboard visibility. Default is `false` — explicit T4 spawning is the baseline; native teams is a speed optimisation to enable deliberately.
 **Agency-agents integration** — Agent personalities sourced from [msitarzewski/agency-agents](https://github.com/msitarzewski/agency-agents) via git submodule. Included as `agents/` in the repo. T1 selects specialists from the roster via `config/role_registry.yaml`. Each task brief carries an `agent_personality` field (path to the agent .md file) which the runtime adapter injects as the system prompt at spawn time. Adding new specialists means adding an entry to the registry — no core changes required.