# Tiered Agent Team System — Design Document

_Started: 2026-03-14. Last updated: 2026-03-16 (evening)._

---

## Open Design Questions

The following areas are identified but not yet resolved. Work through these before implementing `core/team_runner.py`.

1. **T3 mesh mechanics** — How do T3s within the same T2 domain coordinate? Via blackboard, direct message exchange, or a designated T3 lead? What does "negotiate task boundaries" look like concretely?

2. **T1 output schema** — What does T1's Plan phase output look like as structured data? Needs a formal schema: workstreams, tier paths, parallelism flags, retry budget, T2 specialist list. This is what the runner parses to bootstrap the pipeline.

3. **T5 consensus mechanics** — Individual T5s review their slice and produce results. Who aggregates? What does the joint verdict look like as structured data? What happens on split verdict (some T5s pass, some fail)?

4. **Path amendment mechanism** — When a mid-run tier proposes a path amendment, what's the concrete mechanism? Who writes to the blackboard, in what format, and how does the relevant higher tier get notified?

5. **Failure handling (distributed model)** — The current failure table assumes centralised runner handling. Needs to be rewritten to reflect distributed ownership: T3 handles T4 failures, T2 handles T3 failures, T1 handles T2 failures. Runner only handles T1 failure and terminal escalation to human.

---

## Overview

A dynamic, hierarchical multi-agent system for software pipelines. Teams assemble on demand, execute, then disband. Inspired by a blend of Hollywood production (dynamic assembly), consulting firms (structured deliverables, hierarchical synthesis), and two-pizza teams (small autonomous squads, clear domain ownership).

---

## Core Principles

**1. Tiers represent cognitive modes, not org chart levels.**
Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.

**2. Depth is proportional to complexity.**
Not every task needs every tier. A config change might only need T3→T4 (plus the mandatory T5 check). A new product needs the full stack. T1 assesses scope and prescribes the path for each run; the path is never pre-configured.

**3. Goal anchoring at every level.**
T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.

**4. Artifacts, not summaries.**
Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.

**5. Verification is mandatory.**
T5 always runs. Nothing returns to T1 unverified. T5 is a mandatory quality gate: output must work, and work well, before it surfaces upward.

**6. Provider agnostic.**
The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.

**7. Specialist talent pool.**
Tiers define structure and responsibility. Agent personalities define domain expertise. The two are separate — the same tier can be filled by different specialists depending on the workstream domain.
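
Principles 3 and 4 together suggest a task brief that carries the goal anchor verbatim at every hop. The following is a minimal sketch, not the actual `task_brief` schema. The field names `goal_anchor`, `spec`, and the helper `child_brief` are hypothetical; only `agent_personality` and the retry budget are named elsewhere in this document.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class TaskBrief:
    """Hypothetical brief passed down one tier. Field names are illustrative."""

    goal_anchor: str   # T1's original intent, copied verbatim at every level
    tier: str          # e.g. "T3"
    workstream: str
    spec: dict = field(default_factory=dict)  # structured spec, not paraphrased prose
    retry_budget: int = 1
    agent_personality: str = ""

    def child_brief(self, tier: str, spec: dict) -> "TaskBrief":
        """Derive the brief for the tier below; the goal anchor is never rewritten."""
        return TaskBrief(
            goal_anchor=self.goal_anchor,
            tier=tier,
            workstream=self.workstream,
            spec=spec,
            retry_budget=self.retry_budget,
            agent_personality=self.agent_personality,
        )

    def to_json(self) -> str:
        """Serialise the brief as the JSON artifact handed to the next tier."""
        return json.dumps(asdict(self), sort_keys=True)
```

Deriving child briefs through one method keeps the anchor from being paraphrased on the way down, which is the failure mode principle 3 is guarding against.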

---

## Tier Definitions

| Tier | Role | Owns | Capability Level |
|------|------|------|-----------------|
| T1 | Visionary | Goal, constraints, dispatch plan, final acceptance | reasoning-heavy |
| T2 | Architect | System design, interface contracts, workstream boundaries | reasoning-heavy / capable |
| T3 | Squad Lead | Workstream delivery, T4 management, quality gate | capable |
| T4 | Implementer | Atomic task execution (one file, one function, one test) | fast-cheap |
| T5 | Verifier | Validation of T4 output — correctness + intent alignment | capable |

T5 runs **within T3's scope**, not above it. T3 commissions T5 verification of its T4 outputs. T5 is a quality gate, not a management layer.

Capability levels map to actual models per provider in config — the core system never references a specific model name.
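
One way the capability-to-model mapping could look, as a sketch only. The provider names, model names, and the `resolve_model` helper are placeholders invented for illustration; the three capability levels are the ones used in the table above.

```python
CAPABILITY_LEVELS = ("reasoning-heavy", "capable", "fast-cheap")

# Hypothetical config. Provider and model names are placeholders, not real model IDs.
MODEL_CONFIG = {
    "provider-a": {
        "reasoning-heavy": "provider-a-large",
        "capable": "provider-a-medium",
        "fast-cheap": "provider-a-small",
    },
    "provider-b": {
        "reasoning-heavy": "provider-b-max",
        "capable": "provider-b-std",
        "fast-cheap": "provider-b-mini",
    },
}


def resolve_model(provider: str, capability: str, config: dict = MODEL_CONFIG) -> str:
    """Translate a capability level into a concrete model name for one provider."""
    if capability not in CAPABILITY_LEVELS:
        raise ValueError(f"unknown capability level: {capability!r}")
    try:
        return config[provider][capability]
    except KeyError:
        raise KeyError(f"no model configured for {provider!r} / {capability!r}") from None
```

With this shape, core code only ever asks for `"capable"` or `"fast-cheap"`, and mixing providers across tiers is a matter of passing a different `provider` per tier.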

---

## Dispatch Model

### T1 Owns the Plan

T1 is not just a decomposer — it is the dispatch planner. Its output declares:

- **Workstreams** — the decomposed units of work
- **Tier path per workstream** — which tiers to engage (e.g. `[T2, T3, T4, T5]` or `[T4, T5]` for trivial tasks)
- **Parallelism** — which workstreams are independent and can run concurrently

T1 does not prescribe how each tier operates internally. That is the tier's own concern.
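
The three declarations above could be captured in a structure like the following. This is a sketch under assumptions: the T1 output schema is still an open question, the class and field names are hypothetical, and parallelism is expressed here as a `depends_on` list rather than the flags the final schema may use.

```python
from dataclasses import dataclass, field

VALID_TIERS = ("T1", "T2", "T3", "T4", "T5")


@dataclass
class Workstream:
    name: str
    tier_path: list[str]                                  # e.g. ["T2", "T3", "T4", "T5"]
    depends_on: list[str] = field(default_factory=list)   # empty means independent


@dataclass
class DispatchPlan:
    goal: str
    workstreams: list[Workstream]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the plan is well-formed."""
        errors = []
        names = {w.name for w in self.workstreams}
        for w in self.workstreams:
            unknown = [t for t in w.tier_path if t not in VALID_TIERS]
            if unknown:
                errors.append(f"{w.name}: unknown tiers {unknown}")
            if "T5" not in w.tier_path:
                errors.append(f"{w.name}: T5 is mandatory")
            missing = [d for d in w.depends_on if d not in names]
            if missing:
                errors.append(f"{w.name}: unknown dependencies {missing}")
        return errors

    def independent(self) -> list[str]:
        """Workstreams with no dependencies can start concurrently."""
        return [w.name for w in self.workstreams if not w.depends_on]
```

Validating the plan before the approval gate means a path that skips T5 is rejected mechanically rather than relying on review to notice it.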

### T1 Lifecycle — Two Explicit Phases

T1 is invoked twice per run, each with a distinct prompt and purpose:

**Phase 1 — Plan:**
1. T1 produces initial dispatch plan (workstreams, tier paths, parallelism, retry budget)
2. T1 self-critiques its own plan in a single follow-up pass ("what could go wrong, what did I miss?") and amends
3. Amended plan surfaces to Andrew for approval — no T2s spawn until approval is given

**Phase 2 — Accept:**
After the full T2→T3→T4→T5 pipeline completes, T1 is re-invoked with the final output. It validates against the original goal anchor and either accepts (opens PR) or rejects (escalates back down).

Both phases are named explicitly in the task brief schema and tracked on the blackboard.
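
The two phases and the approval gate can be read as a small state machine. The sketch below is illustrative: the state names are hypothetical, and the choice that a rejection re-enters the pipeline (rather than restarting the Plan phase) is an assumption, since "escalates back down" is not specified further here.

```python
from enum import Enum


class T1State(Enum):
    PLAN_DRAFTED = "plan_drafted"
    PLAN_CRITIQUED = "plan_critiqued"   # single self-critique pass applied
    PLAN_APPROVED = "plan_approved"     # human approval given; T2s may spawn
    PIPELINE_DONE = "pipeline_done"     # T2→T3→T4→T5 finished
    ACCEPTED = "accepted"               # Accept phase passed; PR opened
    REJECTED = "rejected"               # Accept phase failed; work goes back down


# Allowed transitions. REJECTED returning to PIPELINE_DONE is an assumption.
TRANSITIONS = {
    T1State.PLAN_DRAFTED: {T1State.PLAN_CRITIQUED},
    T1State.PLAN_CRITIQUED: {T1State.PLAN_APPROVED},
    T1State.PLAN_APPROVED: {T1State.PIPELINE_DONE},
    T1State.PIPELINE_DONE: {T1State.ACCEPTED, T1State.REJECTED},
    T1State.REJECTED: {T1State.PIPELINE_DONE},
    T1State.ACCEPTED: set(),
}


def advance(current: T1State, target: T1State, *, human_approved: bool = False) -> T1State:
    """Move to the next state, enforcing ordering and the human approval gate."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    if target is T1State.PLAN_APPROVED and not human_approved:
        raise PermissionError("plan approval requires explicit human sign-off")
    return target


def may_spawn_t2(state: T1State) -> bool:
    """No T2s spawn until the amended plan has been approved."""
    return state is T1State.PLAN_APPROVED
```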

### Each Tier Owns the Layer Below

Control flow is distributed, not centralised:

- T1 manages its T2s
- T2 Lead manages T2 specialists and their domain boundaries
- T2 specialists each own their T3s
- **T3 manages its T4s** — including dependency graph, parallelism, and T5 commissioning
- The runner is thin: bootstrap T1, monitor the blackboard, handle final result and notifications

This means orchestration logic lives in agent prompts and output schemas — not in Python runner code. Adding a new execution pattern means updating a prompt, not the runner.

**Tradeoff:** Debugging is harder. When something fails mid-chain, you read blackboard logs rather than step through central runner code. This is a tooling problem to solve (good blackboard inspection), not a design flaw to avoid.

### Dynamic Paths

Tiers can propose path amendments mid-run (e.g. T3 discovers scope that warrants a T2 pass it didn't get). Amendments are logged to the blackboard. Higher tiers are notified but do not need to approve; the notification is informational. No tier silently changes the plan.

---

## Orchestration Patterns Per Tier

Different tiers suit different internal coordination patterns. These are baked into the runner's tier-handling logic and the tier prompts — not prescribed by T1.

| Tier | Pattern | Rationale |
|------|---------|-----------|
| T1 | Single agent, two phases | Must be authoritative; plan phase + accept phase |
| T2 Lead | Coordinator | Spawned first; defines boundaries + shared assumptions; drives conflict resolution; produces canonical architecture |
| T2 Specialists | Parallel fan-out | Each works independently within its domain; reads Lead's boundaries + shared assumptions doc before starting |
| T3 | Light mesh | Peer coordination within same T2 domain to negotiate task boundaries before T4 dispatch |
| T4 | Swarm + pipeline hybrid | Independent tasks run as swarm; dependent tasks pipeline (T4-A's output feeds T4-B). T3 declares which is which. |
| T5 | Parallel fan-out + consensus | Each T5 reviews its slice independently, then compares notes for a joint verdict — catches both artifact bugs and integration issues |
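
For the T4 swarm + pipeline hybrid, one plausible way for T3 to turn its declared dependencies into an execution order is to group tasks into waves: everything in a wave runs as a swarm, and waves run in sequence. This sketch uses the standard library's `graphlib`; the function name `execution_waves` and the dependency-map shape are assumptions, not a settled interface.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves. Tasks in one wave have no unmet dependencies.

    `dependencies` maps each task to the set of tasks whose output it needs.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For example, if T4-B needs T4-A's output and T4-C is independent, the result is `[["T4-A", "T4-C"], ["T4-B"]]`: A and C swarm, then B runs.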

### T2 Flow in Detail

1. T1 spawns **T2 Lead Architect** with goal + workstream context
2. Lead defines explicit **domain boundaries** (who owns what, hard edges)
3. Lead publishes **shared assumptions doc** — cross-cutting concerns, key conventions, architectural constraints (auth approach, data formats, API patterns, etc.)
4. T1 spawns **T2 specialists** with boundaries + shared assumptions baked into their briefs
5. Specialists work in parallel, each within their defined domain
6. Lead reads all proposals, drives **conflict resolution** with relevant specialists if needed (cycle limit in config — fixed, not per-workstream)
7. Lead produces **canonical architecture** → written to blackboard as distinct artifact
8. T1 (Accept phase) validates canonical architecture against goal anchor
9. Canonical architecture becomes T3 briefs — each T2 specialist hands off to its own T3s
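
Step 6 is the only loop in this flow, so the fixed cycle limit is what bounds it. A rough sketch of that loop follows. The callables `find_conflicts` and `revise` stand in for the Lead's review and a specialist's revision, both of which would be LLM calls in practice; their names and signatures are hypothetical.

```python
from typing import Callable

Proposals = dict[str, str]  # specialist name -> proposal text


def resolve_conflicts(
    proposals: Proposals,
    find_conflicts: Callable[[Proposals], list[str]],
    revise: Callable[[str, Proposals], str],
    cycle_limit: int,
) -> tuple[Proposals, bool]:
    """Run at most `cycle_limit` rounds of targeted revision.

    Returns the latest proposals and whether they are now conflict-free.
    """
    current = dict(proposals)  # do not mutate the caller's copy
    for _ in range(cycle_limit):
        conflicted = find_conflicts(current)
        if not conflicted:
            return current, True
        for specialist in conflicted:
            # Only the conflicting specialists receive a brief back.
            current[specialist] = revise(specialist, current)
    return current, not find_conflicts(current)
```

What the Lead does when the limit is reached with conflicts still open is not specified here; the boolean simply makes that outcome visible to the caller.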

---

## Horizontal Scaling Within Tiers

```
T1 — Phase 1: Plan (self-critique → Andrew approval)
│
├── T2: Lead Architect (boundaries + shared assumptions first)
│   ├── T2: Backend Architect  ─┐
│   ├── T2: Frontend Architect  ├─ parallel, within defined domains
│   └── T2: Infra Architect    ─┘
│       │
│       └── (Lead synthesises → conflict resolution if needed → canonical architecture)
│
├── T2 Backend Architect owns:
│   ├── T3: API Squad Lead  ─┐
│   └── T3: DB Squad Lead   ─┴─ light mesh within domain
│           ├── T4: Worker A  ─┐
│           ├── T4: Worker B  ─┼─ swarm / pipeline (T3 decides)
│           └── T4: Worker C  ─┘
│                   └── T5: Verifier(s) — fan-out + consensus
│
└── T1 — Phase 2: Accept (validates against goal anchor → PR)
```

---

## Shared State

For software pipelines, **the repo is the primary blackboard**:
- T4 workers commit to feature branches
- T3 leads review and merge to workstream branches
- T2 architects own integration branches
- T1 does final integration and acceptance

Supplemented by a SQLite coordination store per run tracking:
- In-flight workstreams and their current execution plans
- Handoff artifacts and tier status
- Retry counts and escalation history
- Path amendments (proposed, by whom, timestamp)
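
A minimal sketch of what the per-run SQLite store might contain, using only the standard `sqlite3` module. Table and column names are assumptions made for illustration; the blackboard format is still listed as an open question.

```python
import sqlite3

# Hypothetical schema covering the items listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS workstreams (
    name        TEXT PRIMARY KEY,
    tier        TEXT NOT NULL,
    status      TEXT NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS artifacts (
    id         INTEGER PRIMARY KEY,
    workstream TEXT NOT NULL REFERENCES workstreams(name),
    from_tier  TEXT NOT NULL,
    to_tier    TEXT NOT NULL,
    payload    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS path_amendments (
    id          INTEGER PRIMARY KEY,
    workstream  TEXT NOT NULL REFERENCES workstreams(name),
    proposed_by TEXT NOT NULL,
    proposal    TEXT NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_blackboard(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the coordination store for one run."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def propose_amendment(
    conn: sqlite3.Connection, workstream: str, proposed_by: str, proposal: str
) -> int:
    """Log a path amendment and return its row id. Nothing is approved here."""
    cur = conn.execute(
        "INSERT INTO path_amendments (workstream, proposed_by, proposal) VALUES (?, ?, ?)",
        (workstream, proposed_by, proposal),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping amendments in their own append-only table gives the "proposed, by whom, timestamp" record directly, and makes blackboard inspection a plain `SELECT`.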

---

## Failure Handling

| Failure | Handler | Action |
|---------|---------|--------|
| T4 bad output | T3 | Retry T4 with corrected brief (up to retry_budget) |
| T4 blocked | T3 | Escalate immediately — no retries |
| T4 partial output | T3 | Salvage good parts, re-task remainder |
| T3 workstream stuck | T2 | Re-scope or split the workstream |
| T2 design wrong | T1 | Re-plan; may discard workstream and restart |
| Repeated escalation | Runner | Surface to the human; block until they unblock |

Retry limits prevent infinite loops. Escalation path is always upward, never sideways.

T1 sets a retry budget multiplier during scope assessment (`1x` simple, `2x` complex). Retry budget is a field on the task brief — not hardcoded in the runner.
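
The T4 rows of the table and the budget multiplier can be sketched as below. Several things are assumptions: the base retry count (`BASE_RETRIES`), the failure-kind strings, the function names, and the rule that a T4 with an exhausted budget is escalated rather than dropped.

```python
# Escalation is always upward, never sideways.
ESCALATION_PARENT = {"T4": "T3", "T3": "T2", "T2": "T1", "T1": "human"}

BASE_RETRIES = 2  # hypothetical base value; this document only defines the multiplier


def retry_budget(multiplier: int, base: int = BASE_RETRIES) -> int:
    """Budget carried on the task brief: 1x for simple scope, 2x for complex."""
    return base * multiplier


def t4_failure_action(kind: str, attempts: int, budget: int) -> str:
    """Action T3 takes for a failed T4 task, following the failure table."""
    if kind == "blocked":
        return "escalate"             # blocked tasks are never retried
    if kind == "partial_output":
        return "salvage_and_retask"
    if kind == "bad_output":
        # Assumption: once the budget is spent, the failure moves up a tier.
        return "retry" if attempts < budget else "escalate"
    raise ValueError(f"unknown failure kind: {kind!r}")


def escalate_from(tier: str) -> str:
    """Return who receives an escalation raised by `tier`."""
    return ESCALATION_PARENT[tier]
```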

---

## Agent Talent Pool

The system builds on [agency-agents](https://github.com/msitarzewski/agency-agents) — a library of 50+ pre-built specialist personalities, each with deep domain expertise, quality standards, and specific deliverables.

**Division of responsibility:**
- Our system provides: orchestration, tier structure, task briefs, retries, verification gates, shared state
- Agency-agents provides: the specialist knowledge each agent brings to its role

T1 selects the right specialist from the roster when building workstream briefs. The specialist's personality is injected as the system prompt at spawn time.

**Default tier-to-specialist mapping for software pipelines:**

| Tier | Domain | Agent |
|------|--------|-------|
| T1 | Strategy | nexus-strategy |
| T2 | Backend | software-architect |
| T2 | Infra | devops-automator |
| T2 | Data | data-engineer |
| T3 | Backend | senior-developer |
| T3 | Reliability | sre |
| T4 | Frontend | frontend-developer |
| T4 | Backend | backend-architect |
| T4 | Database | database-optimizer |
| T4 | DevOps | devops-automator |
| T4 | Mobile | mobile-app-builder |
| T4 | AI/ML | ai-engineer |
| T4 | Security | security-engineer |
| T4 | Docs | technical-writer |
| T5 | Code review | code-reviewer |
| T5 | Integration | testing-reality-checker |
| T5 | API | testing-api-tester |
| T5 | Performance | testing-performance-benchmarker |
| T5 | Security | security-engineer |

The roster is not fixed — T1 can select any agent from the library based on workstream needs.

---

## Adapter Layers

Everything external is a swappable adapter. Core logic never imports from adapters directly — always through an interface.

```
Core (platform-agnostic)
├── team_runner      — thin bootstrap: spawn T1, monitor blackboard, handle result
├── blackboard       — SQLite coordination state
├── task_brief       — schema + validation
└── escalation       — retry logic, failure routing

Adapters (swappable)
├── llm/             — anthropic (now), openai, ollama, any API
├── notify/          — openclaw (now), slack, email, webhook...
├── vcs/             — github (now), gitlab, gitea, bare git...
└── runtime/
    ├── standard     — openclaw sessions_spawn (T1/T2/T3)
    └── coding_agent — claude_code (T4/T5 default), codex, aider...
```

Swapping providers means writing a new adapter file — nothing in core changes.

T4 and T5 default to the **coding agent runtime** when available. If it is not configured, they fall back to the standard runtime.
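
The interface boundary and the runtime fallback could be expressed with a `typing.Protocol`, so core code depends on the shape of an adapter and never on a concrete one. This is a sketch: the `Runtime` protocol, the `spawn` signature, and the stub classes are hypothetical and do not call any real OpenClaw or Claude Code API.

```python
from typing import Optional, Protocol


class Runtime(Protocol):
    """What core expects from any runtime adapter (hypothetical interface)."""

    name: str

    def spawn(self, tier: str, brief: dict) -> str: ...


class StandardRuntime:
    name = "standard"

    def spawn(self, tier: str, brief: dict) -> str:
        # A real adapter would start a session; this stub only returns a handle.
        return f"{self.name}:{tier}:{brief['workstream']}"


class CodingAgentRuntime:
    name = "coding_agent"

    def spawn(self, tier: str, brief: dict) -> str:
        return f"{self.name}:{tier}:{brief['workstream']}"


CODING_TIERS = frozenset({"T4", "T5"})


def pick_runtime(
    tier: str, standard: Runtime, coding_agent: Optional[Runtime] = None
) -> Runtime:
    """T4/T5 use the coding agent runtime when configured, else the standard one."""
    if tier in CODING_TIERS and coding_agent is not None:
        return coding_agent
    return standard
```

Because `Runtime` is structural, a new adapter only has to provide `name` and `spawn`; nothing in core needs to import it by name.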

---

## Decisions Log

**T1 dynamic dispatch** — T1 assesses scope and prescribes tier path and workstream parallelism. It does not prescribe internal tier coordination patterns.

**T1 two-phase lifecycle** — T1 has two explicit named phases: Plan and Accept. Plan phase includes self-critique (single pass) then human approval gate before T2s spawn. Accept phase validates final output against goal anchor. Both phases tracked on blackboard with distinct prompts.

**T1 self-critique** — Single pass only. Diminishing returns on multiple self-critique iterations; the human review after is the real safety net. Self-critique catches obvious gaps; Andrew catches strategic ones.

**Distributed ownership** — Each tier owns the layer below it. Runner is thin. Tradeoff: distributed control makes the system extensible but debugging requires good blackboard tooling, not central runner traces.

**T5 always mandatory** — No skipping verification. Things should work and work well before surfacing to T1.

**T3 owns T4 and T5** — T3 manages its T4s (dependency graph, swarm vs pipeline, parallelism) and commissions T5 verification of T4 outputs. Runner does not orchestrate T4/T5 centrally.

**T2 Lead Architect** — Dedicated T2 role, not a new tier. Spawned first by T1. Owns: domain boundary definition, shared assumptions doc, conflict resolution between specialists, canonical architecture synthesis. Specialists spawn after Lead publishes boundaries + assumptions. Each T2 specialist owns its own T3s — no T3 spans T2 domains.

**T2 conflict resolution** — Lead sends targeted briefs back to conflicting specialists. Cycle limit is a fixed config value (not per-workstream). This mirrors T1's single-pass self-critique: a fixed limit, not a variable one.

**T2 shared assumptions** — Lead publishes cross-cutting concerns (auth, data formats, API conventions, etc.) before specialists start. Specialists design with shared baseline; implicit dependencies pre-empted rather than caught in synthesis.

**Orchestration patterns** — Baked into tier prompts and runner tier-handling logic, not prescribed by T1. T2: Lead + parallel specialists. T3: light mesh within T2 domain. T4: swarm+pipeline. T5: fan-out+consensus.

**Output / review** — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew. Notification is dual: Hans messages Andrew directly + PR opened on VCS. Merge is gated on human sign-off.

**Platform agnosticism** — Core is provider and platform agnostic. Capability levels (`reasoning-heavy`, `capable`, `fast-cheap`) map to models in config. Mixing providers across tiers is supported.

**LLM provider** — Anthropic first implementation. Config supports per-tier provider selection.

**Gateway modification** — Decided against. Agent-teams stays standalone Python. OpenClaw used via runtime adapter only.

**Coding agent runtime** — Claude Code is default T4/T5 runtime. Opt-in `native_teams` flag available for internal Claude Code parallelism — faster but less blackboard visibility. Default `false`.

**Agency-agents integration** — Via git submodule at `agents/`. T1 selects specialists via `config/role_registry.yaml`. `agent_personality` field on task brief; runtime injects as system prompt at spawn time.