cw-hans/the-agency

hansheinemann 72bd744664 docs: add design doc and buildspec (#5)

2026-03-16 15:51:14 -04:00

Tiered Agent Team System — Design Document

Started: 2026-03-14. Status: Pre-build, gathering requirements.

Overview

A dynamic, hierarchical multi-agent system for software pipelines. Teams assemble on demand, execute, then disband. Inspired by a blend of Hollywood production (dynamic assembly), consulting firms (structured deliverables, hierarchical synthesis), and two-pizza teams (small autonomous squads, clear domain ownership).

Core Principles

1. Tiers represent cognitive modes, not org chart levels. Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.

2. Depth is proportional to complexity. Not every task needs every tier. A config change might only need T3→T4. A new product needs the full stack.

3. Goal anchoring at every level. T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.

4. Artifacts, not summaries. Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.

5. Verification is bidirectional. Lower tiers verify correctness. Upper tiers verify alignment with original intent. Both directions catch different failure modes.

6. Provider agnostic. The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.

7. Specialist talent pool. Tiers define structure and responsibility. Agent personalities define domain expertise. The two are separate — the same tier can be filled by different specialists depending on the workstream domain.

Tier Definitions

Tier	Role	Owns	Capability Level
T1	Visionary	Goal, constraints, final acceptance, architectural bets	reasoning-heavy
T2	Architect	System design, interface contracts, workstream boundaries	reasoning-heavy / capable
T3	Squad Lead	Workstream delivery, worker coordination, quality gate	capable
T4	Implementer	Atomic task execution (one file, one function, one test)	fast-cheap
T5	Verifier	Validation of T4 output — correctness + intent alignment	capable

T5 runs parallel to T4, not above it. It's a quality gate, not a management layer.

Capability levels map to actual models per provider in config — the core system never references a specific model name.
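A minimal sketch of how that mapping could be resolved, assuming the config is loaded into plain dictionaries. The provider names and model ids are placeholders, not real models, and `resolve_model` is a hypothetical helper. T2 is listed as "reasoning-heavy / capable"; this sketch arbitrarily picks the first.

```python
from typing import Dict

# Tier -> capability level, as in the tier table.
# Assumption: T2 ("reasoning-heavy / capable") is pinned to reasoning-heavy here.
TIER_CAPABILITY: Dict[str, str] = {
    "T1": "reasoning-heavy",
    "T2": "reasoning-heavy",
    "T3": "capable",
    "T4": "fast-cheap",
    "T5": "capable",
}

# Provider -> capability level -> model id. All ids are placeholders.
MODEL_MAP: Dict[str, Dict[str, str]] = {
    "provider-a": {
        "reasoning-heavy": "provider-a-large",
        "capable": "provider-a-medium",
        "fast-cheap": "provider-a-small",
    },
    "provider-b": {
        "reasoning-heavy": "provider-b-large",
        "capable": "provider-b-medium",
        "fast-cheap": "provider-b-small",
    },
}


def resolve_model(tier: str, provider: str) -> str:
    """Return the configured model id for a tier on a given provider."""
    capability = TIER_CAPABILITY.get(tier)
    model = MODEL_MAP.get(provider, {}).get(capability)
    if model is None:
        raise ValueError(f"no model configured for tier {tier!r} on provider {provider!r}")
    return model
```

Core code would only ever call something like `resolve_model("T4", "provider-a")`; the model id itself stays in config.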

Variable Depth

Config change          T3 → T4
New feature            T2 → T3 → T4
Major refactor         T1 → T2 → T3 → T4 → T5
New system / product   T1 → T2 → T3s (parallel) → T4s → T5s

T3 assesses scope on receipt. If a task is simple enough, T3 handles it directly, without escalating upward or waiting for T2 sign-off.
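The depth table can be read as a lookup from task scope to the chain of tiers engaged. The sketch below encodes it that way; the `Scope` enum and function names are illustrative assumptions, and parallel fan-out for new systems is not modelled.

```python
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    CONFIG_CHANGE = "config_change"
    NEW_FEATURE = "new_feature"
    MAJOR_REFACTOR = "major_refactor"
    NEW_SYSTEM = "new_system"


# Tier chains copied from the depth table. A new system engages the same
# tiers as a major refactor, but with parallel T3/T4/T5 agents.
TIER_CHAINS: Dict[Scope, List[str]] = {
    Scope.CONFIG_CHANGE: ["T3", "T4"],
    Scope.NEW_FEATURE: ["T2", "T3", "T4"],
    Scope.MAJOR_REFACTOR: ["T1", "T2", "T3", "T4", "T5"],
    Scope.NEW_SYSTEM: ["T1", "T2", "T3", "T4", "T5"],
}


def tiers_for(scope: Scope) -> List[str]:
    """Return the ordered tiers engaged for a given scope."""
    return list(TIER_CHAINS[scope])


def entry_tier(scope: Scope) -> str:
    """Return the highest tier that needs to be spawned for this scope."""
    return tiers_for(scope)[0]
```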

Horizontal Scaling Within Tiers

Each tier can have multiple agents running in parallel:

T1 (1–2 agents)
├── T2: Backend Architect
│   ├── T3: API Squad Lead
│   │   ├── T4: Worker — endpoint A
│   │   ├── T4: Worker — endpoint B
│   │   └── T5: Verifier
│   └── T3: DB Squad Lead
│       ├── T4: Worker — migrations
│       └── T5: Verifier
├── T2: Frontend Architect
│   └── T3: UI Squad Lead
│       ├── T4: Worker — component X
│       └── T4: Worker — component Y
└── T2: Infra Architect
    └── T3: Platform Squad Lead
        └── T4: Worker — config / deploy

Shared State

For software pipelines, the repo is the primary blackboard:

T4 workers commit to feature branches
T3 leads review and merge to workstream branches
T2 architects own integration branches
T1 does final integration and acceptance

The repo is supplemented by a per-run SQLite coordination store that tracks in-flight workstreams, handoff artifacts, tier status, and retry counts.
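One possible shape for the coordination store, using Python's standard `sqlite3` module. The table and column names are assumptions made for illustration; the document does not specify a schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per workstream, plus a log of handoff artifacts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS workstreams (
    id          TEXT PRIMARY KEY,
    tier        TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'in_flight',
    retry_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS handoff_artifacts (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    workstream_id TEXT NOT NULL REFERENCES workstreams(id),
    from_tier     TEXT NOT NULL,
    to_tier       TEXT NOT NULL,
    payload_json  TEXT NOT NULL
);
"""


def open_blackboard(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the per-run store and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def post_handoff(conn: sqlite3.Connection, workstream_id: str,
                 from_tier: str, to_tier: str, artifact: dict) -> None:
    """Store a structured handoff artifact as JSON."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO handoff_artifacts (workstream_id, from_tier, to_tier, payload_json) "
            "VALUES (?, ?, ?, ?)",
            (workstream_id, from_tier, to_tier, json.dumps(artifact)),
        )


def record_retry(conn: sqlite3.Connection, workstream_id: str) -> int:
    """Increment and return the retry count for a workstream."""
    with conn:
        conn.execute(
            "UPDATE workstreams SET retry_count = retry_count + 1 WHERE id = ?",
            (workstream_id,),
        )
    row = conn.execute(
        "SELECT retry_count FROM workstreams WHERE id = ?", (workstream_id,)
    ).fetchone()
    if row is None:
        raise KeyError(workstream_id)
    return row[0]
```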

Failure Handling

Failure	Handler	Action
T4 bad output	T3	Retry T4 with corrected brief (up to retry_budget)
T4 blocked	T3	Escalate immediately — no retries
T4 partial output	T3	Salvage good parts, re-task remainder
T3 workstream stuck	T2	Re-scope or split the workstream
T2 design wrong	T1	Re-plan; may discard workstream and restart
Repeated escalation	Surface to user	Block until human unblocks

Retry limits prevent infinite loops. Escalation path is always upward, never sideways.
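The routing rules in the failure table could be expressed roughly as follows. The action strings and function names are hypothetical, and whether a partial-output re-task counts against the retry budget is left open here because the document does not say.

```python
from enum import Enum
from typing import Dict


class Failure(Enum):
    BAD_OUTPUT = "bad_output"
    BLOCKED = "blocked"
    PARTIAL_OUTPUT = "partial_output"


# Escalation only ever moves upward; the last stop is the human user.
ESCALATION_PATH: Dict[str, str] = {"T4": "T3", "T3": "T2", "T2": "T1", "T1": "user"}


def next_action(failure: Failure, retries_used: int, retry_budget: int) -> str:
    """Decide what a T3 lead does with a failed T4 task."""
    if failure is Failure.BLOCKED:
        return "escalate"  # no retries for blocked tasks
    if failure is Failure.PARTIAL_OUTPUT:
        return "salvage_and_retask"
    # Bad output: retry with a corrected brief until the budget is spent.
    return "retry" if retries_used < retry_budget else "escalate"


def escalate_to(tier: str) -> str:
    """Return the handler one level up from the given tier."""
    return ESCALATION_PATH[tier]
```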

Agent Talent Pool

The system builds on agency-agents — a library of 50+ pre-built specialist personalities, each with deep domain expertise, quality standards, and specific deliverables.

Division of responsibility:

Our system provides: orchestration, tier structure, task briefs, retries, verification gates, shared state
Agency-agents provides: the specialist knowledge each agent brings to its role

T1 selects the right specialist from the roster when building workstream briefs. The specialist's personality is injected as the system prompt at spawn time.

Default tier-to-specialist mapping for software pipelines:

Tier	Domain	Agent
T1	Strategy	nexus-strategy
T2	Backend	software-architect
T2	Infra	devops-automator
T2	Data	data-engineer
T3	Backend	senior-developer
T3	Reliability	sre
T4	Frontend	frontend-developer
T4	Backend	backend-architect
T4	Database	database-optimizer
T4	DevOps	devops-automator
T4	Mobile	mobile-app-builder
T4	AI/ML	ai-engineer
T4	Security	security-engineer
T4	Docs	technical-writer
T5	Code review	code-reviewer
T5	Integration	testing-reality-checker
T5	API	testing-api-tester
T5	Performance	testing-performance-benchmarker
T5	Security	security-engineer

The roster is not fixed — T1 can select any agent from the library based on workstream needs. Non-engineering agents (design, marketing, product) extend the system to non-software pipelines.

Adapter Layers

Everything external is a swappable adapter. Core logic never imports from adapters directly — always through an interface.

Core (platform-agnostic)
├── team_runner      — run lifecycle, agent spawning, runtime selection
├── blackboard       — SQLite coordination state
├── task_brief       — schema + validation
└── escalation       — retry logic, failure routing

Adapters (swappable)
├── llm/             — anthropic (now), openai, ollama, any API
├── notify/          — openclaw (now), slack, email, webhook...
├── vcs/             — github (now), gitlab, gitea, bare git...
└── runtime/
    ├── standard     — openclaw sessions_spawn (T1/T2/T3)
    └── coding_agent — claude_code (T4/T5 default), codex, aider...

Swapping providers means writing a new adapter file — nothing in core changes.
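As an illustration of "always through an interface", the sketch below defines an LLM adapter as a `typing.Protocol` and has core code depend only on that protocol. The method name `complete`, the registry helpers, and the `EchoLLM` stand-in are assumptions, not the project's actual interface.

```python
from typing import Dict, Protocol


class LLMAdapter(Protocol):
    """Interface that core code depends on; adapters implement it."""

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        ...


class EchoLLM:
    """Stand-in adapter for local testing; makes no network calls."""

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        return f"[{system_prompt}] {user_prompt}"


_LLM_REGISTRY: Dict[str, LLMAdapter] = {}


def register_llm(name: str, adapter: LLMAdapter) -> None:
    """Make an adapter available under a config-facing name."""
    _LLM_REGISTRY[name] = adapter


def get_llm(name: str) -> LLMAdapter:
    """Look up an adapter by the name used in config."""
    return _LLM_REGISTRY[name]


def run_agent(llm: LLMAdapter, personality: str, brief_json: str) -> str:
    """Core-side call: knows the interface, never the concrete adapter."""
    return llm.complete(system_prompt=personality, user_prompt=brief_json)
```

Under this shape, a new provider is one new class plus a `register_llm` call; `run_agent` is untouched.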

T4 and T5 default to the coding agent runtime when available. It provides direct file system access, git operations, and test execution, so there is no need to shuttle file contents through message context. If no coding agent runtime is configured, T4 and T5 fall back to the standard runtime.
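The runtime choice reduces to a small selection rule. A sketch, assuming the set of configured runtimes is known at spawn time; `select_runtime` is a hypothetical name.

```python
from typing import Set

# Tiers that edit files and run tests, per the adapter tree.
CODING_TIERS = {"T4", "T5"}


def select_runtime(tier: str, configured: Set[str]) -> str:
    """Pick the coding agent runtime for T4/T5 if configured, else standard."""
    if tier in CODING_TIERS and "coding_agent" in configured:
        return "coding_agent"
    return "standard"
```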

Decisions

Depth decision — T1 assesses scope on receipt and determines how many tiers to engage. Not pre-configured per task type.

Trigger mechanism — User messages Hans → Hans spins up T1 with the goal. T1 takes it from there.

Output / review — Nothing merges to main without Andrew's explicit approval. T1 opens a PR and surfaces it to Andrew for review. Merge is gated on human sign-off. Notification is dual: Hans messages Andrew directly, and a PR is opened on the VCS platform so Andrew gets notified natively too. This keeps the review step platform-independent — whichever VCS is in use, Hans always notifies Andrew directly as a fallback.

Retry limits — Three failure types, handled differently:

Bad output → retry T4 with a corrected brief (default: 3 retries)
Blocked → escalate immediately, no retries
Partial output → salvage good parts, re-task the remainder

T1 sets a retry budget multiplier during scope assessment (1x simple, 2x complex). Retry budget is a field on the task brief — not hardcoded in the runner.
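A worked example of the budget arithmetic, assuming the multiplier is applied to the default of 3 retries and the result is written onto the brief. The field name `retry_budget` comes from the text; the helper names and the dict-based brief are illustrative.

```python
from typing import Any, Dict

DEFAULT_RETRIES = 3
MULTIPLIERS = {"simple": 1, "complex": 2}


def retry_budget(complexity: str, base: int = DEFAULT_RETRIES) -> int:
    """Scale the base retry count by the multiplier T1 chose."""
    return base * MULTIPLIERS[complexity]


def with_retry_budget(brief: Dict[str, Any], complexity: str) -> Dict[str, Any]:
    """Return a copy of the brief with its retry_budget field set."""
    return {**brief, "retry_budget": retry_budget(complexity)}
```

With these defaults a simple task gets 3 retries and a complex one gets 6. The runner reads the number from the brief rather than from a constant.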

Platform agnosticism — Core logic is provider and platform agnostic. LLMs, VCS, notifications, and agent runtimes are all adapters. Tiers reference capability levels (reasoning-heavy, capable, fast-cheap), not specific model names. Provider-to-model mapping lives in config.

LLM provider — Anthropic first implementation. Config supports per-tier provider selection and mixing providers across tiers (e.g. T1 on OpenAI o3, T4 workers on local Ollama).

Gateway modification — Decided against. Agent-teams stays standalone Python. OpenClaw is used as the runtime adapter via existing primitives (sessions_spawn, sessions_send, subagents) — called through a skill layer. No gateway fork. Keeps platform agnosticism intact and avoids Node/Python mismatch and fork maintenance burden.

Coding agent runtime — Claude Code is the default T4/T5 runtime for software pipelines. It is purpose-built for implementation and verification: direct file access, git ops, test execution. Enters as a runtime adapter — swappable for Codex, Aider, or any equivalent. T1/T2/T3 always use the standard runtime (they reason, they don't edit files).

Claude Code native teams — Claude Code has an experimental agent teams feature that fans out sub-agents internally within a session. Integrated as an opt-in flag (native_teams: true) in the coding_agent runtime adapter. When enabled, T3 hands a full workstream to Claude Code and it parallelises internally — faster, but less granular blackboard visibility. Default is false — explicit T4 spawning is the baseline; native teams is a speed optimisation to enable deliberately.
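The effect of the `native_teams` flag on spawning might look like this. Only the flag name is from the document; the spawn-plan structure is a hypothetical illustration and does not reflect any real Claude Code API.

```python
from typing import Dict, List


def plan_spawns(workstream: str, tasks: List[str],
                native_teams: bool = False) -> List[Dict[str, str]]:
    """Return the units of work T3 hands to the coding agent runtime."""
    if native_teams:
        # One session receives the whole workstream and parallelises internally,
        # so the blackboard sees a single entry.
        return [{"runtime": "coding_agent", "unit": "workstream", "payload": workstream}]
    # Baseline: one explicit T4 spawn per atomic task, each visible on the blackboard.
    return [{"runtime": "coding_agent", "unit": "task", "payload": t} for t in tasks]
```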

Agency-agents integration — Agent personalities sourced from msitarzewski/agency-agents via git submodule. Included as agents/ in the repo. T1 selects specialists from the roster via config/role_registry.yaml. Each task brief carries an agent_personality field (path to the agent .md file) which the runtime adapter injects as the system prompt at spawn time. Adding new specialists means adding an entry to the registry — no core changes required.
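A sketch of how a brief could carry the personality path and how a runtime adapter might turn it into a system prompt. The `agent_personality` and `retry_budget` fields are from the text; the other fields, the in-memory stand-in for `config/role_registry.yaml`, and the `agents/<name>.md` paths are assumptions, since the actual submodule layout is not given.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, Tuple


@dataclass(frozen=True)
class TaskBrief:
    goal: str               # T1's original intent, repeated in every brief
    task: str               # the slice this agent owns
    tier: str
    agent_personality: str  # path to the agent .md file
    retry_budget: int = 3

    def to_json(self) -> str:
        """Serialise the brief for handoff to a lower tier."""
        return json.dumps(asdict(self))


# Stand-in for config/role_registry.yaml. Paths are hypothetical.
ROLE_REGISTRY: Dict[Tuple[str, str], str] = {
    ("T4", "Frontend"): "agents/frontend-developer.md",
    ("T5", "Code review"): "agents/code-reviewer.md",
}


def build_brief(goal: str, task: str, tier: str, domain: str) -> TaskBrief:
    """Look up the specialist for (tier, domain) and attach it to the brief."""
    return TaskBrief(goal, task, tier, ROLE_REGISTRY[(tier, domain)])


def system_prompt_for(brief: TaskBrief, repo_root: Path) -> str:
    """What a runtime adapter would inject as the system prompt at spawn time."""
    return (repo_root / brief.agent_personality).read_text(encoding="utf-8")
```

Adding a specialist in this sketch is one new `ROLE_REGISTRY` entry, which matches the "no core changes" goal.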
