Files
the-agency/docs/buildspec.md
Hans Heinemann 8f143e779d docs: resolve remaining 3 design questions (spawn ownership, gate UX, mesh timeout)
- Spawn calls: runner owns all runtime_adapter.spawn() calls; tiers write
  status=pending briefs to blackboard, runner's spawn loop acts on them.
  Gate logic lives in the spawn loop — no gate plumbing needed in agents.

- Gate approval UX: Signal reply via Hans + direct CLI both supported.
  Both write gate_approved to blackboard; runner doesn't care which path.
  Hans uses pending_gates.json for multi-run disambiguation.

- T3 mesh timeout: escalate to T2 (domain boundary problem). If T2 also
  exhausts retry budget, normal escalation ladder handles it. No force-commit.

Add pending_gates.json to directory structure and buildspec.
Update runner step in build order with full spawn loop responsibilities.
2026-03-30 14:22:39 -04:00

510 lines
17 KiB
Markdown

# Tiered Agent Team System — Build Spec
_Started: 2026-03-15. Last updated: 2026-03-30._
_See design.md for the design doc and decisions log._
---
## Language & Runtime
**Python 3.11+.** Reasons:
- Agent/AI tooling is Python-first
- Clean type hints + dataclasses for schemas
- Agents can read and modify their own orchestration code
- Runs anywhere — no Node, no OpenClaw dependency
---
## Repository
Standalone repo: `git@github.com:coding-with-hans-heinemann/the-agency.git`
Separate from the OpenClaw workspace. OpenClaw workspace gets a thin integration layer that calls into it. Core is portable and runnable without OpenClaw.
---
## Directory Structure
```
agent-teams/
├── core/
│ ├── team_runner.py — run lifecycle, agent spawning
│ ├── blackboard.py — SQLite coordination state
│ ├── task_brief.py — schema + validation
│ └── escalation.py — retry logic, failure routing
├── adapters/
│ ├── base/
│ │ ├── llm.py — abstract LLM interface
│ │ ├── vcs.py — abstract VCS interface
│ │ ├── notify.py — abstract notification interface
│ │ └── runtime.py — abstract agent runtime interface
│ ├── llm/
│ │ ├── anthropic.py — Claude via OpenClaw or direct API
│ │ ├── openai.py — GPT / o-series
│ │ └── ollama.py — local models
│ ├── vcs/
│ │ └── github.py
│ ├── notify/
│ │ └── openclaw.py — messages Hans who notifies Andrew
│ └── runtime/
│ ├── openclaw.py — sessions_spawn (general purpose)
│ └── claude_code.py — coding agent runtime (file/git/exec tools)
├── agents/ — git submodule: msitarzewski/agency-agents
│ ├── engineering/
│ ├── testing/
│ ├── strategy/
│ └── ... — full agency-agents roster
├── prompts/
│ ├── t1_visionary.md — fallback if no agent_personality set
│ ├── t2_architect.md
│ ├── t3_squad_lead.md
│ ├── t4_implementer.md
│ └── t5_verifier.md
├── config/
│ ├── team.yaml — example run configuration
│ └── role_registry.yaml — maps (tier, domain) → agent personality file
├── cli/
│ └── agency.py — run, watch, inspect, approve, reject, pause, resume
├── runs/ — runtime state, one subdir per run_id
│ └── .gitkeep
├── pending_gates.json — live file: gates currently awaiting approval (written by runner, read by Hans)
└── README.md
```
---
## Blackboard
SQLite. One file per run at `runs/<run_id>/blackboard.db`.
### Tables
**runs**
```sql
CREATE TABLE runs (
run_id TEXT PRIMARY KEY,
goal TEXT NOT NULL,
status TEXT NOT NULL, -- pending | active | review | done | failed
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
```
**workstreams**
```sql
CREATE TABLE workstreams (
workstream_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
name TEXT NOT NULL,
tier INTEGER NOT NULL,
status TEXT NOT NULL, -- pending | active | blocked | done | failed
owner_agent_id TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
```
**briefs**
```sql
CREATE TABLE briefs (
brief_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
parent_brief_id TEXT,
workstream_id TEXT,
tier INTEGER NOT NULL,
role TEXT NOT NULL,
status TEXT NOT NULL, -- pending | active | done | failed
payload TEXT NOT NULL, -- full JSON brief
result TEXT, -- JSON result when done
retry_count INTEGER DEFAULT 0,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
```
**events**
```sql
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
brief_id TEXT,
kind TEXT NOT NULL, -- see event vocabulary below
detail TEXT, -- JSON
created_at TEXT NOT NULL
);
```
**Event kind vocabulary:**
```
-- lifecycle
spawned | completed | failed | escalated | retried
-- visibility / gates
gate_pending -- runner hit an inspection gate, waiting for human
gate_approved -- human approved via CLI or notify
gate_rejected -- human rejected, tier re-invoked
gate_paused -- manual pause via CLI
gate_resumed -- manual resume via CLI
-- amendments / informational
path_amendment -- mid-run tier proposed a tier path change
log -- human-readable log line (detail: {level, message})
```
**t3_task_lists** *(T3 mesh coordination)*
```sql
CREATE TABLE t3_task_lists (
entry_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
workstream_id TEXT NOT NULL,
t3_agent_id TEXT NOT NULL,
status TEXT NOT NULL, -- draft | committed
tasks TEXT NOT NULL, -- JSON array of proposed T4 task descriptors
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
```
---
## Task Brief Schema
Every brief passed between tiers is a validated JSON object. `goal_anchor` is immutable — set by T1, copied verbatim into every downstream brief.
```json
{
"brief_id": "uuid",
"run_id": "uuid",
"parent_brief_id": "uuid | null",
"tier": 4,
"role": "implementer",
"goal_anchor": "Original T1 intent — always propagated unchanged",
"workstream": "backend-api",
"task": "Implement POST /webhooks/ingest endpoint",
"acceptance_criteria": [
"Accepts JSON payload",
"Returns 202 on success",
"Writes to queue"
],
"constraints": [
"Use existing queue client in src/queue.py",
"No new dependencies"
],
"context": {
"relevant_files": ["src/routes/webhooks.py", "src/queue.py"],
"interface_contract": "..."
},
"retry_budget": 3,
"retry_count": 0,
"preferred_runtime": "coding_agent",
"agent_personality": "agents/engineering/engineering-code-reviewer.md",
"created_at": "ISO-8601"
}
```
`preferred_runtime` is optional. T3 sets it to `"coding_agent"` when spawning T4/T5 for implementation or verification tasks. Runner falls back to `"standard"` if the coding agent runtime is not configured.
`agent_personality` is optional. When set, the runtime adapter reads the file and injects its contents as the system prompt at spawn time. Falls back to the generic tier prompt in `prompts/` if not set.
```
```
---
## Adapter Interfaces
### LLM (`adapters/base/llm.py`)
```python
class LLMAdapter:
def complete(self, prompt: str, capability: str, context: dict) -> str
def resolve_model(self, capability: str) -> str
# capability: "reasoning-heavy" | "capable" | "fast-cheap"
```
### VCS (`adapters/base/vcs.py`)
```python
class VCSAdapter:
def create_branch(self, name: str) -> None
def commit(self, files: list[str], message: str) -> str # returns commit sha
def create_pr(self, title: str, body: str, head: str, base: str) -> str # returns pr url
def get_pr_status(self, pr_id: str) -> str # open | merged | closed
```
### Notify (`adapters/base/notify.py`)
```python
class NotifyAdapter:
def send(self, message: str, context: dict) -> None
```
### Runtime (`adapters/base/runtime.py`)
```python
class RuntimeAdapter:
def spawn(self, task: str, capability: str, context: dict) -> str # returns agent_id
def get_result(self, agent_id: str, timeout_s: int) -> dict
def kill(self, agent_id: str) -> None
# Two implementations:
# openclaw.py — general purpose, uses sessions_spawn, suits T1/T2/T3
# claude_code.py — coding-specialized, has file/git/exec tools, suits T4/T5
#
# The runner selects runtime based on brief.preferred_runtime:
# "standard" → openclaw.py (default)
# "coding_agent" → claude_code.py (falls back to standard if unavailable)
#
# Both implementations inject brief.agent_personality as the system prompt
# when spawning, if present. Falls back to generic tier prompt otherwise.
# claude_code.py passes the agent file via --system-prompt flag natively
# (agency-agents was designed for Claude Code's agents/ directory).
```
---
## Run Config (`config/team.yaml`)
```yaml
run:
goal: "Build webhook ingestion system with retry logic and DLQ"
repo: "git@github.com:org/repo.git"
base_branch: "main"
adapters:
llm: anthropic
vcs: github
notify: openclaw
runtime: openclaw
models:
provider: anthropic # default provider
capability_map:
reasoning-heavy:
anthropic: claude-opus-4-6
openai: o3
capable:
anthropic: claude-sonnet-4-6
openai: gpt-4o
ollama: llama3.1:70b
fast-cheap:
anthropic: claude-haiku-3-5
openai: gpt-4o-mini
ollama: llama3.2
# optional: override provider per tier
tier_overrides:
t1: { provider: openai, capability: reasoning-heavy }
t4: { provider: ollama, capability: fast-cheap }
runtime:
default: openclaw
coding_agent: claude_code # used for T4/T5 when available; omit to disable
native_teams: false # Claude Code's experimental agent teams — opt-in only
# when true: T3 hands full workstream to Claude Code,
# which fans out internally. faster but less blackboard
# visibility. default: false (explicit T4 spawning)
# tier_runtime_map (optional overrides):
# t1: standard
# t2: standard
# t3: standard
# t4: coding_agent
# t5: coding_agent
retry_defaults:
bad_output: 3
partial: 2
blocked: 0 # always escalate immediately
visibility:
strict_mode: false # true = all gates on (recommended for first runs)
log_level: normal # normal | verbose (verbose = per-T4 start/done lines)
inspection_gates:
t1_plan: true # always — required by design
t2_lead: false # optional — review boundaries before specialists spawn
t2_synthesis: true # recommended — review architecture before implementation
t3_plan: false # verbose — useful early on, disable once T3 is trusted
t5_verdict: false # review T5 joint verdict before T3 marks workstream done
gate_timeout_minutes: 60 # auto-reject if no human response within this window
t3_mesh_timeout_minutes: 10 # max time for T3s to commit task lists before runner escalates
```
---
## Role Registry (`config/role_registry.yaml`)
Maps `(tier, domain)` → agent personality file. T1 consults this during scope assessment when selecting specialists for each workstream brief. Adding a new specialist means adding one entry here — no core changes.
```yaml
t1:
default: agents/strategy/nexus-strategy.md
t2:
backend: agents/engineering/engineering-software-architect.md
frontend: agents/engineering/engineering-software-architect.md
infra: agents/engineering/engineering-devops-automator.md
data: agents/engineering/engineering-data-engineer.md
default: agents/engineering/engineering-software-architect.md
t3:
backend: agents/engineering/engineering-senior-developer.md
frontend: agents/engineering/engineering-senior-developer.md
infra: agents/engineering/engineering-sre.md
default: agents/engineering/engineering-senior-developer.md
t4:
frontend: agents/engineering/engineering-frontend-developer.md
backend: agents/engineering/engineering-backend-architect.md
database: agents/engineering/engineering-database-optimizer.md
devops: agents/engineering/engineering-devops-automator.md
mobile: agents/engineering/engineering-mobile-app-builder.md
ai: agents/engineering/engineering-ai-engineer.md
security: agents/engineering/engineering-security-engineer.md
docs: agents/engineering/engineering-technical-writer.md
default: agents/engineering/engineering-senior-developer.md
t5:
code: agents/engineering/engineering-code-reviewer.md
integration: agents/testing/testing-reality-checker.md
api: agents/testing/testing-api-tester.md
performance: agents/testing/testing-performance-benchmarker.md
security: agents/engineering/engineering-security-engineer.md
default: agents/engineering/engineering-code-reviewer.md
```
```yaml
```
---
## Key Flows
### 1. Run Kickoff
```
User → Hans → team_runner.start(goal, config)
→ generate run_id
→ init blackboard (create runs/<run_id>/blackboard.db)
→ build T1 brief (goal_anchor = goal, retry_budget from config)
→ spawn T1 via runtime adapter
→ await T1 workplan
```
### 2. T1 Scope Assessment
```
T1 receives brief
→ assess complexity → decide depth
→ identify workstreams
→ set retry_budget multiplier per workstream (1x simple, 2x complex)
→ emit N workstream briefs for T2 (or T3 if shallow)
→ write workplan to blackboard
→ team_runner spawns T2s in parallel
```
### 3. T4 Retry Loop (escalation.py)
```
spawn T4 with brief
→ receive result
→ classify: bad_output | blocked | partial | success
blocked:
→ log event(escalated)
→ pass to T3 immediately
bad_output, retries_remaining:
→ amend brief with failure context, increment retry_count
→ re-spawn T4
→ log event(retried)
bad_output, retries_exhausted:
→ log event(escalated)
→ pass to T3
partial:
→ write salvageable parts to blackboard
→ re-task remainder with new brief
success:
→ write result to blackboard
→ log event(completed)
→ notify T3
```
### 4. Inspection Gate Flow
```
runner reaches configured gate (e.g. t2_synthesis)
→ write event(gate_pending, detail={tier, summary, what_happens_next})
→ notify_adapter.send(tier summary to Andrew via Hans)
→ halt: poll blackboard for gate_approved or gate_rejected
gate_approved:
→ write event(gate_approved)
→ continue run
gate_rejected:
→ write event(gate_rejected, detail={reason})
→ re-invoke tier with rejection reason in brief context
→ loop back to gate_pending when tier completes again
gate_timeout (gate_timeout_minutes elapsed):
→ treat as gate_rejected
→ notify Andrew: "Gate timed out, re-invoking tier"
```
### 5. Review Gate
```
T1 completes integration
→ vcs_adapter.create_pr(
title="[agent-teams] <run_id>: <goal summary>",
body="<workplan + workstream summaries>",
head="integration/<run_id>",
base="main"
)
→ notify_adapter.send(
"Run <run_id> complete. PR ready for review: <pr_url>",
context={run_id, goal, workstreams, pr_url}
)
→ blackboard: update run status → "review"
→ halt — no auto-merge
```
---
## Build Order
1. `git submodule add https://github.com/msitarzewski/agency-agents agents/` — pull the talent pool
2. `config/role_registry.yaml` — map tier+domain → agent personality files
3. `core/task_brief.py` — schema + validation (everything depends on this); include T1 Plan Output Schema
4. `core/blackboard.py` — SQLite store, all table definitions including `t3_task_lists`; full event kind vocabulary
5. `adapters/base/*` — all four abstract interfaces
6. `adapters/llm/anthropic.py` — first LLM implementation
7. `core/escalation.py` — retry + failure routing logic (called by tiers, not runner centrally)
8. `adapters/runtime/openclaw.py` — wire up sessions_spawn + personality injection
9. `adapters/runtime/claude_code.py` — coding agent runtime, personality via --system-prompt
10. `core/team_runner.py` — full run lifecycle: spawn loop (monitors briefs table for `status=pending`, calls runtime_adapter.spawn()), gate logic (gate_pending halt, writes pending_gates.json, gate_approved/rejected resume), path amendment monitor, T3 mesh timeout → T2 escalation, T1 failure + terminal escalation only
11. `cli/agency.py` — run, watch, inspect, approve, reject, pause, resume; `watch` tails blackboard events and renders live log; `inspect` renders run tree
12. `prompts/` — fallback tier prompts (used when no agent_personality set)
13. `adapters/vcs/github.py` — PR creation + branch management
14. `adapters/notify/openclaw.py` — Hans notification; used for gate surfaces (tier summary to Andrew)
15. `config/team.yaml` — example config with full visibility block
16. `README.md` — how to run, how to add adapters, how to extend the roster; include `agency` CLI reference
---
## Out of Scope (Phase 2)
- Cost accounting per tier + run rollup
- Parallel workstream progress dashboard
- Additional adapter implementations (GitLab, Slack, OpenAI, Ollama)
- Persistent standing teams
- Web UI for run monitoring