
Hans Heinemann a721db63f6 docs: lock in visibility layer, resolve all 5 open design questions

- Resolve T3 mesh mechanics: blackboard-based draft/commit cycle
- Resolve T1 plan output schema: formal JSON structure with workstreams + parallelism groups
- Resolve T5 consensus: T3 aggregates joint verdict (pass/partial/fail), partial retries failed slices only
- Resolve path amendment mechanism: event-based, runner notifies higher tier, no approval gate
- Resolve failure handling: confirmed distributed ownership, runner owns T1 + terminal only

Add run visibility layer:
- Human-readable live log (normal + verbose modes)
- Configurable inspection gates (t1_plan always, t2_synthesis recommended, others optional)
- strict_mode flag for full gating on early runs
- cli/agency.py: run, watch, inspect, approve, reject, pause, resume
- gate_pending halt loop in team_runner, gate_approved/rejected resume
- Expanded blackboard event vocabulary (gate_*, path_amendment, log)
- t3_task_lists table for mesh coordination state
- Inspection gate flow added to buildspec Key Flows

Build order updated: 16 steps (added cli/ step, clarified runner gate responsibilities)

2026-03-30 13:43:19 -04:00


Tiered Agent Team System — Build Spec

Started: 2026-03-15. Last updated: 2026-03-30. See design.md for the design doc and decisions log.

Language & Runtime

Python 3.11+. Reasons:

- Agent/AI tooling is Python-first
- Clean type hints + dataclasses for schemas
- Agents can read and modify their own orchestration code
- Runs anywhere: no Node, no OpenClaw dependency

Repository

Standalone repo: git@github.com:coding-with-hans-heinemann/the-agency.git

Separate from the OpenClaw workspace. The OpenClaw workspace gets a thin integration layer that calls into this repo; the core stays portable and runs without OpenClaw.

Directory Structure

agent-teams/
├── core/
│   ├── team_runner.py       — run lifecycle, agent spawning
│   ├── blackboard.py        — SQLite coordination state
│   ├── task_brief.py        — schema + validation
│   └── escalation.py        — retry logic, failure routing
│
├── adapters/
│   ├── base/
│   │   ├── llm.py           — abstract LLM interface
│   │   ├── vcs.py           — abstract VCS interface
│   │   ├── notify.py        — abstract notification interface
│   │   └── runtime.py       — abstract agent runtime interface
│   ├── llm/
│   │   ├── anthropic.py     — Claude via OpenClaw or direct API
│   │   ├── openai.py        — GPT / o-series
│   │   └── ollama.py        — local models
│   ├── vcs/
│   │   └── github.py
│   ├── notify/
│   │   └── openclaw.py      — messages Hans who notifies Andrew
│   └── runtime/
│       ├── openclaw.py      — sessions_spawn (general purpose)
│       └── claude_code.py   — coding agent runtime (file/git/exec tools)
│
├── agents/                  — git submodule: msitarzewski/agency-agents
│   ├── engineering/
│   ├── testing/
│   ├── strategy/
│   └── ...                  — full agency-agents roster
│
├── prompts/
│   ├── t1_visionary.md      — fallback if no agent_personality set
│   ├── t2_architect.md
│   ├── t3_squad_lead.md
│   ├── t4_implementer.md
│   └── t5_verifier.md
│
├── config/
│   ├── team.yaml            — example run configuration
│   └── role_registry.yaml   — maps (tier, domain) → agent personality file
│
├── cli/
│   └── agency.py            — run, watch, inspect, approve, reject, pause, resume
│
├── runs/                    — runtime state, one subdir per run_id
│   └── .gitkeep
│
└── README.md

Blackboard

SQLite. One file per run at runs/<run_id>/blackboard.db.

Tables

runs

CREATE TABLE runs (
    run_id      TEXT PRIMARY KEY,
    goal        TEXT NOT NULL,
    status      TEXT NOT NULL,  -- pending | active | review | done | failed
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);

workstreams

CREATE TABLE workstreams (
    workstream_id   TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    name            TEXT NOT NULL,
    tier            INTEGER NOT NULL,
    status          TEXT NOT NULL,  -- pending | active | blocked | done | failed
    owner_agent_id  TEXT,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
);

briefs

CREATE TABLE briefs (
    brief_id        TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    parent_brief_id TEXT,
    workstream_id   TEXT,
    tier            INTEGER NOT NULL,
    role            TEXT NOT NULL,
    status          TEXT NOT NULL,  -- pending | active | done | failed
    payload         TEXT NOT NULL,  -- full JSON brief
    result          TEXT,           -- JSON result when done
    retry_count     INTEGER DEFAULT 0,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
);

events

CREATE TABLE events (
    event_id    TEXT PRIMARY KEY,
    run_id      TEXT NOT NULL,
    brief_id    TEXT,
    kind        TEXT NOT NULL,  -- see event vocabulary below
    detail      TEXT,           -- JSON
    created_at  TEXT NOT NULL
);

Event kind vocabulary:

-- lifecycle
spawned | completed | failed | escalated | retried

-- visibility / gates
gate_pending    -- runner hit an inspection gate, waiting for human
gate_approved   -- human approved via CLI or notify
gate_rejected   -- human rejected, tier re-invoked
gate_paused     -- manual pause via CLI
gate_resumed    -- manual resume via CLI

-- amendments / informational
path_amendment  -- mid-run tier proposed a tier path change
log             -- human-readable log line (detail: {level, message})
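A minimal sketch of how an event row could be written while enforcing the vocabulary above. The helper name `log_event` and the constant `EVENT_KINDS` are hypothetical and not part of the spec; the table definition mirrors the `events` DDL, and the `detail` column is stored as a JSON string.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Event kinds copied from the vocabulary in this section.
EVENT_KINDS = {
    "spawned", "completed", "failed", "escalated", "retried",
    "gate_pending", "gate_approved", "gate_rejected", "gate_paused", "gate_resumed",
    "path_amendment", "log",
}

EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    event_id    TEXT PRIMARY KEY,
    run_id      TEXT NOT NULL,
    brief_id    TEXT,
    kind        TEXT NOT NULL,
    detail      TEXT,
    created_at  TEXT NOT NULL
)
"""


def log_event(conn, run_id, kind, detail=None, brief_id=None) -> str:
    """Insert one event row and return its event_id (hypothetical helper)."""
    if kind not in EVENT_KINDS:
        raise ValueError(f"unknown event kind: {kind!r}")
    event_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO events (event_id, run_id, brief_id, kind, detail, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            event_id,
            run_id,
            brief_id,
            kind,
            json.dumps(detail) if detail is not None else None,
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return event_id
```

Rejecting unknown kinds at write time keeps anything that tails the events table (such as the `watch` command) from having to handle strings outside the vocabulary.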

t3_task_lists (T3 mesh coordination)

CREATE TABLE t3_task_lists (
    entry_id        TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    workstream_id   TEXT NOT NULL,
    t3_agent_id     TEXT NOT NULL,
    status          TEXT NOT NULL,  -- draft | committed
    tasks           TEXT NOT NULL,  -- JSON array of proposed T4 task descriptors
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
);
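The draft/commit states in `t3_task_lists` could be driven by helpers along these lines. This is a sketch under assumptions: the function names (`post_draft`, `commit_entry`, `all_committed`) are hypothetical, and the rule that the mesh is settled once every entry for a run is `committed` is an interpretation of the `draft | committed` status column, not something the spec states outright.

```python
import json
import sqlite3
from datetime import datetime, timezone

T3_TASK_LISTS_DDL = """
CREATE TABLE IF NOT EXISTS t3_task_lists (
    entry_id        TEXT PRIMARY KEY,
    run_id          TEXT NOT NULL,
    workstream_id   TEXT NOT NULL,
    t3_agent_id     TEXT NOT NULL,
    status          TEXT NOT NULL,
    tasks           TEXT NOT NULL,
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL
)
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def post_draft(conn, entry_id, run_id, workstream_id, t3_agent_id, tasks) -> None:
    """Record a T3's proposed task list in the 'draft' state."""
    now = _now()
    conn.execute(
        "INSERT INTO t3_task_lists VALUES (?, ?, ?, ?, 'draft', ?, ?, ?)",
        (entry_id, run_id, workstream_id, t3_agent_id, json.dumps(tasks), now, now),
    )
    conn.commit()


def commit_entry(conn, entry_id) -> None:
    """Flip one entry from 'draft' to 'committed'."""
    cur = conn.execute(
        "UPDATE t3_task_lists SET status = 'committed', updated_at = ? WHERE entry_id = ?",
        (_now(), entry_id),
    )
    if cur.rowcount == 0:
        raise KeyError(entry_id)
    conn.commit()


def all_committed(conn, run_id) -> bool:
    """True when the run has at least one entry and none is still a draft (assumed rule)."""
    total, drafts = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(status = 'draft'), 0) "
        "FROM t3_task_lists WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return total > 0 and drafts == 0
```

A runner could call `all_committed` on each poll and escalate if it is still false when the mesh timeout elapses.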

Task Brief Schema

Every brief passed between tiers is a validated JSON object. goal_anchor is immutable — set by T1, copied verbatim into every downstream brief.

{
  "brief_id": "uuid",
  "run_id": "uuid",
  "parent_brief_id": "uuid | null",
  "tier": 4,
  "role": "implementer",
  "goal_anchor": "Original T1 intent — always propagated unchanged",
  "workstream": "backend-api",
  "task": "Implement POST /webhooks/ingest endpoint",
  "acceptance_criteria": [
    "Accepts JSON payload",
    "Returns 202 on success",
    "Writes to queue"
  ],
  "constraints": [
    "Use existing queue client in src/queue.py",
    "No new dependencies"
  ],
  "context": {
    "relevant_files": ["src/routes/webhooks.py", "src/queue.py"],
    "interface_contract": "..."
  },
  "retry_budget": 3,
  "retry_count": 0,
  "preferred_runtime": "coding_agent",
  "agent_personality": "agents/engineering/engineering-code-reviewer.md",
  "created_at": "ISO-8601"
}

preferred_runtime is optional. T3 sets it to "coding_agent" when spawning T4/T5 for implementation or verification tasks. The runner falls back to "standard" if the coding agent runtime is not configured.

agent_personality is optional. When set, the runtime adapter reads the file and injects its contents as the system prompt at spawn time. When it is not set, the adapter falls back to the generic tier prompt in prompts/.

Adapter Interfaces

LLM (`adapters/base/llm.py`)

class LLMAdapter:
    # Abstract interface: concrete adapters override both methods.
    def complete(self, prompt: str, capability: str, context: dict) -> str: ...
    def resolve_model(self, capability: str) -> str: ...
    # capability: "reasoning-heavy" | "capable" | "fast-cheap"

VCS (`adapters/base/vcs.py`)

class VCSAdapter:
    # Abstract interface: concrete adapters override every method.
    def create_branch(self, name: str) -> None: ...
    def commit(self, files: list[str], message: str) -> str: ...       # returns commit sha
    def create_pr(self, title: str, body: str, head: str, base: str) -> str: ...  # returns pr url
    def get_pr_status(self, pr_id: str) -> str: ...                    # open | merged | closed

Notify (`adapters/base/notify.py`)

class NotifyAdapter:
    # Abstract interface: concrete adapters override send.
    def send(self, message: str, context: dict) -> None: ...

Runtime (`adapters/base/runtime.py`)

class RuntimeAdapter:
    # Abstract interface: concrete adapters override every method.
    def spawn(self, task: str, capability: str, context: dict) -> str: ...  # returns agent_id
    def get_result(self, agent_id: str, timeout_s: int) -> dict: ...
    def kill(self, agent_id: str) -> None: ...

# Two implementations:
#   openclaw.py    — general purpose, uses sessions_spawn, suits T1/T2/T3
#   claude_code.py — coding-specialized, has file/git/exec tools, suits T4/T5
#
# The runner selects runtime based on brief.preferred_runtime:
#   "standard"      → openclaw.py (default)
#   "coding_agent"  → claude_code.py (falls back to standard if unavailable)
#
# Both implementations inject brief.agent_personality as the system prompt
# when spawning, if present. Falls back to generic tier prompt otherwise.
# claude_code.py passes the agent file via --system-prompt flag natively
# (agency-agents was designed for Claude Code's agents/ directory).

Run Config (`config/team.yaml`)

run:
  goal: "Build webhook ingestion system with retry logic and DLQ"
  repo: "git@github.com:org/repo.git"
  base_branch: "main"

adapters:
  llm: anthropic
  vcs: github
  notify: openclaw
  runtime: openclaw

models:
  provider: anthropic          # default provider
  capability_map:
    reasoning-heavy:
      anthropic: claude-opus-4-6
      openai: o3
    capable:
      anthropic: claude-sonnet-4-6
      openai: gpt-4o
      ollama: llama3.1:70b
    fast-cheap:
      anthropic: claude-haiku-3-5
      openai: gpt-4o-mini
      ollama: llama3.2

  # optional: override provider per tier
  tier_overrides:
    t1: { provider: openai, capability: reasoning-heavy }
    t4: { provider: ollama, capability: fast-cheap }

runtime:
  default: openclaw
  coding_agent: claude_code     # used for T4/T5 when available; omit to disable
  native_teams: false           # Claude Code's experimental agent teams — opt-in only
                                # when true: T3 hands full workstream to Claude Code,
                                # which fans out internally. faster but less blackboard
                                # visibility. default: false (explicit T4 spawning)
  # tier_runtime_map (optional overrides):
  #   t1: standard
  #   t2: standard
  #   t3: standard
  #   t4: coding_agent
  #   t5: coding_agent

retry_defaults:
  bad_output: 3
  partial: 2
  blocked: 0    # always escalate immediately

visibility:
  strict_mode: false          # true = all gates on (recommended for first runs)
  log_level: normal           # normal | verbose (verbose = per-T4 start/done lines)
  inspection_gates:
    t1_plan: true             # always — required by design
    t2_lead: false            # optional — review boundaries before specialists spawn
    t2_synthesis: true        # recommended — review architecture before implementation
    t3_plan: false            # verbose — useful early on, disable once T3 is trusted
    t5_verdict: false         # review T5 joint verdict before T3 marks workstream done
  gate_timeout_minutes: 60    # auto-reject if no human response within this window

t3_mesh_timeout_minutes: 10   # max time for T3s to commit task lists before runner escalates
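How `strict_mode` interacts with the individual gate flags can be sketched as below, taking the `visibility` block as an already-parsed dict. The function name `effective_gates` is hypothetical, and forcing `t1_plan` on regardless of the configured value is an interpretation of the "always, required by design" comment.

```python
GATES = ("t1_plan", "t2_lead", "t2_synthesis", "t3_plan", "t5_verdict")


def effective_gates(visibility: dict) -> dict[str, bool]:
    """Return the gate switches the runner should actually honour."""
    if visibility.get("strict_mode", False):
        # strict_mode turns every gate on.
        return {gate: True for gate in GATES}
    configured = visibility.get("inspection_gates", {})
    gates = {gate: bool(configured.get(gate, False)) for gate in GATES}
    gates["t1_plan"] = True  # assumption: cannot be disabled via config
    return gates
```

With the example config above, this yields `t1_plan` and `t2_synthesis` enabled and the other three gates disabled.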

Role Registry (`config/role_registry.yaml`)

Maps (tier, domain) → agent personality file. T1 consults this during scope assessment when selecting specialists for each workstream brief. Adding a new specialist means adding one entry here — no core changes.

t1:
  default: agents/strategy/nexus-strategy.md

t2:
  backend:  agents/engineering/engineering-software-architect.md
  frontend: agents/engineering/engineering-software-architect.md
  infra:    agents/engineering/engineering-devops-automator.md
  data:     agents/engineering/engineering-data-engineer.md
  default:  agents/engineering/engineering-software-architect.md

t3:
  backend:  agents/engineering/engineering-senior-developer.md
  frontend: agents/engineering/engineering-senior-developer.md
  infra:    agents/engineering/engineering-sre.md
  default:  agents/engineering/engineering-senior-developer.md

t4:
  frontend:  agents/engineering/engineering-frontend-developer.md
  backend:   agents/engineering/engineering-backend-architect.md
  database:  agents/engineering/engineering-database-optimizer.md
  devops:    agents/engineering/engineering-devops-automator.md
  mobile:    agents/engineering/engineering-mobile-app-builder.md
  ai:        agents/engineering/engineering-ai-engineer.md
  security:  agents/engineering/engineering-security-engineer.md
  docs:      agents/engineering/engineering-technical-writer.md
  default:   agents/engineering/engineering-senior-developer.md

t5:
  code:        agents/engineering/engineering-code-reviewer.md
  integration: agents/testing/testing-reality-checker.md
  api:         agents/testing/testing-api-tester.md
  performance: agents/testing/testing-performance-benchmarker.md
  security:    agents/engineering/engineering-security-engineer.md
  default:     agents/engineering/engineering-code-reviewer.md

Key Flows

1. Run Kickoff

User → Hans → team_runner.start(goal, config)
  → generate run_id
  → init blackboard (create runs/<run_id>/blackboard.db)
  → build T1 brief (goal_anchor = goal, retry_budget from config)
  → spawn T1 via runtime adapter
  → await T1 workplan

2. T1 Scope Assessment

T1 receives brief
  → assess complexity → decide depth
  → identify workstreams
  → set retry_budget multiplier per workstream (1x simple, 2x complex)
  → emit N workstream briefs for T2 (or T3 if shallow)
  → write workplan to blackboard
  → team_runner spawns the T2s (or T3s, for a shallow run) in parallel

3. T4 Retry Loop (escalation.py)

spawn T4 with brief
  → receive result
  → classify: bad_output | blocked | partial | success

  blocked:
    → log event(escalated)
    → pass to T3 immediately

  bad_output, retries_remaining:
    → amend brief with failure context, increment retry_count
    → re-spawn T4
    → log event(retried)

  bad_output, retries_exhausted:
    → log event(escalated)
    → pass to T3

  partial:
    → write salvageable parts to blackboard
    → re-task remainder with new brief

  success:
    → write result to blackboard
    → log event(completed)
    → notify T3
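The branching in the loop above reduces to a small decision function. The name `next_action` and the returned action labels are hypothetical; the four classifications and their outcomes come from the flow itself.

```python
def next_action(classification: str, retry_count: int, retry_budget: int) -> str:
    """Decide what to do with a classified T4 result.

    Returns one of: "complete", "escalate", "retry", "retask_remainder".
    """
    if classification == "success":
        return "complete"
    if classification == "blocked":
        # Blocked work is never retried; it goes straight to T3.
        return "escalate"
    if classification == "partial":
        return "retask_remainder"
    if classification == "bad_output":
        return "retry" if retry_count < retry_budget else "escalate"
    raise ValueError(f"unknown classification: {classification!r}")
```

With a budget of 3, a brief is retried at `retry_count` 0, 1 and 2, and escalated once `retry_count` reaches 3.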

4. Inspection Gate Flow

runner reaches configured gate (e.g. t2_synthesis)
  → write event(gate_pending, detail={tier, summary, what_happens_next})
  → notify_adapter.send(tier summary to Andrew via Hans)
  → halt: poll blackboard for gate_approved or gate_rejected

  gate_approved:
    → write event(gate_approved)
    → continue run

  gate_rejected:
    → write event(gate_rejected, detail={reason})
    → re-invoke tier with rejection reason in brief context
    → loop back to gate_pending when tier completes again

  gate_timeout (gate_timeout_minutes elapsed):
    → treat as gate_rejected
    → notify Andrew: "Gate timed out, re-invoking tier"

5. Review Gate

T1 completes integration
  → vcs_adapter.create_pr(
      title="[agent-teams] <run_id>: <goal summary>",
      body="<workplan + workstream summaries>",
      head="integration/<run_id>",
      base="main"
    )
  → notify_adapter.send(
      "Run <run_id> complete. PR ready for review: <pr_url>",
      context={run_id, goal, workstreams, pr_url}
    )
  → blackboard: update run status → "review"
  → halt — no auto-merge
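A sketch of the review gate against the adapter interfaces, using structural typing so any object with matching methods will do. The function name `open_review` and the `set_status` callback, a stand-in for the blackboard update, are hypothetical; the title, branch naming, and message text follow the flow above.

```python
from typing import Callable, Protocol


class VCS(Protocol):
    def create_pr(self, title: str, body: str, head: str, base: str) -> str: ...


class Notifier(Protocol):
    def send(self, message: str, context: dict) -> None: ...


def open_review(
    run_id: str,
    goal: str,
    workstreams: list[str],
    vcs: VCS,
    notify: Notifier,
    set_status: Callable[[str, str], None],
    base: str = "main",
) -> str:
    """Open the integration PR, notify, and park the run in "review"."""
    pr_url = vcs.create_pr(
        title=f"[agent-teams] {run_id}: {goal}",
        body="\n".join(f"- {name}" for name in workstreams),
        head=f"integration/{run_id}",
        base=base,
    )
    notify.send(
        f"Run {run_id} complete. PR ready for review: {pr_url}",
        context={"run_id": run_id, "goal": goal, "workstreams": workstreams, "pr_url": pr_url},
    )
    set_status(run_id, "review")
    return pr_url  # no merge call anywhere: the run halts here
```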

Build Order

1. git submodule add https://github.com/msitarzewski/agency-agents agents/ (pulls in the talent pool)
2. config/role_registry.yaml: map tier+domain → agent personality files
3. core/task_brief.py: schema + validation (everything depends on this); include T1 Plan Output Schema
4. core/blackboard.py: SQLite store, all table definitions including t3_task_lists; full event kind vocabulary
5. adapters/base/*: all four abstract interfaces
6. adapters/llm/anthropic.py: first LLM implementation
7. core/escalation.py: retry + failure routing logic (called by tiers, not runner centrally)
8. adapters/runtime/openclaw.py: wire up sessions_spawn + personality injection
9. adapters/runtime/claude_code.py: coding agent runtime, personality via --system-prompt
10. core/team_runner.py: full run lifecycle, covering gate logic (gate_pending halt loop, gate_approved resume), path amendment monitor, T1 failure + terminal escalation only
11. cli/agency.py: run, watch, inspect, approve, reject, pause, resume; watch tails blackboard events and renders live log; inspect renders run tree
12. prompts/: fallback tier prompts (used when no agent_personality set)
13. adapters/vcs/github.py: PR creation + branch management
14. adapters/notify/openclaw.py: Hans notification; used for gate surfaces (tier summary to Andrew)
15. config/team.yaml: example config with full visibility block
16. README.md: how to run, how to add adapters, how to extend the roster; include agency CLI reference

Out of Scope (Phase 2)

- Cost accounting per tier + run rollup
- Parallel workstream progress dashboard
- Additional adapter implementations (GitLab, Slack, OpenAI, Ollama)
- Persistent standing teams
- Web UI for run monitoring
