
Hans Heinemann 1c99e40f98 docs: purge OpenClaw/Hans specifics from core design

Portability audit — all platform-specific concerns moved to adapter layer:

- Gate Approval UX (Resolved Mechanics): rewritten as platform-agnostic.
  Core: runner writes gate_pending, calls notify_adapter.send(), polls
  blackboard for gate_approved. Universal path: agency CLI writes directly
  to blackboard. Adapter handles its own inbound response bridge internally.

- pending_gates.json removed from core directory structure and runner
  responsibilities — adapter-internal state, not a core concern.

- 'User → Hans → team_runner.start()' → 'User → team_runner.start()'
  Core has no dependency on a specific caller.

- 'notify_adapter.send(...to Andrew via Hans)' → 'notify_adapter.send()'
  throughout design.md and buildspec.md.

- anthropic.py description: 'via OpenClaw or direct API' → 'direct API'
  (anthropic adapter never goes via OpenClaw)

- Output/review decision: 'Hans messages Andrew' → 'notify_adapter.send()'
- Run visibility decision: 'Andrew via Hans' → 'via notify_adapter.send()'
- Decisions log: gate approval and visibility entries rewritten accordingly

Adapter layer correctly unchanged:
  adapters/notify/openclaw.py — OpenClaw-specific, owns its inbound bridge
  adapters/runtime/openclaw.py — OpenClaw sessions_spawn, correctly isolated
  team.yaml example config — adapter selection is config, not core

2026-03-30 14:31:55 -04:00


Tiered Agent Team System — Design Document

Started: 2026-03-14. Last updated: 2026-03-30.

Resolved Design Decisions (formerly Open Questions)

All eight open questions resolved 2026-03-30. Details in Decisions Log.

T3 mesh mechanics → Blackboard-based. T3s write draft task lists, read peers', commit merged plan before T4 dispatch. See T3 Mesh via Blackboard.
T1 output schema → Formal JSON schema defined. See T1 Plan Output Schema.
T5 consensus mechanics → T3 aggregates all T5 results into a joint verdict. Split verdict (partial) triggers retry of failed slices only. See T5 Consensus & Verdict Schema.
Path amendment mechanism → Amending tier writes a path_amendment event to blackboard. Runner monitors events table and notifies the relevant higher tier via a system event. No agent callback plumbing required. See Path Amendment Mechanism.
Failure handling (distributed model) → Distributed ownership confirmed. Runner only owns T1 failure + terminal human escalation. See updated Failure Handling table.
Who makes spawn calls for T3+ tiers → Runner monitors briefs table for status=pending rows and makes all spawn calls. "Distributed ownership" means the tier's output determines brief content — runner is the mechanical arm. Gates (hold on gate_pending) live naturally in the runner's spawn loop.
Gate approval UX → agency approve <run_id> CLI writes gate_approved directly to the blackboard — the universal path, works on any platform. Runner only cares that the event exists, not how it got there. Notify adapter implementations handle their own inbound response routing (e.g. bridging a chat reply to a CLI call) as internal adapter state — not a core concern.
T3 mesh timeout → Escalate to T2 (a domain boundary problem that T2 should re-scope). If T2 also exhausts its retry budget, the failure escalates up the normal ladder to T1 → Andrew gate. No force-commit fallback (it would hide the problem and cause bad T4 dispatch).

Overview

A dynamic, hierarchical multi-agent system for software pipelines. Teams assemble on demand, execute, then disband. Inspired by a blend of Hollywood production (dynamic assembly), consulting firms (structured deliverables, hierarchical synthesis), and two-pizza teams (small autonomous squads, clear domain ownership).

Core Principles

1. Tiers represent cognitive modes, not org chart levels. Each tier thinks differently — strategy, design, coordination, execution, verification. Adding a tier only makes sense if it introduces a genuinely different mode of reasoning.

2. Depth is proportional to complexity. Not every task needs every tier. A config change might only need T3→T4→T5 (verification always runs). A new product needs the full stack. T1 assesses scope and prescribes the path; it is never pre-configured.

3. Goal anchoring at every level. T1's original intent is embedded in every agent's context — not just passed to T2 and forgotten. Every agent knows the end goal even if they only own a slice.

4. Artifacts, not summaries. Tiers pass structured specs downward (JSON task briefs), not paraphrased prose. Meaning is preserved; format is compressed.

5. Verification is mandatory. T5 always runs. Nothing returns to T1 unverified. T5 is a quality gate, not optional — things should work and work well before they surface upward.

6. Provider agnostic. The system makes no assumptions about which LLM provider or platform is in use. Tiers reference capability levels, not specific models. All external dependencies are swappable adapters.

7. Specialist talent pool. Tiers define structure and responsibility. Agent personalities define domain expertise. The two are separate — the same tier can be filled by different specialists depending on the workstream domain.

Tier Definitions

Tier	Role	Owns	Capability Level
T1	Visionary	Goal, constraints, dispatch plan, final acceptance	reasoning-heavy
T2	Architect	System design, interface contracts, workstream boundaries	reasoning-heavy / capable
T3	Squad Lead	Workstream delivery, T4 management, quality gate	capable
T4	Implementer	Atomic task execution (one file, one function, one test)	fast-cheap
T5	Verifier	Validation of T4 output — correctness + intent alignment	capable

T5 runs within T3's scope, not above it. T3 commissions T5 verification of its T4 outputs. T5 is a quality gate, not a management layer.

Capability levels map to actual models per provider in config — the core system never references a specific model name.
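As an illustration of the capability-level indirection described above, here is a minimal sketch. The config shape, the provider name, and the model identifiers are hypothetical placeholders, not values from this project. The tier-to-capability mapping follows the tier table, with T2 arbitrarily pinned to the first of its two listed levels.

```python
from typing import Mapping

# Hypothetical config shape: provider -> capability level -> model identifier.
# "provider-a" and its model identifiers are placeholders, not real names.
MODEL_MAP: Mapping[str, Mapping[str, str]] = {
    "provider-a": {
        "reasoning-heavy": "provider-a-large",
        "capable": "provider-a-medium",
        "fast-cheap": "provider-a-small",
    },
}

# Capability level per tier, from the tier table.
# T2 is listed as "reasoning-heavy / capable"; this sketch picks the first.
TIER_CAPABILITY = {
    "t1": "reasoning-heavy",
    "t2": "reasoning-heavy",
    "t3": "capable",
    "t4": "fast-cheap",
    "t5": "capable",
}


def resolve_model(tier: str, provider: str,
                  model_map: Mapping[str, Mapping[str, str]] = MODEL_MAP) -> str:
    """Look up the configured model for a tier; core code never names a model."""
    capability = TIER_CAPABILITY[tier]
    try:
        return model_map[provider][capability]
    except KeyError as exc:
        raise KeyError(f"no model configured for {provider}/{capability}") from exc
```

Under these assumptions, swapping providers only changes the mapping passed in; the tier logic keeps asking for a capability level.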

Dispatch Model

T1 Owns the Plan

T1 is not just a decomposer — it is the dispatch planner. Its output declares:

Workstreams — the decomposed units of work
Tier path per workstream — which tiers to engage (e.g. [T2, T3, T4, T5] or [T4, T5] for trivial tasks)
Parallelism — which workstreams are independent and can run concurrently

T1 does not prescribe how each tier operates internally. That is the tier's own concern.

T1 Lifecycle — Two Explicit Phases

T1 is invoked twice per run, each with a distinct prompt and purpose:

Phase 1 — Plan:

T1 produces initial dispatch plan (workstreams, tier paths, parallelism, retry budget)
T1 self-critiques its own plan in a single follow-up pass ("what could go wrong, what did I miss?") and amends
Amended plan surfaces to Andrew for approval — no T2s spawn until approval is given

Phase 2 — Accept: After the full T2→T3→T4→T5 pipeline completes, T1 is re-invoked with the final output. It validates against the original goal anchor and either accepts (opens PR) or rejects (escalates back down).

Both phases are named explicitly in the task brief schema and tracked on the blackboard.

Each Tier Owns the Layer Below

Control flow is distributed, not centralised:

T1 manages its T2s
T2 Lead manages T2 specialists and their domain boundaries
T2 specialists each own their T3s
T3 manages its T4s — including dependency graph, parallelism, and T5 commissioning
The runner is thin: bootstrap T1, monitor the blackboard, handle final result and notifications

This means orchestration logic lives in agent prompts and output schemas — not in Python runner code. Adding a new execution pattern means updating a prompt, not the runner.

Tradeoff: Debugging is harder. When something fails mid-chain, you read blackboard logs rather than step through central runner code. This is a tooling problem to solve (good blackboard inspection), not a design flaw to avoid.

Dynamic Paths

Tiers can propose path amendments mid-run (e.g. T3 discovers scope that warrants a T2 pass it didn't get). Amendments are logged to the blackboard. Higher tiers are notified but do not need to approve — it is informational. No tier silently changes the plan.

Orchestration Patterns Per Tier

Different tiers suit different internal coordination patterns. These are baked into the runner's tier-handling logic and the tier prompts — not prescribed by T1.

Tier	Pattern	Rationale
T1	Single agent, two phases	Must be authoritative; plan phase + accept phase
T2 Lead	Coordinator	Spawned first; defines boundaries + shared assumptions; drives conflict resolution; produces canonical architecture
T2 Specialists	Parallel fan-out	Each works independently within its domain; reads Lead's boundaries + shared assumptions doc before starting
T3	Light mesh	Peer coordination within same T2 domain to negotiate task boundaries before T4 dispatch
T4	Swarm + pipeline hybrid	Independent tasks run as swarm; dependent tasks pipeline (T4-A's output feeds T4-B). T3 declares which is which.
T5	Parallel fan-out + consensus	Each T5 reviews its slice independently, then compares notes for a joint verdict — catches both artifact bugs and integration issues

T2 Flow in Detail

T1 spawns T2 Lead Architect with goal + workstream context
Lead defines explicit domain boundaries (who owns what, hard edges)
Lead publishes shared assumptions doc — cross-cutting concerns, key conventions, architectural constraints (auth approach, data formats, API patterns, etc.)
T1 spawns T2 specialists with boundaries + shared assumptions baked into their briefs
Specialists work in parallel, each within their defined domain
Lead reads all proposals, drives conflict resolution with relevant specialists if needed (cycle limit in config — fixed, not per-workstream)
Lead produces canonical architecture → written to blackboard as distinct artifact
T1 (Accept phase) validates canonical architecture against goal anchor
Canonical architecture becomes T3 briefs — each T2 specialist hands off to its own T3s

Horizontal Scaling Within Tiers

T1 — Phase 1: Plan (self-critique → Andrew approval)
│
├── T2: Lead Architect (boundaries + shared assumptions first)
│   ├── T2: Backend Architect  ─┐
│   ├── T2: Frontend Architect  ├─ parallel, within defined domains
│   └── T2: Infra Architect    ─┘
│       │
│       └── (Lead synthesises → conflict resolution if needed → canonical architecture)
│
├── T2 Backend Architect owns:
│   ├── T3: API Squad Lead  ─┐
│   └── T3: DB Squad Lead   ─┴─ light mesh within domain
│           ├── T4: Worker A  ─┐
│           ├── T4: Worker B  ─┼─ swarm / pipeline (T3 decides)
│           └── T4: Worker C  ─┘
│                   └── T5: Verifier(s) — fan-out + consensus
│
└── T1 — Phase 2: Accept (validates against goal anchor → PR)

Use Case Flows

T1 assesses complexity and prescribes the tier path per workstream. Three standard depth profiles:

Full Stack — T1→T2→T3→T4→T5

Complex feature, new product, cross-domain changes

T1 Plan
  → assess complexity (high)
  → output T1 Plan Schema (workstreams, tier paths [T2,T3,T4,T5], parallelism, retry budgets)
  → self-critique pass
  → GATE: surface to Andrew ← approval required

T2 Lead (spawned by runner after approval)
  → receive: goal + full workplan
  → publish: domain boundaries + shared assumptions doc → blackboard
  → GATE (optional): review boundaries before specialists spawn

T2 Specialists (parallel fan-out, wait on Lead)
  → each receives: their domain boundary + shared assumptions
  → produce: architecture proposal for their slice
  → Lead synthesises, drives conflict resolution if needed
  → Lead writes: canonical architecture → blackboard
  → GATE (recommended): review architecture before implementation

Each T2 Specialist → spawns its own T3s (with canonical architecture slice + interface contracts)

T3s (light mesh within T2 domain)
  → write draft task lists to blackboard
  → read peers' lists, reconcile boundaries
  → commit merged task plan before T4 dispatch
  → GATE (optional): review task breakdown

T4s
  → swarm: independent tasks run in parallel
  → pipeline: T4-A output feeds T4-B (T3 declares dependencies)
  → commit to feature branches

T5s (fan-out per T4 slice)
  → each reviews its slice independently
  → T3 collects results → joint verdict
  → GATE (optional): review T5 verdict before T3 marks done
  → partial: T3 retries only failed slices
  → pass: T3 signals workstream done to T2

T2 specialists → signal T2 Lead
T2 Lead → writes integration summary → blackboard

T1 Accept
  → validate against goal anchor
  → open PR, notify_adapter.send(pr summary + url)

Medium Complexity — T1→T3→T4→T5

Config change, isolated bug fix — T1 determines no cross-domain design needed

T1 Plan
  → assess: contained scope, single domain, no T2 architecture needed
  → workplan: tier paths [T3, T4, T5]
  → GATE: Andrew approval

T3s spawned directly by runner
  → receive T1 brief with task context (no T2 architecture layer)
  → T3 light mesh → T4 dispatch → T5 verify → signal done

T1 Accept → PR

Simple / Hotfix — T1→T4→T5

Single file, single function, trivial atomic task

T1 Plan
  → assess: trivial, single workstream
  → tier path: [T4, T5]
  → GATE: Andrew approval

T4 (coding agent)
  → single atomic task, commits

T5 (single verifier, not full fan-out)
  → code review + correctness check
  → pass → T1 Accept → PR

Resolved Mechanics

T3 Mesh via Blackboard

T3s coordinate task boundaries before dispatching T4s. All coordination goes through the blackboard — no direct agent-to-agent messaging.

Each T3 writes its draft task list to the blackboard (one row per proposed T4 task, status draft)
Each T3 reads all sibling T3 draft lists in its T2 domain
T3s amend their lists to resolve overlaps (claim tasks, release duplicates)
Once all T3s in the domain have committed their final task lists (status committed), T4 dispatch begins
No T3 dispatches T4s until all peers in the domain are committed — this prevents duplicate work

The runner monitors for all_committed state and can enforce a timeout (config: t3_mesh_timeout_minutes).
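The all_committed check above could look roughly like the following sketch. The `mesh_tasks` table, its columns, and the `expected_t3s` argument are assumptions made for illustration; the document does not define the blackboard schema.

```python
import sqlite3

# Hypothetical table: one row per proposed T4 task, as in the mesh steps above.
MESH_SCHEMA = """
CREATE TABLE mesh_tasks (
    domain TEXT NOT NULL,
    t3_id  TEXT NOT NULL,
    task   TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('draft', 'committed'))
)
"""


def all_committed(conn: sqlite3.Connection, domain: str, expected_t3s: list) -> bool:
    """True only when every expected T3 has rows and all of them are committed."""
    rows = conn.execute(
        "SELECT t3_id, status FROM mesh_tasks WHERE domain = ?", (domain,)
    ).fetchall()
    statuses = {}
    for t3_id, status in rows:
        statuses.setdefault(t3_id, set()).add(status)
    # A T3 that has written nothing yet has no entry, so it blocks dispatch too.
    return bool(expected_t3s) and all(
        statuses.get(t3) == {"committed"} for t3 in expected_t3s
    )
```

Passing the expected T3 list explicitly is a design choice in this sketch: without it, a T3 that never wrote a draft would be invisible to the query and dispatch could start early.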

T1 Plan Output Schema

T1's Plan phase produces a structured JSON object written to the blackboard. The runner parses this to bootstrap the pipeline.

{
  "run_id": "uuid",
  "goal_anchor": "Original goal — immutable, propagated to every downstream brief",
  "complexity": "high | medium | low",
  "retry_budget_multiplier": 2,
  "workstreams": [
    {
      "id": "ws-backend-api",
      "name": "Backend API",
      "domain": "backend",
      "tier_path": ["t2", "t3", "t4", "t5"],
      "parallel_group": "A",
      "t2_specialist": "agents/engineering/engineering-software-architect.md",
      "notes": "Focus on webhook ingest and retry queue"
    }
  ],
  "parallelism": {
    "groups": {
      "A": ["ws-backend-api", "ws-frontend"],
      "B": ["ws-infra"]
    },
    "sequence": ["A", "B"]
  },
  "self_critique_summary": "Brief plain-text summary of what T1 identified and amended in its self-critique pass"
}

parallel_group and sequence together express inter-workstream dependencies: workstreams in group A run in parallel, and group B starts only after group A completes.
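To make the group ordering concrete, here is a small sketch of how a runner might turn the `parallelism` object into ordered waves of workstreams. The function name and the validation behaviour are assumptions, not part of the schema.

```python
from __future__ import annotations


def execution_waves(plan: dict) -> list[list[str]]:
    """Return workstream ids grouped into waves, in the order given by 'sequence'.

    Workstreams inside one wave may run concurrently; a wave starts only
    after the previous wave has completed.
    """
    groups = plan["parallelism"]["groups"]
    sequence = plan["parallelism"]["sequence"]
    unknown = [name for name in sequence if name not in groups]
    if unknown:
        raise ValueError(f"sequence references unknown groups: {unknown}")
    return [list(groups[name]) for name in sequence]


# The parallelism block from the example plan above.
example_plan = {
    "parallelism": {
        "groups": {"A": ["ws-backend-api", "ws-frontend"], "B": ["ws-infra"]},
        "sequence": ["A", "B"],
    }
}
waves = execution_waves(example_plan)
```

For the example plan this yields two waves: the backend and frontend workstreams first, then infra.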

T5 Consensus & Verdict Schema

T3 aggregates all T5 results into a joint verdict after fan-out completes.

Individual T5 result:

{
  "verifier_id": "uuid",
  "scope": "queue-client",
  "verdict": "pass | fail",
  "issues": ["issue description..."],
  "notes": "human-readable summary"
}

T3 joint verdict (written to blackboard):

{
  "t5_results": [...],
  "joint_verdict": "pass | partial | fail",
  "failed_scopes": ["queue-client"],
  "summary": "Human-readable summary for gate surface and logs"
}

Split verdict handling:

pass → T3 marks workstream done, signals T2
partial → T3 retries only the failed T4 slices (up to retry budget), re-runs T5 on those slices
fail → T3 escalates to T2 (or T1 if shallow path)
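The aggregation can be sketched as below. The document defines the three verdict values but not the exact rule separating partial from fail; this sketch assumes fail means every slice failed and partial means a mix. The summary wording is also illustrative.

```python
from __future__ import annotations


def joint_verdict(t5_results: list[dict]) -> dict:
    """Combine individual T5 results into the joint verdict structure.

    Assumed rule: no failures -> pass, all failures -> fail, otherwise partial.
    """
    if not t5_results:
        raise ValueError("at least one T5 result is required")
    failed = [r["scope"] for r in t5_results if r["verdict"] == "fail"]
    if not failed:
        verdict = "pass"
    elif len(failed) == len(t5_results):
        verdict = "fail"
    else:
        verdict = "partial"
    passed = len(t5_results) - len(failed)
    return {
        "t5_results": t5_results,
        "joint_verdict": verdict,
        "failed_scopes": failed,
        "summary": f"{passed}/{len(t5_results)} slices passed",
    }
```

With a partial verdict, `failed_scopes` is the list T3 would use to retry only the failed T4 slices.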

Spawn Call Ownership

The runner is the single point of contact with the runtime adapter. Tiers do not call sessions_spawn directly — they write output to the blackboard and the runner acts on it.

Flow:

A tier completes and writes child briefs to the briefs table with status=pending
Runner's spawn loop detects pending rows
If a gate is configured at this tier boundary → runner writes gate_pending, notifies Andrew, halts
On gate_approved → runner calls runtime_adapter.spawn() for each pending brief
Spawned agent runs, writes its own child briefs as pending when done → loop continues

This keeps gate logic in one place (the runner's spawn loop), makes all spawn calls auditable from a single location, and means agents only need blackboard read/write access — no runtime adapter tool access required.
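A single pass of that spawn loop might look like the following sketch. The table layouts, the per-tier gate keying, and the `spawn` and `notify` callables are simplifying assumptions; a real runner would call `runtime_adapter.spawn()` and `notify_adapter.send()` and would track gates per run and per boundary.

```python
from __future__ import annotations

import sqlite3
from typing import Callable

# Hypothetical minimal blackboard tables for this sketch.
SCHEMA = """
CREATE TABLE briefs (id TEXT PRIMARY KEY, tier TEXT, status TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, tier TEXT);
"""


def spawn_pending(conn: sqlite3.Connection, gated_tiers: set[str],
                  spawn: Callable[[str], object],
                  notify: Callable[[str], object]) -> list[str]:
    """Run one pass over pending briefs and return the ids spawned in this pass."""
    spawned = []
    pending = conn.execute(
        "SELECT id, tier FROM briefs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for brief_id, tier in pending:
        if tier in gated_tiers:
            kinds = {kind for (kind,) in conn.execute(
                "SELECT kind FROM events WHERE tier = ?", (tier,))}
            if "gate_approved" not in kinds:
                if "gate_pending" not in kinds:
                    # First time at this boundary: record the gate and notify once.
                    conn.execute(
                        "INSERT INTO events (kind, tier) VALUES ('gate_pending', ?)",
                        (tier,))
                    notify(f"gate pending before {tier} spawns")
                continue  # hold: nothing behind this gate spawns yet
        spawn(brief_id)
        conn.execute("UPDATE briefs SET status = 'spawned' WHERE id = ?", (brief_id,))
        spawned.append(brief_id)
    conn.commit()
    return spawned
```

Because the gate check sits inside the loop, agents in this sketch never see gate state; they only write pending briefs.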

Gate Approval UX

Core mechanic (platform-agnostic):

Runner writes gate_pending to blackboard
Runner calls notify_adapter.send() with tier summary + gate context (run_id, gate, summary, what_happens_next)
Runner polls blackboard for gate_approved or gate_rejected
agency approve <run_id> / agency reject <run_id> --reason "..." writes the event directly to the blackboard — the universal approval path, works on any platform with filesystem access

The runner never reads from a state file and never talks to a notify adapter for inbound responses. It only polls the blackboard.

Adapter responsibility: Each notify adapter handles its own inbound response routing. How a human's approval gets translated into an agency approve CLI call is entirely the adapter's concern — not core. Example: an OpenClaw adapter bridges a chat reply to the CLI. A Slack adapter wires up a slash command. A webhook adapter listens on an endpoint. All produce the same result: gate_approved written to blackboard.

Any internal state the adapter needs to resolve ambiguous responses (e.g. which run_id an approval refers to when multiple gates are pending) is managed by the adapter, not the core.
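The polling side of this mechanic can be sketched as follows. `read_events` is a hypothetical callable that returns the gate-related event kinds currently on the blackboard for one gate, and the injectable `clock` and `sleep` parameters exist only to make the sketch testable. The auto-reject on timeout follows the `gate_timeout_minutes` comment in the visibility config.

```python
from __future__ import annotations

import time
from typing import Callable

DECISIONS = ("gate_approved", "gate_rejected")


def wait_for_gate(read_events: Callable[[], list[str]],
                  timeout_s: float,
                  poll_s: float = 1.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a decision event appears; auto-reject once the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        for kind in read_events():
            if kind in DECISIONS:
                return kind  # earliest decision on the blackboard wins
        if clock() >= deadline:
            return "gate_rejected"
        sleep(poll_s)
```

The function has no idea how the event got there, which is the point: a CLI call, a chat bridge, and a webhook all look identical from the runner's side.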

T3 Mesh Timeout

If T3s in a domain fail to commit their task lists within t3_mesh_timeout_minutes:

Runner escalates to T2 — writes a gate_pending escalation event and notifies the T2 specialist that owns the domain. Context: which T3s timed out, what draft lists (if any) exist on the blackboard. T2 re-scopes or clarifies domain boundaries, spawns fresh T3 briefs.
If T2 also exhausts its retry budget → normal escalation ladder: T2 failure → T1 handles → T1 failure → Andrew gate.

Force-committing partial draft lists (optimistic fallback) is explicitly not done — it hides the boundary problem and produces conflicting or duplicate T4 tasks that fail later with less context.

Path Amendment Mechanism

When a mid-run tier discovers scope that warrants a different tier path than T1 prescribed:

The discovering tier writes a path_amendment event to the blackboard:

{
  "kind": "path_amendment",
  "proposed_by": "t3/ws-backend-api",
  "reason": "Discovered auth dependency requires T2 architectural pass",
  "amendment": {
    "workstream": "ws-backend-api",
    "add_tiers": ["t2"],
    "insert_before": "t3"
  }
}

The runner monitors the events table, detects path_amendment, and sends a system event notification to the relevant higher tier
The higher tier is informed, not blocked — it acknowledges and adjusts its understanding
Amendment is logged on the blackboard for audit; no approval gate required (the next scheduled human gate will surface it)

No agent needs callback plumbing. The runner is the notification bridge.
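For illustration, here is one way a consumer of the event could compute the amended tier path from the `amendment` object shown above. The function is hypothetical, and skipping tiers that are already on the path is an assumption of this sketch.

```python
from __future__ import annotations


def apply_amendment(tier_path: list[str], amendment: dict) -> list[str]:
    """Return a new tier path with add_tiers inserted before insert_before."""
    anchor = amendment["insert_before"]
    if anchor not in tier_path:
        raise ValueError(f"tier {anchor!r} is not on the current path")
    # Assumed behaviour: tiers already present are not added twice.
    new_tiers = [t for t in amendment["add_tiers"] if t not in tier_path]
    index = tier_path.index(anchor)
    return tier_path[:index] + new_tiers + tier_path[index:]


amended = apply_amendment(
    ["t3", "t4", "t5"],
    {"workstream": "ws-backend-api", "add_tiers": ["t2"], "insert_before": "t3"},
)
```

The original list is left untouched, so the prescribed path and the amended path can both be kept for audit.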

Shared State

For software pipelines, the repo is the primary blackboard:

T4 workers commit to feature branches
T3 leads review and merge to workstream branches
T2 architects own integration branches
T1 does final integration and acceptance

Supplemented by a SQLite coordination store per run tracking:

In-flight workstreams and their current execution plans
Handoff artifacts and tier status
Retry counts and escalation history
Path amendments (proposed, by whom, timestamp)

Failure Handling

Distributed ownership — each tier handles failures in the tier below it. The runner only handles T1 failure and terminal human escalation.

Failure	Owner	Handler	Action
T4 bad output	T3	`escalation.py` called by T3's context	Retry T4 with corrected brief (up to retry_budget)
T4 blocked	T3	`escalation.py`	Escalate to T3 immediately — no retries
T4 partial output	T3	`escalation.py`	Salvage good parts, re-task remainder
T5 partial verdict	T3	T3 joint verdict logic	Retry failed T4 slices only
T5 full fail	T3	T3 joint verdict logic	Escalate to T2
T3 workstream stuck	T2	T2 specialist prompt + blackboard	Re-scope or split the workstream
T2 design wrong	T1	T1 Accept phase + blackboard	Re-plan; may discard workstream and restart
T1 failure / crash	Runner	`team_runner.py`	Surface to human, halt run
Repeated escalation	Runner	`team_runner.py`	Gate: block until human unblocks

Key distinction: escalation.py is not called by the runner centrally. It is logic that tier agents execute (or the runner executes on their behalf when it detects a timeout or dead agent). The runner only owns the last two rows.

Retry limits prevent infinite loops. Escalation path is always upward, never sideways.

T1 sets a retry budget multiplier during scope assessment (1x simple, 2x complex). Retry budget is a field on the task brief — not hardcoded in the runner.
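A rough sketch of the retry decision, assuming the effective budget is a base retry count scaled by T1's multiplier. The `base_retries` parameter and the function itself are hypothetical; the document only states that the budget lives on the task brief and that a blocked T4 escalates without retries.

```python
def next_action(attempts_so_far: int, base_retries: int, multiplier: int,
                blocked: bool = False) -> str:
    """Decide whether a failed T4 task is retried or escalated.

    attempts_so_far counts retries already used, not the initial attempt.
    """
    if blocked:
        return "escalate"  # blocked tasks escalate immediately, no retries
    budget = base_retries * multiplier
    return "retry" if attempts_so_far < budget else "escalate"
```

Keeping the multiplier as an input rather than a constant mirrors the point above that the budget is carried on the brief, not hardcoded in the runner.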

Agent Talent Pool

The system builds on agency-agents — a library of 50+ pre-built specialist personalities, each with deep domain expertise, quality standards, and specific deliverables.

Division of responsibility:

Our system provides: orchestration, tier structure, task briefs, retries, verification gates, shared state
Agency-agents provides: the specialist knowledge each agent brings to its role

T1 selects the right specialist from the roster when building workstream briefs. The specialist's personality is injected as the system prompt at spawn time.

Default tier-to-specialist mapping for software pipelines:

Tier	Domain	Agent
T1	Strategy	nexus-strategy
T2	Backend	software-architect
T2	Infra	devops-automator
T2	Data	data-engineer
T3	Backend	senior-developer
T3	Reliability	sre
T4	Frontend	frontend-developer
T4	Backend	backend-architect
T4	Database	database-optimizer
T4	DevOps	devops-automator
T4	Mobile	mobile-app-builder
T4	AI/ML	ai-engineer
T4	Security	security-engineer
T4	Docs	technical-writer
T5	Code review	code-reviewer
T5	Integration	testing-reality-checker
T5	API	testing-api-tester
T5	Performance	testing-performance-benchmarker
T5	Security	security-engineer

The roster is not fixed — T1 can select any agent from the library based on workstream needs.

Adapter Layers

Everything external is a swappable adapter. Core logic never imports from adapters directly — always through an interface.

Core (platform-agnostic)
├── team_runner      — thin bootstrap: spawn T1, monitor blackboard, handle result
├── blackboard       — SQLite coordination state
├── task_brief       — schema + validation
└── escalation       — retry logic, failure routing

Adapters (swappable)
├── llm/             — anthropic (now), openai, ollama, any API
├── notify/          — openclaw (now), slack, email, webhook...
├── vcs/             — github (now), gitlab, gitea, bare git...
└── runtime/
    ├── standard     — openclaw sessions_spawn (T1/T2/T3)
    └── coding_agent — claude_code (T4/T5 default), codex, aider...

Swapping providers means writing a new adapter file — nothing in core changes.

T4 and T5 default to the coding agent runtime when available and fall back gracefully to the standard runtime if it is not configured.
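The adapter boundary and the runtime fallback could be expressed along these lines. The Protocol definitions and method signatures are assumptions for illustration; the document names `notify_adapter.send()` and `runtime_adapter.spawn()` but does not specify their parameters.

```python
from __future__ import annotations

from typing import Optional, Protocol


class NotifyAdapter(Protocol):
    # Assumed signature; the real payload carries the tier summary and gate context.
    def send(self, message: str) -> None: ...


class RuntimeAdapter(Protocol):
    # Assumed signature; returns some identifier for the spawned agent.
    def spawn(self, brief: dict) -> str: ...


class RecordingNotifier:
    """Test double that satisfies NotifyAdapter by recording messages."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def send(self, message: str) -> None:
        self.sent.append(message)


def pick_runtime(tier: str, standard: RuntimeAdapter,
                 coding_agent: Optional[RuntimeAdapter] = None) -> RuntimeAdapter:
    """T4 and T5 use the coding agent runtime if configured, else the standard one."""
    if tier in ("t4", "t5") and coding_agent is not None:
        return coding_agent
    return standard
```

Structural typing via `Protocol` means an adapter file does not need to import anything from core to qualify, which fits the rule that core only ever sees the interface.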

Run Visibility Layer

Designed for debugging, test runs, and quality evaluation at each tier. Three interlocking components.

1. Human-Readable Live Log

Structured events from the blackboard rendered as a timestamped, readable stream. agency watch <run_id> tails this live.

[abc123] 12:30:01  T1   PLAN_START    Assessing scope: "Build webhook ingestion system"
[abc123] 12:30:14  T1   PLAN_DONE     3 workstreams — backend-api, infra, docs (2 parallel)
[abc123] 12:30:14  GATE APPROVAL      ⏸  Waiting on approval before T2 spawns
[abc123] 12:31:02  GATE APPROVED      ✓  Approved — continuing
[abc123] 12:31:03  T2   LEAD_START    Lead Architect spawned
[abc123] 12:31:41  T2   BOUNDS_READY  Domain boundaries + shared assumptions published
[abc123] 12:31:42  T2   SPEC_START    3 specialists spawned (parallel): backend, infra, docs
[abc123] 12:32:15  T2   SPEC_DONE     backend-api architecture draft ready
[abc123] 12:32:58  T2   SYNTH_DONE    Canonical architecture written to blackboard
[abc123] 12:32:58  GATE INSPECTION    ⏸  T2 synthesis ready for review
[abc123] 12:33:44  T3   MESH_START    backend-api: 2 squad leads negotiating task boundaries
[abc123] 12:34:01  T3   MESH_DONE     Task split committed — 7 T4 tasks (5 swarm, 2 pipeline)
[abc123] 12:34:02  T4   SWARM_START   5 workers spawned in parallel
[abc123] 12:35:10  T4   DONE          worker-3 auth-middleware ✓
[abc123] 12:35:22  T4   FAIL          worker-4 queue-client ✗  (retry 1/3)
[abc123] 12:36:04  T4   DONE          worker-4 queue-client ✓  (retry resolved)
[abc123] 12:36:05  T5   VERIFY_START  4 verifiers spawned
[abc123] 12:36:45  T5   VERDICT       partial — queue-client needs rework
[abc123] 12:37:12  T5   VERDICT       ✓  all pass — workstream backend-api done

Log level verbose adds per-T4-start/done lines. Default is normal (tier-level events only).
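A line in the format shown above could be rendered with a sketch like this. The column widths are inferred from the sample log, and the function name and parameters are hypothetical.

```python
def format_log_line(run_id: str, clock: str, tier: str, event: str, message: str) -> str:
    """Render one blackboard event as a fixed-width log line.

    Widths inferred from the sample: tier column 5 chars, event column 14 chars.
    """
    return f"[{run_id}] {clock}  {tier:<5}{event:<14}{message}"


line = format_log_line("abc123", "12:31:03", "T2", "LEAD_START", "Lead Architect spawned")
```

Gate lines fit the same layout by passing `GATE` as the tier and, for example, `APPROVAL` as the event.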

2. Inspection Gates

Configurable pause points. When the runner hits a gate, it:

Writes a gate_pending event to the blackboard
Fires notify_adapter.send() with the tier summary + gate context
Halts — no next tier spawns until gate_approved or gate_rejected is written

The tier summary surfaced at each gate includes:

What was produced (the tier artifact in readable form)
What happens next (which agents will spawn, doing what)
Any anomalies flagged by the tier itself

Configurable in team.yaml under visibility.inspection_gates. A strict_mode: true flag enables all gates — recommended for first runs on a new codebase or new goal type.

visibility:
  strict_mode: false
  log_level: normal           # normal | verbose
  inspection_gates:
    t1_plan: true             # always — required by design
    t2_lead: false            # optional — review boundaries before specialists
    t2_synthesis: true        # recommended — review architecture before implementation
    t3_plan: false            # verbose — useful early on, disable once T3 is trusted
    t5_verdict: false         # review T5 joint verdict before T3 marks workstream done
  gate_timeout_minutes: 60    # auto-reject if no response within this window
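How strict_mode interacts with the individual gate flags can be sketched as below, operating on the already-parsed `visibility` mapping. Forcing `t1_plan` on even when the config says otherwise is an assumption drawn from the "always, required by design" comment; the function name is hypothetical.

```python
from __future__ import annotations


def effective_gates(visibility: dict) -> dict[str, bool]:
    """Resolve which inspection gates are active for a run."""
    gates = dict(visibility.get("inspection_gates", {}))
    gates["t1_plan"] = True  # assumed: the T1 plan gate cannot be disabled
    if visibility.get("strict_mode", False):
        return {name: True for name in gates}  # strict mode enables every gate
    return gates


# The visibility block from the example config, as a parsed mapping.
example_visibility = {
    "strict_mode": False,
    "inspection_gates": {
        "t1_plan": True,
        "t2_lead": False,
        "t2_synthesis": True,
        "t3_plan": False,
        "t5_verdict": False,
    },
}
active = effective_gates(example_visibility)
```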

3. Inspection CLI — `cli/agency.py`

agency run <config.yaml>               # start a run, returns run_id
agency watch <run_id>                  # tail live log (follows blackboard events)
agency inspect <run_id>                # interactive tree view of run state
agency inspect <run_id> --tier t2      # jump to T2 artifacts
agency inspect <run_id> --brief <id>   # show full brief + result JSON

agency approve <run_id>                # approve current gate → continue
agency approve <run_id> --note "..."   # approve with a note written to blackboard
agency reject <run_id> --reason "..."  # reject → tier re-invoked
agency pause <run_id>                  # force-pause at next tier boundary
agency resume <run_id>                 # release a manual pause

agency inspect (no flags) renders a live tree:

Run abc123 — "Build webhook ingestion system"
├── T1 Plan ✓
│   └── [view workplan]
├── T2 Architecture ✓  [GATE: pending review]
│   ├── [view domain boundaries]
│   ├── [view shared assumptions]
│   └── [view canonical architecture]
├── T3 backend-api (active)
│   ├── [view task breakdown]
│   └── T4 workers: 3/7 done, 1 retrying, 3 pending
└── T3 infra (pending)
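The approve and reject commands listed above map naturally onto argparse subcommands. The following is a sketch only: the event dictionary shape and the choice to make `--reason` mandatory are assumptions, and writing the event to the blackboard is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser covering only the approve and reject subcommands."""
    parser = argparse.ArgumentParser(prog="agency")
    sub = parser.add_subparsers(dest="command", required=True)

    approve = sub.add_parser("approve", help="approve the current gate")
    approve.add_argument("run_id")
    approve.add_argument("--note", default=None)

    reject = sub.add_parser("reject", help="reject the current gate")
    reject.add_argument("run_id")
    reject.add_argument("--reason", required=True)  # assumed mandatory
    return parser


def to_event(args: argparse.Namespace) -> dict:
    """Translate parsed arguments into the event the CLI would write."""
    if args.command == "approve":
        return {"kind": "gate_approved", "run_id": args.run_id, "note": args.note}
    return {"kind": "gate_rejected", "run_id": args.run_id, "reason": args.reason}
```

A notify adapter that bridges a chat reply to an approval could build the same argument list and reuse `to_event`, so both paths produce identical blackboard events.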

Blackboard Event Vocabulary (extended)

# existing
"spawned" | "completed" | "failed" | "escalated" | "retried"

# new — visibility layer
"gate_pending"     # runner hit a gate, waiting for human
"gate_approved"    # human approved, run continues
"gate_rejected"    # human rejected, tier re-invoked
"gate_paused"      # manual pause via CLI
"gate_resumed"     # manual resume via CLI
"path_amendment"   # mid-run tier proposed path change
"log"              # human-readable log line (level + message)

Decisions Log

T1 dynamic dispatch — T1 assesses scope and prescribes tier path and workstream parallelism. It does not prescribe internal tier coordination patterns.

T1 two-phase lifecycle — T1 has two explicit named phases: Plan and Accept. Plan phase includes self-critique (single pass) then human approval gate before T2s spawn. Accept phase validates final output against goal anchor. Both phases tracked on blackboard with distinct prompts.

T1 self-critique — Single pass only. Diminishing returns on multiple self-critique iterations; the human review after is the real safety net. Self-critique catches obvious gaps; Andrew catches strategic ones.

Distributed ownership — Each tier owns the layer below it. Runner is thin. Tradeoff: distributed control makes the system extensible but debugging requires good blackboard tooling, not central runner traces.

T5 always mandatory — No skipping verification. Things should work and work well before surfacing to T1.

T3 owns T4 and T5 — T3 manages its T4s (dependency graph, swarm vs pipeline, parallelism) and commissions T5 verification of T4 outputs. Runner does not orchestrate T4/T5 centrally.

T2 Lead Architect — Dedicated T2 role, not a new tier. Spawned first by T1. Owns: domain boundary definition, shared assumptions doc, conflict resolution between specialists, canonical architecture synthesis. Specialists spawn after Lead publishes boundaries + assumptions. Each T2 specialist owns its own T3s — no T3 spans T2 domains.

T2 conflict resolution — Lead sends targeted briefs back to conflicting specialists. Cycle limit is a fixed config value (not per-workstream). This parallels the single-pass T1 self-critique: the limit is fixed, not variable.

T2 shared assumptions — Lead publishes cross-cutting concerns (auth, data formats, API conventions, etc.) before specialists start. Specialists design with shared baseline; implicit dependencies pre-empted rather than caught in synthesis.

Orchestration patterns — Baked into tier prompts and runner tier-handling logic, not prescribed by T1. T2: Lead + parallel specialists. T3: light mesh within T2 domain. T4: swarm+pipeline. T5: fan-out+consensus.

Output / review — Nothing merges to main without explicit human approval. T1 opens a PR and fires notify_adapter.send() with the PR summary. Merge is gated on human sign-off. The notify adapter implementation determines how the notification is delivered.

Platform agnosticism — Core is provider and platform agnostic. Capability levels (reasoning-heavy, capable, fast-cheap) map to models in config. Mixing providers across tiers is supported.

LLM provider — Anthropic first implementation. Config supports per-tier provider selection.

Gateway modification — Decided against. Agent-teams stays standalone Python. OpenClaw used via runtime adapter only.

Coding agent runtime — Claude Code is default T4/T5 runtime. Opt-in native_teams flag available for internal Claude Code parallelism — faster but less blackboard visibility. Default false.

Agency-agents integration — Via git submodule at agents/. T1 selects specialists via config/role_registry.yaml. agent_personality field on task brief; runtime injects as system prompt at spawn time.

Spawn call ownership — Runner is the single point of contact with the runtime adapter. Tiers write status=pending child briefs to the blackboard; runner's spawn loop detects and spawns them. Gate logic (hold on gate_pending) lives in the spawn loop — no gate plumbing needed in agents. Agents only need blackboard read/write access.

Gate approval UX — agency approve <run_id> CLI is the universal approval path — writes gate_approved directly to blackboard. Runner only polls blackboard; it does not depend on any specific notification platform. Each notify adapter handles its own inbound response bridge as internal adapter state. Core has no pending_gates.json or platform-specific approval logic.

T3 mesh timeout — Escalate to T2 (the specialist that owns the domain). Timeout means T3s can't agree on task boundaries — a domain boundary problem T2 should fix by re-scoping. If T2 exhausts its retry budget, normal escalation ladder handles it (T1 → Andrew gate). No force-commit fallback.

T3 mesh mechanics — Blackboard-based coordination. T3s write draft task lists, read peers', reconcile overlaps, commit merged plan. No T4 dispatch until all T3s in the domain have committed. Runner enforces timeout (t3_mesh_timeout_minutes in config). Chosen over designated T3 lead or direct messaging — fits distributed ownership model, gives full audit trail for free.

T1 output schema — Formal JSON schema defined (2026-03-30). Fields: run_id, goal_anchor, complexity, retry_budget_multiplier, workstreams[] (id, name, domain, tier_path, parallel_group, t2_specialist, notes), parallelism (groups + sequence), self_critique_summary. parallel_group + sequence handles inter-workstream dependencies.

T5 consensus — T3 aggregates all T5 results into joint verdict: pass | partial | fail. Split verdict (partial) → T3 retries only failed slices, re-runs T5 on those slices. Full fail escalates to T2. T3 writes structured joint verdict to blackboard; this is what the optional T5 gate surfaces to Andrew.

Path amendment mechanism — Amending tier writes path_amendment event to blackboard (structured JSON: proposed_by, reason, amendment). Runner monitors events table, sends system event notification to relevant higher tier. Higher tier is informed, not blocked. No agent callback plumbing. Amendments surface at next scheduled human gate.

Failure handling (distributed) — Confirmed distributed ownership (2026-03-30). escalation.py is logic tiers execute (or runner executes on tier's behalf on timeout/crash), not a central runner concern. Runner only owns: T1 failure, terminal human escalation. See updated Failure Handling table.

Run visibility layer — Added 2026-03-30. Human-readable live log, configurable inspection gates, and cli/agency.py inspection/control commands. Designed for debugging and quality evaluation at each tier during early runs. strict_mode: true enables all gates. Gates surface tier artifacts + "what happens next" summary via notify_adapter.send() — platform-agnostic. Resolves Q3 (T5 consensus surfaces as gate event with human-readable summary). T5 gate (optional) lets the operator review joint verdict before T3 marks workstream done.
