Forge: Agentic SDLC Orchestrator — System Design
An opinionated, buildable design for an AI-driven software development lifecycle tool. Distilled from 13 research topics, 4 synthesis documents, and 70+ interface specifications.
1. What We're Building
Forge is a CLI tool and TypeScript library that orchestrates AI agents through the software development lifecycle. You give it a task — a feature, bug fix, or refactor — and it plans, implements, reviews, tests, and deploys the change. Humans stay in the loop for high-stakes decisions. The system learns from every execution.
It is not a distributed microservice platform. It's a single Bun process backed by SQLite that coordinates LLM calls, tool executions, and human checkpoints through a pipeline.
Design Principles
- Start simple, earn complexity — Sequential pipeline first, parallel agents later
- Learn from everything — Every execution feeds the memory system
- Safe by default — Circuit breakers and human gates baked in, not bolted on
- Observable — Every decision logged with rationale, every action attributed
- Tool-agnostic — Swap LLM providers, CI systems, or git hosts without rewriting agents
2. Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ CLI / API │
│ forge run "add user auth" forge review PR#42 forge test │
└────────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR │
│ │
│ Pipeline: Plan → Implement → Review → Test → Deploy │
│ State Machine · Checkpoints · Human Gates │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ EVENT BUS │ │
│ │ emit() · on() · replay() · snapshot() │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────┬──────────┬──────────┬──────────┬──────────┬──────────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┐┌─────────┐┌─────────┐┌─────────┐┌─────────┐
│ Planner ││Implement││Reviewer ││ Tester ││Deployer │
│ Agent ││ Agent ││ Agent ││ Agent ││ Agent │
└────┬────┘└────┬────┘└────┬────┘└────┬────┘└────┬────┘
│ │ │ │ │
└──────────┴──────────┴──────────┴──────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ TOOL LAYER │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ LLM │ │ Git │ │GitHub│ │ Test │ │ Lint │ │ Shell│ │
│ │Client│ │ Ops │ │ API │ │Runner│ │/Fmt │ │ Exec │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ MEMORY LAYER │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Episodic │ │ Semantic │ │Procedural │ │ Events │ │
│ │ (what │ │ (patterns │ │ (how to │ │ (audit │ │
│ │ happened)│ │ & facts) │ │ do stuff)│ │ trail) │ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ SQLite via Drizzle ORM │
└──────────────────────────────────────────────────────────────────┘
3. Core Abstractions
Everything in the system is built on six core types; every other type composes them.
```typescript
// ─── The Agent Loop ───────────────────────────────────────
// Every agent follows the same cycle: perceive → reason → act → learn.
// The orchestrator runs agents. Agents run tools. Tools do work.
interface Agent {
  id: string;
  type: AgentType;
  /** Run one cycle of the agent loop */
  execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput>;
}

// ─── The Event ────────────────────────────────────────────
// Everything that happens is an event. Events are the source of truth.
// The memory system, audit trail, and observability all consume events.
interface ForgeEvent {
  id: string;
  traceId: string;   // Groups events in one pipeline run
  timestamp: Date;
  source: string;    // Which agent/component emitted this
  type: string;      // Dot-namespaced: "review.finding", "test.failed"
  payload: unknown;
  cost?: { tokens: number; usd: number };
}

// ─── The Tool ─────────────────────────────────────────────
// Tools are the hands of agents. An agent reasons about what to do,
// then executes a tool to do it. Tools are sandboxed and audited.
interface Tool<TInput = unknown, TOutput = unknown> {
  name: string;
  description: string;
  schema: { input: ZodSchema<TInput>; output: ZodSchema<TOutput> };
  execute(input: TInput, ctx: ToolContext): Promise<TOutput>;
}

// ─── The Phase ────────────────────────────────────────────
// The pipeline is a sequence of phases. Each phase has an agent,
// input/output types, and safety controls.
interface Phase {
  name: PhaseName;
  agent: Agent;
  guards: Guard[];            // Pre-conditions to enter this phase
  gates: HumanGate[];         // Human approval checkpoints
  breakers: CircuitBreaker[]; // Safety limits
  next: PhaseName | null;
}

// ─── The Memory ───────────────────────────────────────────
// Memories are what the system learned. They have types, relevance
// scores, and decay over time if not reinforced.
interface Memory {
  id: string;
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;
  embedding?: Float32Array;  // For similarity search
  confidence: number;        // 0-1, decays without reinforcement
  context: string;           // When is this memory relevant?
  createdAt: Date;
  lastAccessed: Date;
  accessCount: number;
}

// ─── The Checkpoint ───────────────────────────────────────
// Checkpoints save pipeline state between phases. If something fails,
// we can resume from the last checkpoint instead of starting over.
interface Checkpoint {
  id: string;
  traceId: string;
  phase: PhaseName;
  state: Record<string, unknown>; // Serialized phase outputs so far
  timestamp: Date;
}
```
4. Module Map
forge/
│
├── src/
│ ├── core/ # Foundation — build this first
│ │ ├── types.ts # Core types from Section 3
│ │ ├── bus.ts # In-memory event bus
│ │ ├── config.ts # Runtime configuration + defaults
│ │ └── errors.ts # Error taxonomy (source, severity, recoverability)
│ │
│ ├── safety/ # Guardrails — build alongside core
│ │ ├── breakers.ts # Circuit breakers (iteration, cost, time, error-rate)
│ │ ├── gates.ts # Human approval gates
│ │ └── budget.ts # Cost tracking and limits
│ │
│ ├── memory/ # Learning foundation — Week 1-2
│ │ ├── schema.ts # Drizzle SQLite schema
│ │ ├── store.ts # Memory CRUD + similarity search
│ │ ├── episodes.ts # Episodic memory (what happened)
│ │ ├── patterns.ts # Semantic memory (pattern extraction)
│ │ ├── procedures.ts # Procedural memory (strategies that work)
│ │ └── consolidate.ts # Knowledge consolidation + pruning
│ │
│ ├── tools/ # Tool layer — Week 1-2
│ │ ├── registry.ts # Tool registry + discovery
│ │ ├── sandbox.ts # Execution sandboxing
│ │ ├── llm.ts # LLM provider abstraction
│ │ ├── git.ts # Git operations
│ │ ├── github.ts # GitHub API (PRs, reviews, webhooks)
│ │ ├── runner.ts # Shell command execution
│ │ ├── linter.ts # ESLint/Biome integration
│ │ └── test-runner.ts # Jest/Vitest execution + parsing
│ │
│ ├── agents/ # Agent implementations — Week 3+
│ │ ├── base.ts # Base agent with loop, reflection, safety
│ │ ├── planner.ts # Requirements → architecture → tasks
│ │ ├── implementer.ts # Tasks → code
│ │ ├── reviewer.ts # Code → findings + risk score
│ │ ├── tester.ts # Code → tests → results → analysis
│ │ └── deployer.ts # Artifact → canary → rollout
│ │
│ ├── orchestrator/ # Pipeline coordination — Week 7-8
│ │ ├── pipeline.ts # Phase sequencing state machine
│ │ ├── checkpoint.ts # State persistence between phases
│ │ └── context.ts # Shared context across agents
│ │
│ └── cli/ # User interface
│ ├── index.ts # CLI entry point
│ ├── commands/ # run, review, test, status, etc.
│ └── ui.ts # Terminal output formatting
│
├── drizzle/ # Database migrations
├── forge.config.ts # Project-level configuration
└── package.json
5. Data Model (SQLite / Drizzle)
This is the ground truth. Everything the system knows lives here.
```typescript
// ─── schema.ts ────────────────────────────────────────────
import { sqliteTable, text, integer, real, blob } from 'drizzle-orm/sqlite-core';

// ─── Events (Append-only audit trail) ─────────────────────
export const events = sqliteTable('events', {
  id:         text('id').primaryKey(),                  // ulid
  traceId:    text('trace_id').notNull(),               // groups one pipeline run
  timestamp:  integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
  source:     text('source').notNull(),                 // agent or component id
  type:       text('type').notNull(),                   // "plan.started", "review.finding", etc.
  phase:      text('phase'),                            // current pipeline phase
  payload:    text('payload', { mode: 'json' }),        // event-specific data
  tokensUsed: integer('tokens_used'),
  costUsd:    real('cost_usd'),
  durationMs: integer('duration_ms'),
});

// ─── Memories (What the system has learned) ───────────────
export const memories = sqliteTable('memories', {
  id:           text('id').primaryKey(),
  type:         text('type').notNull(),                 // episodic | semantic | procedural
  content:      text('content').notNull(),              // Human-readable description
  context:      text('context').notNull(),              // When is this relevant?
  embedding:    blob('embedding'),                      // Float32Array for similarity search
  confidence:   real('confidence').notNull(),           // 0.0 - 1.0
  source:       text('source'),                         // What event created this?
  tags:         text('tags', { mode: 'json' }),         // ["typescript", "testing", "auth"]
  createdAt:    integer('created_at', { mode: 'timestamp_ms' }).notNull(),
  lastAccessed: integer('last_accessed', { mode: 'timestamp_ms' }).notNull(),
  accessCount:  integer('access_count').notNull().default(0),
});

// ─── Patterns (Extracted from episodes) ───────────────────
export const patterns = sqliteTable('patterns', {
  id:          text('id').primaryKey(),
  type:        text('type').notNull(),                  // success | failure | approach
  trigger:     text('trigger').notNull(),               // What situation activates this?
  pattern:     text('pattern').notNull(),               // The pattern itself
  resolution:  text('resolution'),                      // What to do when triggered
  frequency:   integer('frequency').notNull().default(1),
  successRate: real('success_rate'),                    // How often this works
  confidence:  real('confidence').notNull(),
  lastSeen:    integer('last_seen', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Checkpoints (Pipeline state snapshots) ───────────────
export const checkpoints = sqliteTable('checkpoints', {
  id:        text('id').primaryKey(),
  traceId:   text('trace_id').notNull(),
  phase:     text('phase').notNull(),
  state:     text('state', { mode: 'json' }).notNull(),
  timestamp: integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Runs (Pipeline execution history) ────────────────────
export const runs = sqliteTable('runs', {
  id:           text('id').primaryKey(),                // = traceId
  task:         text('task').notNull(),                 // Human description of what was requested
  status:       text('status').notNull(),               // pending | running | completed | failed | cancelled
  currentPhase: text('current_phase'),
  config:       text('config', { mode: 'json' }),       // Runtime config snapshot
  startedAt:    integer('started_at', { mode: 'timestamp_ms' }).notNull(),
  completedAt:  integer('completed_at', { mode: 'timestamp_ms' }),
  totalCostUsd: real('total_cost_usd').default(0),
  totalTokens:  integer('total_tokens').default(0),
  error:        text('error'),                          // If failed, why
});

// ─── Findings (Review/test issues) ────────────────────────
export const findings = sqliteTable('findings', {
  id:          text('id').primaryKey(),
  runId:       text('run_id').notNull(),
  phase:       text('phase').notNull(),                 // review | test
  severity:    text('severity').notNull(),              // info | warning | error | critical
  category:    text('category').notNull(),              // style | security | correctness | performance
  message:     text('message').notNull(),
  file:        text('file'),
  line:        integer('line'),
  confidence:  real('confidence'),
  fixable:     integer('fixable', { mode: 'boolean' }),
  fix:         text('fix'),                             // Suggested code change
  dismissed:   integer('dismissed', { mode: 'boolean' }).default(false),
  dismissedBy: text('dismissed_by'),                    // Who dismissed and why — for learning
});
```
6. The Agent Loop
Every agent runs the same core loop. The only thing that changes is the tools available and the reasoning prompt.
┌──────────────────────────────────────────────────┐
│ AGENT LOOP │
│ │
│ ┌──────────┐ │
│ │ PERCEIVE │ ← Gather context: │
│ │ │ - Task/phase input │
│ │ │ - Relevant memories │
│ │ │ - Previous iteration results │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ REASON │ ← LLM decides: │
│ │ │ - What tool to use next │
│ │ │ - Or: task is complete │
│ │ │ - Or: need human input │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ ACT │ ← Execute tool: │
│ │ │ - Validate input (Zod) │
│ │ │ - Run in sandbox │
│ │ │ - Capture result + metrics │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ LEARN │ ← After each iteration: │
│ │ │ - Log event to bus │
│ │ │ - Check circuit breakers │
│ │ │ - Update working memory │
│ │ │ - Reflect if error occurred │
│ └────┬─────┘ │
│ │ │
│ ├── Continue? ──▶ Loop back to PERCEIVE │
│ ├── Done? ──────▶ Return PhaseOutput │
│ ├── Stuck? ─────▶ Escalate to human │
│ └── Breaker? ───▶ Halt with error │
└──────────────────────────────────────────────────┘
```typescript
// ─── base.ts ──────────────────────────────────────────────
abstract class BaseAgent implements Agent {
  abstract type: AgentType;
  abstract tools: Tool[];
  abstract systemPrompt: string;

  async execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput> {
    let iteration = 0;
    let workingMemory = await this.perceive(input, ctx);

    while (true) {
      iteration++;

      // ── Safety check ──
      const breakerResult = ctx.safety.check({ iteration, cost: ctx.cost, elapsed: ctx.elapsed });
      if (breakerResult.shouldBreak) {
        ctx.bus.emit({ type: `${this.type}.breaker_tripped`, payload: breakerResult });
        throw new CircuitBreakerError(breakerResult);
      }

      // ── Reason: ask LLM what to do ──
      const decision = await ctx.llm.chat({
        system: this.systemPrompt,
        messages: workingMemory.messages,
        tools: this.tools.map(t => t.schema),
      });

      // ── Done? ──
      if (decision.done) {
        const output = decision.result as PhaseOutput;
        ctx.bus.emit({ type: `${this.type}.completed`, payload: output });
        await this.reflect(ctx, 'success');
        return output;
      }

      // ── Act: execute the chosen tool ──
      const tool = this.tools.find(t => t.name === decision.toolCall.name);
      const result = await this.executeTool(tool, decision.toolCall.input, ctx);

      // ── Learn: update context ──
      workingMemory = this.updateWorkingMemory(workingMemory, decision, result);
      if (result.error) {
        await this.reflect(ctx, 'error', result.error);
      }
    }
  }

  private async perceive(input: PhaseInput, ctx: AgentContext): Promise<WorkingMemory> {
    const relevantMemories = await ctx.memory.recall({
      context: input.task,
      type: this.type,
      limit: 10,
    });
    return {
      messages: [
        { role: 'user', content: this.buildPrompt(input, relevantMemories) },
      ],
    };
  }

  private async reflect(ctx: AgentContext, outcome: string, error?: Error) {
    // Post-execution reflection: extract learnings
    const reflection = await ctx.llm.chat({
      system: REFLECTION_PROMPT,
      messages: [{
        role: 'user',
        content: `Outcome: ${outcome}. ${error ? `Error: ${error.message}` : ''}\nWhat should we remember for next time?`,
      }],
    });

    if (reflection.learnings) {
      for (const learning of reflection.learnings) {
        await ctx.memory.store({
          type: 'procedural',
          content: learning.content,
          context: learning.context,
          confidence: learning.confidence,
        });
      }
    }
  }
}
```
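A concrete agent only supplies its tools and prompt; the loop is inherited. A minimal sketch (the `AgentType` value, tool instances, and prompt wording are illustrative, not part of the spec):

```typescript
// Hypothetical concrete agent: tool instances are assumed to come from the tool registry.
class ReviewerAgent extends BaseAgent {
  type: AgentType = 'reviewer';
  tools = [runLinterTool, runSecurityScanTool, llmReviewTool, readFileTool];
  systemPrompt = `You are a code reviewer. Inspect the diff, report findings with
severity and file/line locations, and decide: approve, request_changes, or require_human.`;
}
```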
7. Agent Designs
7.1 Planner Agent
Input: Natural language task description
Output: Implementation plan with architecture, tasks, risk assessment
Tools: read_file, glob, grep, llm_analyze
"Add user authentication"
│
▼
┌─ PERCEIVE ─┐
│ Read existing codebase structure │
│ Recall patterns for "auth" from memory │
│ Check for existing auth utilities │
└──────┬──────┘
▼
┌── REASON ──┐
│ Decompose into tasks: │
│ 1. Design auth schema │
│ 2. Create login/register endpoints │
│ 3. Add session middleware │
│ 4. Protect routes │
│ Estimate risk: MEDIUM (new feature) │
│ Identify dependencies │
└──────┬──────┘
▼
OUTPUT: ImplementationPlan {
architecture: { components, interfaces, decisions }
tasks: Task[] // ordered, with dependencies
risk: RiskAssessment // determines review depth
estimates: { complexity, effort }
}
7.2 Implementer Agent
Input: Implementation plan + task list
Output: Code changes (files modified/created)
Tools: read_file, write_file, run_command, llm_generate, search_code
ImplementationPlan
│
▼
For each task (sequential MVP, parallel later):
┌─ PERCEIVE ─┐
│ Read target files │
│ Understand existing patterns │
│ Load procedural memories for this domain │
└──────┬──────┘
▼
┌── REASON ──┐
│ Generate code change │
│ Self-validate: does this match the spec? │
│ Check for obvious issues │
└──────┬──────┘
▼
┌─── ACT ────┐
│ Write files │
│ Run typecheck │
│ Run affected tests │
│ Fix issues if found, loop back │
└──────┬──────┘
▼
OUTPUT: CodeChanges {
files: FileChange[] // path, before, after
testsAdded: string[] // new test files
validated: boolean // typecheck + tests pass
}
7.3 Reviewer Agent
Input: Code changes (diff)
Output: Review with findings, risk score, gate decision
Tools: run_linter, run_security_scan, llm_review, read_file
CodeChanges
│
▼
┌─ Layer 1: Static Analysis ──┐ (fast, cheap, deterministic)
│ ESLint / Biome │
│ TypeScript strict check │
│ Formatting check │
└──────────┬───────────────────┘
▼
┌─ Layer 2: Security Scan ────┐ (fast, important)
│ Secret detection │
│ Dependency vulnerability │
│ Known insecure patterns │
└──────────┬───────────────────┘
▼
┌─ Layer 3: AI Review ────────┐ (slow, expensive — only if risk > low)
│ Logic correctness │
│ Edge cases │
│ Performance implications │
│ Architecture fit │
└──────────┬───────────────────┘
▼
┌─ Synthesis ─────────────────┐
│ Deduplicate findings │
│ Calculate risk score │
│ Determine gate decision │
└──────────┬───────────────────┘
▼
OUTPUT: ReviewResult {
findings: Finding[]
riskScore: { total, level: 'low'|'medium'|'high'|'critical' }
decision: 'approve' | 'request_changes' | 'require_human'
}
Risk-based review depth:
| Risk Level | Static | Security | AI Review | Human Required |
|---|---|---|---|---|
| Low | Yes | Yes | No | No |
| Medium | Yes | Yes | Yes | Optional |
| High | Yes | Yes | Yes | Yes |
| Critical | Yes | Yes | Yes | Yes + Architect |
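The same mapping as code, in rough sketch form (the `REVIEW_DEPTH` table and `ReviewDepth` shape are illustrative, not part of the interfaces above):

```typescript
type RiskLevel = 'low' | 'medium' | 'high' | 'critical';

interface ReviewDepth {
  static: boolean;
  security: boolean;
  aiReview: boolean;
  human: 'none' | 'optional' | 'required' | 'required_plus_architect';
}

// Mirrors the risk-based review depth table above.
const REVIEW_DEPTH: Record<RiskLevel, ReviewDepth> = {
  low:      { static: true, security: true, aiReview: false, human: 'none' },
  medium:   { static: true, security: true, aiReview: true,  human: 'optional' },
  high:     { static: true, security: true, aiReview: true,  human: 'required' },
  critical: { static: true, security: true, aiReview: true,  human: 'required_plus_architect' },
};
```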
7.4 Tester Agent
Input: Code changes + existing test suite
Output: Test results + failure analysis + coverage
Tools: run_tests, llm_analyze, read_file, write_file
CodeChanges
│
▼
┌─ Select ────────────────────┐
│ Which tests to run? │
│ - Tests covering changed files (always)
│ - Related integration tests (if medium+ risk)
│ - Full suite (if high+ risk)
└──────────┬───────────────────┘
▼
┌─ Execute ───────────────────┐
│ Run selected tests │
│ Retry failures once (flaky?) │
│ Collect coverage │
└──────────┬───────────────────┘
▼
┌─ Analyze ───────────────────┐
│ If failures: │
│ - Classify: real bug vs flaky vs env issue
│ - Root cause analysis (LLM)
│ - Suggest fix │
│ If low coverage: │
│ - Identify gaps │
│ - Generate missing tests │
└──────────┬───────────────────┘
▼
OUTPUT: TestResult {
summary: { total, passed, failed, skipped }
coverage: { line, branch, function, diff }
failures: FailureAnalysis[] // with root cause + fix
generatedTests: TestFile[] // new tests if gaps found
}
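A sketch of the failure classification step (real bug vs flaky vs environment issue). The `runSingleTest` helper is a hypothetical wrapper around the run_tests tool, and the heuristics are assumptions:

```typescript
type FailureClass = 'real_bug' | 'flaky' | 'env_issue';

async function classifyFailure(testId: string, firstRun: { error: string }): Promise<FailureClass> {
  const retry = await runSingleTest(testId);               // Re-run the failing test once
  if (retry.passed) return 'flaky';                        // Passed on retry → likely flaky
  if (/ECONNREFUSED|ENOENT|timeout/i.test(firstRun.error)) // Infrastructure-looking symptoms
    return 'env_issue';
  return 'real_bug';                                       // Deterministic failure → LLM root-cause analysis
}
```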
7.5 Deployer Agent
Input: Validated code + test results
Output: Deployment status
Tools: run_command, github_api, read_file
ValidatedCode + TestResults
│
▼
┌─ Build ─────────────────────┐
│ Run build command │
│ Verify artifact │
└──────────┬───────────────────┘
▼
┌─ Gate: Human Approval ──────┐ (always for production)
│ Show summary: │
│ - What changed │
│ - Risk score │
│ - Test results │
│ - Findings │
│ Wait for approval │
└──────────┬───────────────────┘
▼
┌─ Deploy ────────────────────┐
│ Strategy based on risk: │
│ - Low: direct deploy │
│ - Medium: canary (5%→25%→100%)
│ - High: canary (5%→10%→25%→50%→100%)
└──────────┬───────────────────┘
▼
┌─ Verify ────────────────────┐
│ Health check endpoints │
│ Error rate vs baseline │
│ Latency vs baseline │
│ Auto-rollback if unhealthy │
└──────────┬───────────────────┘
▼
OUTPUT: DeploymentResult {
status: 'healthy' | 'degraded' | 'rolled_back'
metrics: { errorRate, latency, throughput }
url: string
}
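The risk-to-rollout mapping in the Deploy step could look like this sketch (the stage percentages mirror the diagram; the exact shape is illustrative):

```typescript
// Traffic percentages per rollout stage, keyed by risk level.
const ROLLOUT_STAGES: Record<'low' | 'medium' | 'high', number[]> = {
  low:    [100],                 // Direct deploy
  medium: [5, 25, 100],          // Short canary
  high:   [5, 10, 25, 50, 100],  // Extended canary
};
```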
8. Safety System
8.1 Circuit Breakers
Four breakers run continuously. Any one can halt execution.
```typescript
interface SafetyConfig {
  breakers: {
    iteration: {
      default: 10,
      planning: 20,
      implementation: 50,
      testing: 5,
      deployment: 3,
      stagnationThreshold: 3,        // Consecutive iterations with no progress
    };
    cost: {
      perPhase: {                    // USD
        planning: 5,
        implementation: 10,
        review: 2,
        testing: 3,
        deployment: 2,
      },
      perRun: 50,
      perDay: 200,
    };
    time: {                          // milliseconds
      planning: 30 * 60_000,         // 30 min
      implementation: 60 * 60_000,   // 1 hour
      review: 30 * 60_000,           // 30 min
      testing: 20 * 60_000,          // 20 min
      deployment: 15 * 60_000,       // 15 min
      totalPipeline: 120 * 60_000,   // 2 hours
    };
    errorRate: {
      window: 5 * 60_000,            // 5 minute sliding window
      warning: 0.10,                 // 10%
      critical: 0.25,                // 25% → halt
    };
  };
}
```
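A minimal sketch of how a breaker check against these limits might look (the `BreakerCheck` shape and context field names are assumptions for illustration):

```typescript
interface BreakerCheck { shouldBreak: boolean; reason?: string }

function checkBreakers(
  limits: { maxIterations: number; maxCostUsd: number; maxDurationMs: number },
  ctx: { iteration: number; costUsd: number; elapsedMs: number },
): BreakerCheck {
  if (ctx.iteration > limits.maxIterations) return { shouldBreak: true, reason: 'iteration limit exceeded' };
  if (ctx.costUsd > limits.maxCostUsd)      return { shouldBreak: true, reason: 'cost budget exceeded' };
  if (ctx.elapsedMs > limits.maxDurationMs) return { shouldBreak: true, reason: 'time limit exceeded' };
  return { shouldBreak: false };
}
```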
8.2 Human Gates
```typescript
// Gates are checkpoints where execution pauses for human input.
// They fire based on conditions, not every time.
const GATES: HumanGate[] = [
  {
    id: 'architecture_approval',
    phase: 'planning',
    condition: (plan) => plan.risk.level === 'high' || plan.risk.level === 'critical',
    prompt: 'Review proposed architecture before implementation begins.',
    timeout: 24 * 60 * 60_000, // 24 hours
  },
  {
    id: 'production_deploy',
    phase: 'deployment',
    condition: (ctx) => ctx.environment === 'production',
    prompt: 'Approve production deployment.',
    timeout: 60 * 60_000, // 1 hour
  },
  {
    id: 'security_findings',
    phase: 'review',
    condition: (review) => review.findings.some(f => f.severity === 'critical' && f.category === 'security'),
    prompt: 'Critical security finding requires human review.',
    timeout: 12 * 60 * 60_000, // 12 hours
  },
  {
    id: 'cost_overrun',
    phase: '*',
    condition: (ctx) => ctx.cost.current > ctx.cost.budget * 0.8,
    prompt: 'Approaching cost budget. Continue?',
    timeout: 2 * 60 * 60_000, // 2 hours
  },
];
```
8.3 Automation Ladder
The system starts conservative and earns autonomy based on track record.
Level 0 ─── Human does everything (current state)
│
│ After: system deployed, basic metrics working
▼
Level 1 ─── AI suggests, human decides
│ - Review comments are suggestions only
│ - Test failures analyzed but human fixes
│ - Deploy requires explicit approval
│
│ After: false positive rate < 20%, 50+ successful runs
▼
Level 2 ─── AI acts, human reviews
│ - Auto-fix formatting and simple lint issues
│ - Auto-approve low-risk reviews
│ - Still requires human for medium+ risk
│
│ After: 200+ runs, <5% false positive rate, 0 missed critical bugs
▼
Level 3 ─── AI acts, human notified
│ - Auto-merge low-risk PRs
│ - Auto-deploy to staging
│ - Human notified, can override within window
│
│ After: 500+ runs, proven safety record
▼
Level 4 ─── Full autonomy (low-risk only)
- Fully autonomous for low-risk changes
- Human gates remain for medium+ risk
- Human gates ALWAYS remain for production deploys
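A sketch of how the ladder could gate autonomous actions (the action names and level thresholds mirror the ladder above; the function is illustrative, not a fixed API):

```typescript
function allowedWithoutHuman(
  level: 0 | 1 | 2 | 3 | 4,
  action:
    | 'suggest_review_comment'
    | 'auto_fix_lint'
    | 'auto_approve_low_risk'
    | 'auto_merge_low_risk_pr'
    | 'deploy_production',
): boolean {
  if (action === 'deploy_production') return false;           // Always human-gated
  if (action === 'suggest_review_comment') return level >= 1; // Level 1: AI suggests
  if (action === 'auto_fix_lint') return level >= 2;          // Level 2: AI acts, human reviews
  if (action === 'auto_approve_low_risk') return level >= 2;
  if (action === 'auto_merge_low_risk_pr') return level >= 3; // Level 3: AI acts, human notified
  return false;
}
```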
9. Event Bus & Observability
```typescript
// ─── bus.ts ───────────────────────────────────────────────
// Simple in-memory pub/sub. Events are also persisted to SQLite.
class EventBus {
  private handlers = new Map<string, Set<EventHandler>>();
  private db: DrizzleDB;

  async emit(event: Omit<ForgeEvent, 'id' | 'timestamp'>): Promise<void> {
    const full: ForgeEvent = {
      ...event,
      id: ulid(),
      timestamp: new Date(),
    };

    // Persist to SQLite (append-only)
    await this.db.insert(events).values(full);

    // Notify subscribers
    const typeHandlers = this.handlers.get(event.type) ?? new Set();
    const wildcardHandlers = this.handlers.get('*') ?? new Set();
    for (const handler of [...typeHandlers, ...wildcardHandlers]) {
      handler(full);
    }
  }

  on(type: string, handler: EventHandler): () => void { /* subscribe */ }

  async replay(traceId: string): Promise<ForgeEvent[]> {
    return this.db
      .select()
      .from(events)
      .where(eq(events.traceId, traceId))
      .orderBy(events.timestamp);
  }
}
```
Key events emitted by the system:
| Event Type | Source | When |
|---|---|---|
| run.started | Orchestrator | Pipeline begins |
| phase.entered | Orchestrator | Each phase transition |
| agent.iteration | Base Agent | Each loop iteration |
| tool.executed | Tool Layer | Each tool call |
| finding.detected | Reviewer | Issue found in code |
| test.failed | Tester | Test failure |
| gate.requested | Safety | Human approval needed |
| gate.approved | Safety | Human approved |
| breaker.tripped | Safety | Circuit breaker fired |
| memory.stored | Memory | New learning saved |
| run.completed | Orchestrator | Pipeline finished |
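A short usage sketch for wiring subscribers to these events (assumes a constructed `EventBus` instance from the snippet above):

```typescript
declare const bus: EventBus;

// React to tripped breakers, e.g. to notify the operator
bus.on('breaker.tripped', (e) => console.error('halted:', e.payload));

// Wildcard subscription drives a live CLI progress view
const unsubscribe = bus.on('*', (e) => process.stdout.write(`[${e.type}] from ${e.source}\n`));

// Later: stop listening
unsubscribe();
```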
10. Feedback Loops (The Nervous System)
Feedback loops are the P0 foundation — the first thing to build because everything else depends on information flowing back to where it's useful. Five distinct loops operate at different timescales.
10.1 The Five Loops
┌─────────────────────────────────────────────────────────────────────────┐
│ FEEDBACK LOOP MAP │
│ │
│ LOOP 1: INNER (ms) Within one agent iteration │
│ ┌──────────────────────────────────────────────┐ │
│ │ Tool call → Result → Reason → Adjust → Next │ ←── Tightest loop │
│ └──────────────────────────────────────────────┘ │
│ │
│ LOOP 2: PHASE (min) Between pipeline phases │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Implement → Review → Findings → Fix → Re-review │ │
│ │ Implement → Test → Failures → Fix → Re-test │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 3: RUN (min-hr) After a full pipeline completes │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Run completes → Reflect → Extract patterns → Store memory │ │
│ │ → Next run recalls memories → Better decisions │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 4: HUMAN (hr-days) Human feedback integration │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Agent suggests → Human dismisses → Confidence decreases │ │
│ │ Agent suggests → Human approves with edits → Learn preferences │ │
│ │ Agent misses issue → Human catches it → New pattern learned │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 5: PRODUCTION (days) Deployed code feedback │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Deploy → Monitor → Anomaly detected → Correlate with change │ │
│ │ → Generate bug report → Feed into planning as new task │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
10.2 Loop 1: Inner Loop (already in agent design)
This is the perceive → reason → act → learn cycle inside every agent. Each tool call result feeds directly back into the next LLM reasoning step. No special infrastructure needed — it's the agent loop itself.
10.3 Loop 2: Phase Loop (the bounce-back)
When Review or Test finds issues, the pipeline doesn't just fail — it bounces back to the Implementer to fix things. This is the most important loop for code quality.
```typescript
// ─── In the orchestrator pipeline ─────────────────────────
interface PhaseLoopConfig {
  maxBounces: number; // How many review→fix→review cycles allowed
  phases: {
    // After review, if changes requested, bounce back to implement
    review: {
      onChangesRequested: 'implementation', // Go back to this phase
      maxBounces: 3,
    },
    // After test, if failures found and auto-fixable, bounce back
    testing: {
      onFailure: 'implementation',
      maxBounces: 2,
    },
  };
}

// How the orchestrator handles bounces:
async function runPipelineWithBounces(task: string, ctx: PipelineContext) {
  const plan = await runPhase('planning', { task }, ctx);
  let code = await runPhase('implementation', plan, ctx);

  // Review loop: implement → review → fix → re-review (max 3x)
  let review: ReviewResult | undefined;
  let reviewBounces = 0;
  while (reviewBounces < 3) {
    review = await runPhase('review', code, ctx);
    if (review.decision === 'approve') break;
    if (review.decision === 'require_human') {
      await ctx.gates.requestHumanReview(review);
      break;
    }

    // Bounce back: feed findings to implementer
    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'review',
      to: 'implementation',
      bounce: ++reviewBounces,
      findings: review.findings,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFindings: review.findings, // ← Feedback flows here
    }, ctx);
  }

  // Test loop: implement → test → fix → re-test (max 2x)
  let tests: TestResult | undefined;
  let testBounces = 0;
  while (testBounces < 2) {
    tests = await runPhase('testing', code, ctx);
    if (tests.summary.failed === 0) break;

    // Only auto-fix if failures are analyzable
    const fixable = tests.failures.filter(f => f.suggestedFix && f.confidence > 0.7);
    if (fixable.length === 0) {
      await ctx.gates.requestHumanHelp(tests.failures);
      break;
    }

    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'testing',
      to: 'implementation',
      bounce: ++testBounces,
      failures: fixable,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFailures: fixable, // ← Feedback flows here
    }, ctx);
  }

  // Deploy only if all gates pass
  await runPhase('deployment', { code, review, tests }, ctx);
}
```
Key insight: The phase loop carries structured feedback — not just "it failed" but Finding[] and FailureAnalysis[] with specific file/line locations, root causes, and suggested fixes. This is what makes the fix cycle productive rather than a blind retry.
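A sketch of what that structured feedback could look like when handed back to the implementer (field names mirror the findings table and test output, but the exact shape is illustrative):

```typescript
interface BounceFeedback {
  // Review findings with precise locations and suggested fixes
  findings?: Array<{ file: string; line?: number; severity: string; message: string; fix?: string }>;
  // Analyzed test failures with root causes
  failures?: Array<{ test: string; rootCause: string; suggestedFix: string; confidence: number }>;
}
```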
10.4 Loop 3: Run Loop (post-run reflection)
After an entire pipeline run completes (or fails), the system reflects on the whole execution to extract durable learnings.
```typescript
// ─── Triggered automatically after every pipeline run ─────
interface RunReflection {
  // What happened in this run?
  summary: {
    task: string;
    outcome: 'success' | 'failure' | 'partial';
    phases: PhaseOutcome[];
    totalCost: number;
    totalDuration: number;
    bounces: { phase: string; count: number }[];
  };
  // What should we remember?
  learnings: Learning[];
}

interface Learning {
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;    // Human-readable insight
  context: string;    // When is this relevant?
  confidence: number; // How sure are we?
  source: string;     // Which event triggered this?
}

async function reflectOnRun(traceId: string, ctx: Context): Promise<RunReflection> {
  // Replay all events from this run
  const events = await ctx.bus.replay(traceId);

  // Ask LLM to reflect
  const reflection = await ctx.llm.chat({
    system: RUN_REFLECTION_PROMPT,
    messages: [{
      role: 'user',
      content: `Here is the execution trace of a pipeline run.
Analyze what happened and extract learnings. Focus on:
- Patterns that could help future runs
- Mistakes to avoid
- Strategies that worked well
- Any surprises or anomalies

Events: ${JSON.stringify(summarizeEvents(events))}`,
    }],
  });

  // Store each learning in memory
  for (const learning of reflection.learnings) {
    await ctx.memory.store(learning);
  }

  return reflection;
}
```
Reflection triggers (not just at run completion):
| Trigger | When | What to Reflect On |
|---|---|---|
| Run completed | Every run | Full execution trace |
| Phase bounced 2+ times | During run | Why are fixes not sticking? |
| Cost exceeded 50% of budget | During run | Are we being inefficient? |
| Error rate > 10% in a phase | During run | What's going wrong? |
| Human overrode a decision | On human input | What did we get wrong? |
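A sketch of the mid-run trigger checks from the table above (field names on the state object are assumptions):

```typescript
function shouldReflectMidRun(s: {
  bouncesInPhase: number;
  costUsd: number;
  costBudgetUsd: number;
  phaseErrorRate: number;  // errors / tool calls in the current phase
  humanOverrode: boolean;
}): boolean {
  return (
    s.bouncesInPhase >= 2 ||                 // Fixes are not sticking
    s.costUsd > s.costBudgetUsd * 0.5 ||     // Burning budget too fast
    s.phaseErrorRate > 0.10 ||               // Something is going wrong
    s.humanOverrode                          // What did we get wrong?
  );
}
```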
10.5 Loop 4: Human Feedback Loop
This is how the system learns from human behavior — not just explicit feedback, but implicit signals too.
```typescript
// ─── Explicit human feedback ──────────────────────────────

// Human dismisses a review finding
async function onFindingDismissed(findingId: string, reason?: string) {
  const finding = await db.select().from(findings).where(eq(findings.id, findingId));

  // Mark as dismissed
  await db.update(findings)
    .set({ dismissed: true, dismissedBy: reason })
    .where(eq(findings.id, findingId));

  // Decrease confidence in the pattern that generated this finding
  const relatedPatterns = await memory.recall({
    context: `review finding: ${finding.category} in ${finding.file}`,
    type: 'semantic',
  });
  for (const pattern of relatedPatterns) {
    await memory.update(pattern.id, {
      confidence: pattern.confidence - 0.2, // Significant penalty
    });
  }

  // Learn from the dismissal
  await memory.store({
    type: 'semantic',
    content: `Finding "${finding.message}" was dismissed by human. ${reason || 'No reason given.'}`,
    context: `reviewing ${finding.category} issues in ${finding.file}`,
    confidence: 0.7, // Human-sourced = higher confidence
  });

  bus.emit({ type: 'feedback.human_dismissed', payload: { findingId, reason } });
}

// Human approves with modifications
async function onHumanApprovedWithEdits(gateId: string, edits: string) {
  await memory.store({
    type: 'procedural',
    content: `Human approved but made edits: ${edits}`,
    context: `gate ${gateId}`,
    confidence: 0.8,
  });
  bus.emit({ type: 'feedback.human_edited', payload: { gateId, edits } });
}

// ─── Implicit human signals ───────────────────────────────
// Track which suggestions humans actually apply vs ignore
interface ImplicitFeedback {
  // If human applies the suggested fix → boost confidence
  onSuggestedFixApplied(findingId: string): void;

  // If human rewrites the fix differently → learn their preference
  onSuggestedFixRewritten(findingId: string, humanVersion: string): void;

  // If human adds a comment the agent didn't catch → learn the gap
  onHumanAddedComment(pr: string, comment: string): void;

  // Time-to-dismiss: if dismissed within seconds, it was obviously wrong.
  // If dismissed after minutes, it was at least worth considering.
  onDismissalTiming(findingId: string, timeToDecisionMs: number): void;
}
```
10.6 Loop 5: Production Loop (post-MVP, but designed now)
After deployment, production metrics feed back into the system as new tasks or pattern updates.
```typescript
// ─── Post-MVP but the interface is designed now ───────────
interface ProductionFeedback {
  // Monitor detects anomaly correlated with recent deploy
  onAnomalyDetected(anomaly: {
    metric: string;        // "error_rate", "latency_p95"
    baseline: number;
    current: number;
    deploymentId: string;  // Which deploy caused this?
  }): Promise<void>;

  // Error report from production maps back to a code change
  onProductionError(error: {
    stack: string;
    frequency: number;
    firstSeen: Date;
    affectedUsers: number;
    relatedCommit?: string; // Git blame correlation
  }): Promise<void>;
}

// In MVP: these interfaces exist but the implementation is a no-op.
// Post-MVP: they connect to real monitoring and create new pipeline tasks.
```
10.7 Feedback Loop Metrics
How do we know the loops are working?
```typescript
interface FeedbackMetrics {
  // Inner loop health
  avgIterationsPerPhase: number;   // Trending down = agents getting smarter
  toolSuccessRate: number;         // Trending up = better tool selection

  // Phase loop health
  avgBouncesPerRun: number;        // Trending down = better first-pass quality
  bounceResolutionRate: number;    // % of bounces that fix the issue

  // Run loop health
  learningsPerRun: number;         // Are we extracting value?
  learningApplicationRate: number; // % of recalled memories that helped
  memoryPrecision: number;         // Recalled memories that were relevant

  // Human loop health
  findingDismissalRate: number;    // Trending down = fewer false positives
  humanOverrideRate: number;       // Trending down = better autonomous decisions
  timeToHumanResponse: number;     // How fast do humans respond to gates?

  // Cross-run improvement
  costPerRun: number;              // Trending down = efficiency improving
  successRateOverTime: number;     // Trending up = system is learning
  firstPassApprovalRate: number;   // % of reviews approved without bounces
}
```
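Because every signal is an event, most of these metrics fall out of the event trace. A sketch for two of them, derived from a single run's replayed events (event type strings match Sections 9 and 10.3; the helper itself is illustrative):

```typescript
function phaseLoopMetrics(events: ForgeEvent[]): { bounces: number; firstPassApproved: boolean } {
  const bounces = events.filter(e => e.type === 'loop.phase_bounce').length;
  const firstPassApproved = bounces === 0 && events.some(e => e.type === 'run.completed');
  return { bounces, firstPassApproved };
}
```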
10.8 Build Priority for Feedback Loops
This is what makes feedback loops the P0 foundation — they must exist before agents are useful:
Week 1 (build with core):
✓ Event bus (emit/subscribe) — enables all loops to capture signals
✓ Events table in SQLite — persist signals for later analysis
✓ Bus replay — reconstruct what happened in any run
Week 2 (build with memory):
✓ Run reflection — Loop 3 (post-run learning)
✓ Memory store + recall — the destination for all learnings
✓ Confidence scoring — weight learnings by source
Week 3 (build with reviewer):
✓ Phase bounce logic — Loop 2 (review→fix→re-review)
✓ Finding dismissal tracking — Loop 4 (human feedback)
✓ Dismissal → confidence decay — close the human loop
Week 4 (build with tester):
✓ Test failure → fix → retest bounce — Loop 2 extension
✓ Failure pattern memory — learn from repeated test failures
Week 6 (build with orchestrator):
✓ Full phase loop orchestration — Loop 2 with configurable bounces
✓ Reflection triggers (cost, error rate, bounces) — Loop 3 enrichment
✓ Feedback metrics dashboard — measure loop health
Post-MVP:
○ Production monitoring integration — Loop 5
○ Implicit human signal tracking — Loop 4 enrichment
○ Meta-reflection — reflect on reflection quality
11. Memory & Learning (Where Feedback Lands)
How memories flow through the system:
Execution Events
│
▼
┌─ CAPTURE ──────────────────┐
│ Every tool result, error, │
│ human decision, and test │
│ outcome is captured as an │
│ event. │
└──────────┬─────────────────┘
▼
┌─ REFLECT ──────────────────┐
│ After each agent completes: │
│ "What worked? What didn't? │
│ What should we remember?" │
│ │
│ LLM extracts learnings as │
│ structured memories. │
└──────────┬─────────────────┘
▼
┌─ STORE ────────────────────┐
│ Episodic: "PR #42 review │
│ missed a null check" │
│ Semantic: "Auth endpoints │
│ in this repo use JWT" │
│ Procedural: "When tests │
│ fail with mock errors, │
│ add clearAllMocks()" │
└──────────┬─────────────────┘
▼
┌─ CONSOLIDATE ──────────────┐ (runs periodically)
│ Merge similar memories │
│ Decay unused memories │
│ Promote high-frequency │
│ episodes to patterns │
│ Prune low-confidence entries │
└──────────┬─────────────────┘
▼
┌─ RECALL ───────────────────┐
│ When an agent starts: │
│ "What do I know about this │
│ kind of task?" │
│ │
│ Similarity search on │
│ context + tags returns │
│ relevant memories to inject │
│ into the agent's prompt. │
└────────────────────────────┘
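A minimal sketch of the RECALL step, assuming SQLite-stored embeddings and a manual cosine-similarity ranking weighted by confidence (helper names are illustrative):

```typescript
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function recall(query: string, memories: Memory[], llm: LLMProvider, limit = 10): Promise<Memory[]> {
  const q = await llm.embed(query);                 // Embed the agent's context query
  return memories
    .filter(m => m.embedding)                       // Only memories with embeddings
    .map(m => ({ m, score: cosine(q, m.embedding!) * m.confidence })) // Similarity × confidence
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(x => x.m);
}
```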
Confidence & Decay
Confidence starts at:
- 0.5 for LLM-extracted learnings (unvalidated)
- 0.7 for human-confirmed learnings
- 0.9 for learnings from production outcomes
Confidence changes:
- +0.1 each time the pattern is successfully applied
- +0.2 when a human confirms the learning
- -0.05 per week without access (decay)
- -0.2 when a human dismisses a suggestion based on it
Pruning:
- Memories below 0.2 confidence are archived
- Memories not accessed in 90 days are archived
- Conflicting memories: keep highest confidence
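The update rules above as a sketch, applied by the consolidation job (the signal names are assumptions; the deltas match the list):

```typescript
function updateConfidence(
  m: Memory,
  signal: 'applied_successfully' | 'human_confirmed' | 'human_dismissed' | { weeksIdle: number },
): number {
  let c = m.confidence;
  if (signal === 'applied_successfully') c += 0.1;
  else if (signal === 'human_confirmed') c += 0.2;
  else if (signal === 'human_dismissed') c -= 0.2;
  else c -= 0.05 * signal.weeksIdle;        // Time decay without access
  return Math.min(1, Math.max(0, c));       // Clamp to [0, 1]; archive below 0.2
}
```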
12. LLM Provider Abstraction
```typescript
// ─── llm.ts ───────────────────────────────────────────────
// Provider-agnostic. Swap Claude for OpenAI or local models.
interface LLMProvider {
  chat(request: ChatRequest): Promise<ChatResponse>;
  embed(text: string): Promise<Float32Array>;
}

interface ChatRequest {
  system: string;
  messages: Message[];
  tools?: ToolSchema[];
  temperature?: number;
  maxTokens?: number;
}

interface ChatResponse {
  content: string;
  toolCalls?: ToolCall[];
  done: boolean;
  result?: unknown;
  usage: { promptTokens: number; completionTokens: number };
  cost: number; // USD, calculated from model pricing
}

// Model selection by task complexity + remaining budget:
//
//   Planning / Architecture → Claude Sonnet (strong reasoning)
//   Implementation          → Claude Sonnet (code generation)
//   Review (AI layer)       → Claude Haiku  (fast, cheap, good enough)
//   Test analysis           → Claude Haiku
//   Reflection              → Claude Haiku
//   Embedding               → Local model or API (cheap, fast)
```
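The routing table above as a small sketch, with model ids taken from forge.config.ts (the task union and helper are illustrative):

```typescript
type LLMTask = 'planning' | 'implementation' | 'review' | 'test_analysis' | 'reflection';

function pickModel(task: LLMTask, cfg: { model: string; fastModel: string }): string {
  return task === 'planning' || task === 'implementation'
    ? cfg.model      // Strong reasoning / code generation
    : cfg.fastModel; // Fast, cheap: review layer, test analysis, reflection
}
```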
13. Build Order
The research roadmap says "start with feedback loops." I agree, but with a twist: build the skeleton first, then fill in the organs.
Week 1 ── Core Skeleton
├── types.ts (core abstractions from Section 3)
├── bus.ts (event bus)
├── config.ts (safety defaults from Section 8)
├── errors.ts (error taxonomy)
├── schema.ts (Drizzle schema from Section 5)
├── llm.ts (provider abstraction)
└── base.ts (base agent with loop + reflection)
Week 2 ── Memory + Tools
├── store.ts (memory CRUD)
├── episodes.ts + patterns.ts (episodic + semantic memory)
├── registry.ts (tool registry)
├── git.ts, runner.ts, linter.ts (essential tools)
└── First integration test: agent loop with real LLM
Week 3 ── Reviewer Agent (first vertical slice)
├── reviewer.ts (3-layer review: static → security → AI)
├── github.ts (PR integration)
├── Risk scoring
└── Findings persistence + dismissal learning
Week 4 ── Tester Agent
├── tester.ts (test selection + execution + analysis)
├── test-runner.ts (Jest/Vitest integration)
├── Failure analysis with LLM
└── Test gap detection
Week 5 ── Planner + Implementer Agents
├── planner.ts (requirements → plan → tasks)
├── implementer.ts (tasks → code)
├── Self-validation loop (typecheck + test after each change)
└── Feedback from reviewer/tester flows back
Week 6 ── Orchestrator
├── pipeline.ts (state machine: plan → implement → review → test → deploy)
├── checkpoint.ts (save/resume between phases)
├── context.ts (shared state across agents)
├── gates.ts (human approval integration)
└── End-to-end flow: "forge run" works
Week 7 ── CLI + Polish
├── CLI commands: run, review, test, status, history
├── Terminal UI (progress, findings display)
├── forge.config.ts (per-project configuration)
└── Consolidation job (memory pruning + pattern extraction)
Week 8 ── Harden + Document
├── Error recovery (retry, fallback, checkpoint resume)
├── Cost tracking dashboard
├── Real-world testing on actual projects
└── Edge case handling
14. Configuration
One config file per project. Sensible defaults, override what you need.
```typescript
// ─── forge.config.ts ──────────────────────────────────────
import { defineConfig } from 'forge';

export default defineConfig({
  // Project basics
  name: 'my-app',
  language: 'typescript',

  // LLM provider
  llm: {
    provider: 'anthropic',                  // 'anthropic' | 'openai' | 'ollama'
    model: 'claude-sonnet-4-5-20250929',
    fastModel: 'claude-haiku-4-5-20251001', // For cheap tasks
  },

  // Tools
  tools: {
    testCommand: 'bun test',
    lintCommand: 'bun run lint',
    buildCommand: 'bun run build',
    typecheckCommand: 'bun run typecheck',
  },

  // Safety (override defaults from Section 8)
  safety: {
    costPerRun: 50,     // USD max
    costPerDay: 200,
    automationLevel: 1, // 0-4, see Section 8.3
  },

  // GitHub integration
  github: {
    owner: 'myorg',
    repo: 'my-app',
    reviewOnPR: true,   // Auto-review new PRs
    postComments: true, // Post findings as PR comments
  },

  // Memory
  memory: {
    dbPath: '.forge/memory.db', // SQLite database location
    consolidateInterval: '1d',  // Run consolidation daily
    maxMemories: 10_000,        // Prune beyond this
  },
});
```
15. What This Design Explicitly Defers
These are real concerns acknowledged by the research but not in scope for MVP:
| Deferred | Why | When |
|---|---|---|
| Parallel agent execution | Sequential is simpler, prove it works first | Post-MVP |
| Kubernetes deployment | Single Bun process is fine for a tool | If scaling needed |
| Vector database | SQLite with manual similarity is enough to start | When memory > 100K entries |
| Multi-repo intelligence | Focus on single repo first | Q3+ |
| Autonomous deployment | Always require human approval for now | After Level 3 automation |
| Natural language requirements | Start with structured task descriptions | Q3+ |
| Real-time dashboards | CLI output + SQLite queries are enough | When team size > 1 |
| ClickHouse / Kafka | SQLite event table handles MVP observability | At scale |
This design synthesizes research from 13 topics across agentic loops, feedback mechanisms, code review, testing, CI/CD, orchestration, evaluation, self-improvement, reflection, human-AI collaboration, context management, tool integration, and error recovery.