architecture
February 8, 2026

Forge: Agentic SDLC Orchestrator — System Design

An opinionated, buildable design for an AI-driven software development lifecycle tool. Distilled from 13 research topics, 4 synthesis documents, and 70+ interface specifications.


1. What We're Building

Forge is a CLI tool and TypeScript library that orchestrates AI agents through the software development lifecycle. You give it a task — a feature, bug fix, or refactor — and it plans, implements, reviews, tests, and deploys the change. Humans stay in the loop for high-stakes decisions. The system learns from every execution.

It is not a distributed microservice platform. It's a single Bun process backed by SQLite that coordinates LLM calls, tool executions, and human checkpoints through a pipeline.

Design Principles

  1. Start simple, earn complexity — Sequential pipeline first, parallel agents later
  2. Learn from everything — Every execution feeds the memory system
  3. Safe by default — Circuit breakers and human gates baked in, not bolted on
  4. Observable — Every decision logged with rationale, every action attributed
  5. Tool-agnostic — Swap LLM providers, CI systems, or git hosts without rewriting agents

2. Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                          CLI / API                                │
│  forge run "add user auth"    forge review PR#42    forge test    │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR                                │
│                                                                   │
│   Pipeline: Plan → Implement → Review → Test → Deploy             │
│   State Machine  ·  Checkpoints  ·  Human Gates                   │
│                                                                   │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │                    EVENT BUS                              │    │
│   │  emit()  ·  on()  ·  replay()  ·  snapshot()             │    │
│   └─────────────────────────────────────────────────────────┘    │
└───────┬──────────┬──────────┬──────────┬──────────┬──────────────┘
        │          │          │          │          │
        ▼          ▼          ▼          ▼          ▼
   ┌─────────┐┌─────────┐┌─────────┐┌─────────┐┌─────────┐
   │ Planner ││Implement││Reviewer ││ Tester  ││Deployer │
   │  Agent  ││  Agent  ││  Agent  ││  Agent  ││  Agent  │
   └────┬────┘└────┬────┘└────┬────┘└────┬────┘└────┬────┘
        │          │          │          │          │
        └──────────┴──────────┴──────────┴──────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                       TOOL LAYER                                  │
│                                                                   │
│   ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐        │
│   │ LLM  │ │ Git  │ │GitHub│ │ Test │ │ Lint │ │ Shell│        │
│   │Client│ │  Ops │ │  API │ │Runner│ │/Fmt  │ │ Exec │        │
│   └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘        │
└──────────────────────────────┬───────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────────┐
│                       MEMORY LAYER                                │
│                                                                   │
│   ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐    │
│   │ Episodic  │  │ Semantic  │  │Procedural │  │  Events   │    │
│   │ (what     │  │ (patterns │  │ (how to   │  │ (audit    │    │
│   │  happened)│  │  & facts) │  │  do stuff)│  │  trail)   │    │
│   └───────────┘  └───────────┘  └───────────┘  └───────────┘    │
│                                                                   │
│                     SQLite via Drizzle ORM                        │
└──────────────────────────────────────────────────────────────────┘

3. Core Abstractions

Everything in the system is built on six core types; every other type is composed from them.

typescript
// ─── The Agent Loop ───────────────────────────────────────
// Every agent follows the same cycle: perceive → reason → act → learn.
// The orchestrator runs agents. Agents run tools. Tools do work.
interface Agent {
  id: string;
  type: AgentType;
  /** Run one cycle of the agent loop */
  execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput>;
}

// ─── The Event ────────────────────────────────────────────
// Everything that happens is an event. Events are the source of truth.
// The memory system, audit trail, and observability all consume events.
interface ForgeEvent {
  id: string;
  traceId: string;              // Groups events in one pipeline run
  timestamp: Date;
  source: string;               // Which agent/component emitted this
  type: string;                 // Dot-namespaced: "review.finding", "test.failed"
  payload: unknown;
  cost?: { tokens: number; usd: number };
}

// ─── The Tool ─────────────────────────────────────────────
// Tools are the hands of agents. An agent reasons about what to do,
// then executes a tool to do it. Tools are sandboxed and audited.
interface Tool<TInput = unknown, TOutput = unknown> {
  name: string;
  description: string;
  schema: { input: ZodSchema<TInput>; output: ZodSchema<TOutput> };
  execute(input: TInput, ctx: ToolContext): Promise<TOutput>;
}

// ─── The Phase ────────────────────────────────────────────
// The pipeline is a sequence of phases. Each phase has an agent,
// input/output types, and safety controls.
interface Phase {
  name: PhaseName;
  agent: Agent;
  guards: Guard[];              // Pre-conditions to enter this phase
  gates: HumanGate[];           // Human approval checkpoints
  breakers: CircuitBreaker[];   // Safety limits
  next: PhaseName | null;
}

// ─── The Memory ───────────────────────────────────────────
// Memories are what the system learned. They have types, relevance
// scores, and decay over time if not reinforced.
interface Memory {
  id: string;
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;
  embedding?: Float32Array;     // For similarity search
  confidence: number;           // 0-1, decays without reinforcement
  context: string;              // When is this memory relevant?
  createdAt: Date;
  lastAccessed: Date;
  accessCount: number;
}

// ─── The Checkpoint ───────────────────────────────────────
// Checkpoints save pipeline state between phases. If something fails,
// we can resume from the last checkpoint instead of starting over.
interface Checkpoint {
  id: string;
  traceId: string;
  phase: PhaseName;
  state: Record<string, unknown>;   // Serialized phase outputs so far
  timestamp: Date;
}

4. Module Map

forge/
│
├── src/
│   ├── core/                    # Foundation — build this first
│   │   ├── types.ts             # Core types from Section 3
│   │   ├── bus.ts               # In-memory event bus
│   │   ├── config.ts            # Runtime configuration + defaults
│   │   └── errors.ts            # Error taxonomy (source, severity, recoverability)
│   │
│   ├── safety/                  # Guardrails — build alongside core
│   │   ├── breakers.ts          # Circuit breakers (iteration, cost, time, error-rate)
│   │   ├── gates.ts             # Human approval gates
│   │   └── budget.ts            # Cost tracking and limits
│   │
│   ├── memory/                  # Learning foundation — Week 1-2
│   │   ├── schema.ts            # Drizzle SQLite schema
│   │   ├── store.ts             # Memory CRUD + similarity search
│   │   ├── episodes.ts          # Episodic memory (what happened)
│   │   ├── patterns.ts          # Semantic memory (pattern extraction)
│   │   ├── procedures.ts        # Procedural memory (strategies that work)
│   │   └── consolidate.ts       # Knowledge consolidation + pruning
│   │
│   ├── tools/                   # Tool layer — Week 1-2
│   │   ├── registry.ts          # Tool registry + discovery
│   │   ├── sandbox.ts           # Execution sandboxing
│   │   ├── llm.ts               # LLM provider abstraction
│   │   ├── git.ts               # Git operations
│   │   ├── github.ts            # GitHub API (PRs, reviews, webhooks)
│   │   ├── runner.ts            # Shell command execution
│   │   ├── linter.ts            # ESLint/Biome integration
│   │   └── test-runner.ts       # Jest/Vitest execution + parsing
│   │
│   ├── agents/                  # Agent implementations — Week 3+
│   │   ├── base.ts              # Base agent with loop, reflection, safety
│   │   ├── planner.ts           # Requirements → architecture → tasks
│   │   ├── implementer.ts       # Tasks → code
│   │   ├── reviewer.ts          # Code → findings + risk score
│   │   ├── tester.ts            # Code → tests → results → analysis
│   │   └── deployer.ts          # Artifact → canary → rollout
│   │
│   ├── orchestrator/            # Pipeline coordination — Week 7-8
│   │   ├── pipeline.ts          # Phase sequencing state machine
│   │   ├── checkpoint.ts        # State persistence between phases
│   │   └── context.ts           # Shared context across agents
│   │
│   └── cli/                     # User interface
│       ├── index.ts             # CLI entry point
│       ├── commands/            # run, review, test, status, etc.
│       └── ui.ts                # Terminal output formatting
│
├── drizzle/                     # Database migrations
├── forge.config.ts              # Project-level configuration
└── package.json

5. Data Model (SQLite / Drizzle)

This is the ground truth. Everything the system knows lives here.

typescript
// ─── schema.ts ────────────────────────────────────────────
import { sqliteTable, text, integer, real, blob } from 'drizzle-orm/sqlite-core';

// ─── Events (Append-only audit trail) ─────────────────────
export const events = sqliteTable('events', {
  id:         text('id').primaryKey(),                    // ulid
  traceId:    text('trace_id').notNull(),                 // groups one pipeline run
  timestamp:  integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
  source:     text('source').notNull(),                   // agent or component id
  type:       text('type').notNull(),                     // "plan.started", "review.finding", etc.
  phase:      text('phase'),                              // current pipeline phase
  payload:    text('payload', { mode: 'json' }),          // event-specific data
  tokensUsed: integer('tokens_used'),
  costUsd:    real('cost_usd'),
  durationMs: integer('duration_ms'),
});

// ─── Memories (What the system has learned) ───────────────
export const memories = sqliteTable('memories', {
  id:           text('id').primaryKey(),
  type:         text('type').notNull(),                   // episodic | semantic | procedural
  content:      text('content').notNull(),                // Human-readable description
  context:      text('context').notNull(),                // When is this relevant?
  embedding:    blob('embedding'),                        // Float32Array for similarity search
  confidence:   real('confidence').notNull(),             // 0.0 - 1.0
  source:       text('source'),                           // What event created this?
  tags:         text('tags', { mode: 'json' }),           // ["typescript", "testing", "auth"]
  createdAt:    integer('created_at', { mode: 'timestamp_ms' }).notNull(),
  lastAccessed: integer('last_accessed', { mode: 'timestamp_ms' }).notNull(),
  accessCount:  integer('access_count').notNull().default(0),
});

// ─── Patterns (Extracted from episodes) ───────────────────
export const patterns = sqliteTable('patterns', {
  id:          text('id').primaryKey(),
  type:        text('type').notNull(),                    // success | failure | approach
  trigger:     text('trigger').notNull(),                 // What situation activates this?
  pattern:     text('pattern').notNull(),                 // The pattern itself
  resolution:  text('resolution'),                        // What to do when triggered
  frequency:   integer('frequency').notNull().default(1),
  successRate: real('success_rate'),                      // How often this works
  confidence:  real('confidence').notNull(),
  lastSeen:    integer('last_seen', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Checkpoints (Pipeline state snapshots) ───────────────
export const checkpoints = sqliteTable('checkpoints', {
  id:        text('id').primaryKey(),
  traceId:   text('trace_id').notNull(),
  phase:     text('phase').notNull(),
  state:     text('state', { mode: 'json' }).notNull(),
  timestamp: integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Runs (Pipeline execution history) ────────────────────
export const runs = sqliteTable('runs', {
  id:           text('id').primaryKey(),                  // = traceId
  task:         text('task').notNull(),                   // Human description of what was requested
  status:       text('status').notNull(),                 // pending | running | completed | failed | cancelled
  currentPhase: text('current_phase'),
  config:       text('config', { mode: 'json' }),         // Runtime config snapshot
  startedAt:    integer('started_at', { mode: 'timestamp_ms' }).notNull(),
  completedAt:  integer('completed_at', { mode: 'timestamp_ms' }),
  totalCostUsd: real('total_cost_usd').default(0),
  totalTokens:  integer('total_tokens').default(0),
  error:        text('error'),                            // If failed, why
});

// ─── Findings (Review/test issues) ────────────────────────
export const findings = sqliteTable('findings', {
  id:          text('id').primaryKey(),
  runId:       text('run_id').notNull(),
  phase:       text('phase').notNull(),                   // review | test
  severity:    text('severity').notNull(),                // info | warning | error | critical
  category:    text('category').notNull(),                // style | security | correctness | performance
  message:     text('message').notNull(),
  file:        text('file'),
  line:        integer('line'),
  confidence:  real('confidence'),
  fixable:     integer('fixable', { mode: 'boolean' }),
  fix:         text('fix'),                               // Suggested code change
  dismissed:   integer('dismissed', { mode: 'boolean' }).default(false),
  dismissedBy: text('dismissed_by'),                      // Who dismissed and why — for learning
});
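
A quick sketch of how this schema gets queried in practice. It assumes the same DrizzleDB handle type used in bus.ts; runTotals is an illustrative helper, not part of the module map:

typescript
// Sketch: recompute a run's cost and token totals from its events.
import { eq } from 'drizzle-orm';

async function runTotals(db: DrizzleDB, traceId: string) {
  const rows = await db.select().from(events).where(eq(events.traceId, traceId));
  // Fold nullable per-event metrics into run-level totals
  return rows.reduce(
    (acc, e) => ({
      costUsd: acc.costUsd + (e.costUsd ?? 0),
      tokens: acc.tokens + (e.tokensUsed ?? 0),
    }),
    { costUsd: 0, tokens: 0 },
  );
}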

6. The Agent Loop

Every agent runs the same core loop. The only things that change are the available tools and the reasoning prompt.

  ┌──────────────────────────────────────────────────┐
  │                   AGENT LOOP                      │
  │                                                   │
  │   ┌──────────┐                                   │
  │   │ PERCEIVE │ ← Gather context:                 │
  │   │          │   - Task/phase input               │
  │   │          │   - Relevant memories              │
  │   │          │   - Previous iteration results     │
  │   └────┬─────┘                                   │
  │        ▼                                          │
  │   ┌──────────┐                                   │
  │   │  REASON  │ ← LLM decides:                   │
  │   │          │   - What tool to use next          │
  │   │          │   - Or: task is complete            │
  │   │          │   - Or: need human input            │
  │   └────┬─────┘                                   │
  │        ▼                                          │
  │   ┌──────────┐                                   │
  │   │   ACT    │ ← Execute tool:                   │
  │   │          │   - Validate input (Zod)           │
  │   │          │   - Run in sandbox                 │
  │   │          │   - Capture result + metrics        │
  │   └────┬─────┘                                   │
  │        ▼                                          │
  │   ┌──────────┐                                   │
  │   │  LEARN   │ ← After each iteration:           │
  │   │          │   - Log event to bus               │
  │   │          │   - Check circuit breakers          │
  │   │          │   - Update working memory           │
  │   │          │   - Reflect if error occurred        │
  │   └────┬─────┘                                   │
  │        │                                          │
  │        ├── Continue? ──▶ Loop back to PERCEIVE    │
  │        ├── Done? ──────▶ Return PhaseOutput       │
  │        ├── Stuck? ─────▶ Escalate to human        │
  │        └── Breaker? ───▶ Halt with error          │
  └──────────────────────────────────────────────────┘
typescript
// ─── base.ts ──────────────────────────────────────────────
abstract class BaseAgent implements Agent {
  abstract type: AgentType;
  abstract tools: Tool[];
  abstract systemPrompt: string;

  async execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput> {
    let iteration = 0;
    let workingMemory = await this.perceive(input, ctx);

    while (true) {
      iteration++;

      // ── Safety check ──
      const breakerResult = ctx.safety.check({ iteration, cost: ctx.cost, elapsed: ctx.elapsed });
      if (breakerResult.shouldBreak) {
        ctx.bus.emit({ type: `${this.type}.breaker_tripped`, payload: breakerResult });
        throw new CircuitBreakerError(breakerResult);
      }

      // ── Reason: ask LLM what to do ──
      const decision = await ctx.llm.chat({
        system: this.systemPrompt,
        messages: workingMemory.messages,
        tools: this.tools.map(t => t.schema),
      });

      // ── Done? ──
      if (decision.done) {
        const output = decision.result as PhaseOutput;
        ctx.bus.emit({ type: `${this.type}.completed`, payload: output });
        await this.reflect(ctx, 'success');
        return output;
      }

      // ── Act: execute the chosen tool ──
      const tool = this.tools.find(t => t.name === decision.toolCall.name);
      const result = await this.executeTool(tool, decision.toolCall.input, ctx);

      // ── Learn: update context ──
      workingMemory = this.updateWorkingMemory(workingMemory, decision, result);
      if (result.error) {
        await this.reflect(ctx, 'error', result.error);
      }
    }
  }

  private async perceive(input: PhaseInput, ctx: AgentContext): Promise<WorkingMemory> {
    const relevantMemories = await ctx.memory.recall({
      context: input.task,
      type: this.type,
      limit: 10,
    });
    return {
      messages: [
        { role: 'user', content: this.buildPrompt(input, relevantMemories) },
      ],
    };
  }

  private async reflect(ctx: AgentContext, outcome: string, error?: Error) {
    // Post-execution reflection: extract learnings
    const reflection = await ctx.llm.chat({
      system: REFLECTION_PROMPT,
      messages: [{
        role: 'user',
        content: `Outcome: ${outcome}. ${error ? `Error: ${error.message}` : ''}\nWhat should we remember for next time?`,
      }],
    });

    if (reflection.learnings) {
      for (const learning of reflection.learnings) {
        await ctx.memory.store({
          type: 'procedural',
          content: learning.content,
          context: learning.context,
          confidence: learning.confidence,
        });
      }
    }
  }
}

7. Agent Designs

7.1 Planner Agent

Input: Natural language task description
Output: Implementation plan with architecture, tasks, risk assessment
Tools: read_file, glob, grep, llm_analyze

"Add user authentication"
        │
        ▼
  ┌─ PERCEIVE ─┐
  │ Read existing codebase structure         │
  │ Recall patterns for "auth" from memory   │
  │ Check for existing auth utilities         │
  └──────┬──────┘
         ▼
  ┌── REASON ──┐
  │ Decompose into tasks:                    │
  │  1. Design auth schema                   │
  │  2. Create login/register endpoints      │
  │  3. Add session middleware               │
  │  4. Protect routes                       │
  │ Estimate risk: MEDIUM (new feature)      │
  │ Identify dependencies                    │
  └──────┬──────┘
         ▼
  OUTPUT: ImplementationPlan {
    architecture: { components, interfaces, decisions }
    tasks: Task[]          // ordered, with dependencies
    risk: RiskAssessment   // determines review depth
    estimates: { complexity, effort }
  }

7.2 Implementer Agent

Input: Implementation plan + task list
Output: Code changes (files modified/created)
Tools: read_file, write_file, run_command, llm_generate, search_code

ImplementationPlan
        │
        ▼
  For each task (sequential MVP, parallel later):
  ┌─ PERCEIVE ─┐
  │ Read target files                         │
  │ Understand existing patterns              │
  │ Load procedural memories for this domain  │
  └──────┬──────┘
         ▼
  ┌── REASON ──┐
  │ Generate code change                     │
  │ Self-validate: does this match the spec? │
  │ Check for obvious issues                 │
  └──────┬──────┘
         ▼
  ┌─── ACT ────┐
  │ Write files                              │
  │ Run typecheck                            │
  │ Run affected tests                       │
  │ Fix issues if found, loop back           │
  └──────┬──────┘
         ▼
  OUTPUT: CodeChanges {
    files: FileChange[]    // path, before, after
    testsAdded: string[]   // new test files
    validated: boolean     // typecheck + tests pass
  }

7.3 Reviewer Agent

Input: Code changes (diff)
Output: Review with findings, risk score, gate decision
Tools: run_linter, run_security_scan, llm_review, read_file

CodeChanges
        │
        ▼
  ┌─ Layer 1: Static Analysis ──┐  (fast, cheap, deterministic)
  │ ESLint / Biome               │
  │ TypeScript strict check      │
  │ Formatting check             │
  └──────────┬───────────────────┘
             ▼
  ┌─ Layer 2: Security Scan ────┐  (fast, important)
  │ Secret detection             │
  │ Dependency vulnerability     │
  │ Known insecure patterns      │
  └──────────┬───────────────────┘
             ▼
  ┌─ Layer 3: AI Review ────────┐  (slow, expensive — only if risk > low)
  │ Logic correctness            │
  │ Edge cases                   │
  │ Performance implications     │
  │ Architecture fit             │
  └──────────┬───────────────────┘
             ▼
  ┌─ Synthesis ─────────────────┐
  │ Deduplicate findings         │
  │ Calculate risk score         │
  │ Determine gate decision      │
  └──────────┬───────────────────┘
             ▼
  OUTPUT: ReviewResult {
    findings: Finding[]
    riskScore: { total, level: 'low'|'medium'|'high'|'critical' }
    decision: 'approve' | 'request_changes' | 'require_human'
  }

Risk-based review depth:

| Risk Level | Static | Security | AI Review | Human Required  |
|------------|--------|----------|-----------|-----------------|
| Low        | Yes    | Yes      | No        | No              |
| Medium     | Yes    | Yes      | Yes       | Optional        |
| High       | Yes    | Yes      | Yes       | Yes             |
| Critical   | Yes    | Yes      | Yes       | Yes + Architect |
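
A minimal sketch of how the reviewer could turn that table into code. RiskLevel matches the risk score levels above; ReviewDepth and REVIEW_DEPTH are illustrative names, not part of the spec:

typescript
// Sketch: map a risk level to the review layers to run.
type RiskLevel = 'low' | 'medium' | 'high' | 'critical';

interface ReviewDepth {
  static: boolean;
  security: boolean;
  aiReview: boolean;
  humanRequired: boolean | 'optional';
}

const REVIEW_DEPTH: Record<RiskLevel, ReviewDepth> = {
  low:      { static: true, security: true, aiReview: false, humanRequired: false },
  medium:   { static: true, security: true, aiReview: true,  humanRequired: 'optional' },
  high:     { static: true, security: true, aiReview: true,  humanRequired: true },
  critical: { static: true, security: true, aiReview: true,  humanRequired: true }, // plus architect sign-off
};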

7.4 Tester Agent

Input: Code changes + existing test suite
Output: Test results + failure analysis + coverage
Tools: run_tests, llm_analyze, read_file, write_file

CodeChanges
        │
        ▼
  ┌─ Select ────────────────────┐
  │ Which tests to run?          │
  │  - Tests covering changed files (always)
  │  - Related integration tests (if medium+ risk)
  │  - Full suite (if high+ risk)
  └──────────┬───────────────────┘
             ▼
  ┌─ Execute ───────────────────┐
  │ Run selected tests           │
  │ Retry failures once (flaky?) │
  │ Collect coverage             │
  └──────────┬───────────────────┘
             ▼
  ┌─ Analyze ───────────────────┐
  │ If failures:                 │
  │  - Classify: real bug vs flaky vs env issue
  │  - Root cause analysis (LLM)
  │  - Suggest fix              │
  │ If low coverage:            │
  │  - Identify gaps            │
  │  - Generate missing tests   │
  └──────────┬───────────────────┘
             ▼
  OUTPUT: TestResult {
    summary: { total, passed, failed, skipped }
    coverage: { line, branch, function, diff }
    failures: FailureAnalysis[]    // with root cause + fix
    generatedTests: TestFile[]     // new tests if gaps found
  }

7.5 Deployer Agent

Input: Validated code + test results
Output: Deployment status
Tools: run_command, github_api, read_file

ValidatedCode + TestResults
        │
        ▼
  ┌─ Build ─────────────────────┐
  │ Run build command            │
  │ Verify artifact              │
  └──────────┬───────────────────┘
             ▼
  ┌─ Gate: Human Approval ──────┐  (always for production)
  │ Show summary:                │
  │  - What changed              │
  │  - Risk score                │
  │  - Test results              │
  │  - Findings                  │
  │ Wait for approval            │
  └──────────┬───────────────────┘
             ▼
  ┌─ Deploy ────────────────────┐
  │ Strategy based on risk:      │
  │  - Low: direct deploy        │
  │  - Medium: canary (5%→25%→100%)
  │  - High: canary (5%→10%→25%→50%→100%)
  └──────────┬───────────────────┘
             ▼
  ┌─ Verify ────────────────────┐
  │ Health check endpoints       │
  │ Error rate vs baseline       │
  │ Latency vs baseline          │
  │ Auto-rollback if unhealthy   │
  └──────────┬───────────────────┘
             ▼
  OUTPUT: DeploymentResult {
    status: 'healthy' | 'degraded' | 'rolled_back'
    metrics: { errorRate, latency, throughput }
    url: string
  }

8. Safety System

8.1 Circuit Breakers

Four breakers run continuously. Any one can halt execution.

typescript
// Default safety limits (override per project in forge.config.ts, Section 14)
export const DEFAULT_SAFETY = {
  breakers: {
    iteration: {
      default: 10,
      planning: 20,
      implementation: 50,
      testing: 5,
      deployment: 3,
      stagnationThreshold: 3,        // Consecutive iterations with no progress
    },
    cost: {                          // USD
      perPhase: {
        planning: 5,
        implementation: 10,
        review: 2,
        testing: 3,
        deployment: 2,
      },
      perRun: 50,
      perDay: 200,
    },
    time: {                          // milliseconds
      planning: 30 * 60_000,         // 30 min
      implementation: 60 * 60_000,   // 1 hour
      review: 30 * 60_000,           // 30 min
      testing: 20 * 60_000,          // 20 min
      deployment: 15 * 60_000,       // 15 min
      totalPipeline: 120 * 60_000,   // 2 hours
    },
    errorRate: {
      window: 5 * 60_000,            // 5 minute sliding window
      warning: 0.10,                 // 10%
      critical: 0.25,                // 25% → halt
    },
  },
} as const;
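
A minimal sketch of the check that runs before every agent iteration, using per-phase limits derived from the defaults above. PhaseLimits, RunState, and checkBreakers are illustrative shapes, not the real breakers.ts API:

typescript
// Sketch: one breaker check, evaluated at the top of each agent iteration.
interface PhaseLimits {
  maxIterations: number;
  maxCostUsd: number;
  maxDurationMs: number;
  criticalErrorRate: number;   // e.g. 0.25 from the error-rate breaker
}

interface RunState {
  iteration: number;
  costUsd: number;
  elapsedMs: number;
  recentErrorRate: number;     // errors / attempts within the sliding window
}

interface BreakerResult {
  shouldBreak: boolean;
  reason?: string;
}

function checkBreakers(state: RunState, limits: PhaseLimits): BreakerResult {
  if (state.iteration > limits.maxIterations)
    return { shouldBreak: true, reason: 'iteration limit exceeded' };
  if (state.costUsd > limits.maxCostUsd)
    return { shouldBreak: true, reason: 'phase cost budget exceeded' };
  if (state.elapsedMs > limits.maxDurationMs)
    return { shouldBreak: true, reason: 'phase time budget exceeded' };
  if (state.recentErrorRate >= limits.criticalErrorRate)
    return { shouldBreak: true, reason: 'error rate above critical threshold' };
  return { shouldBreak: false };
}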

8.2 Human Gates

typescript
// Gates are checkpoints where execution pauses for human input.
// They fire based on conditions, not every time.
const GATES: HumanGate[] = [
  {
    id: 'architecture_approval',
    phase: 'planning',
    condition: (plan) => plan.risk.level === 'high' || plan.risk.level === 'critical',
    prompt: 'Review proposed architecture before implementation begins.',
    timeout: 24 * 60 * 60_000,   // 24 hours
  },
  {
    id: 'production_deploy',
    phase: 'deployment',
    condition: (ctx) => ctx.environment === 'production',
    prompt: 'Approve production deployment.',
    timeout: 60 * 60_000,        // 1 hour
  },
  {
    id: 'security_findings',
    phase: 'review',
    condition: (review) => review.findings.some(f => f.severity === 'critical' && f.category === 'security'),
    prompt: 'Critical security finding requires human review.',
    timeout: 12 * 60 * 60_000,   // 12 hours
  },
  {
    id: 'cost_overrun',
    phase: '*',
    condition: (ctx) => ctx.cost.current > ctx.cost.budget * 0.8,
    prompt: 'Approaching cost budget. Continue?',
    timeout: 2 * 60 * 60_000,    // 2 hours
  },
];

8.3 Automation Ladder

The system starts conservative and earns autonomy based on track record.

Level 0 ─── Human does everything (current state)
  │
  │  After: system deployed, basic metrics working
  ▼
Level 1 ─── AI suggests, human decides
  │         - Review comments are suggestions only
  │         - Test failures analyzed but human fixes
  │         - Deploy requires explicit approval
  │
  │  After: false positive rate < 20%, 50+ successful runs
  ▼
Level 2 ─── AI acts, human reviews
  │         - Auto-fix formatting and simple lint issues
  │         - Auto-approve low-risk reviews
  │         - Still requires human for medium+ risk
  │
  │  After: 200+ runs, <5% false positive rate, 0 missed critical bugs
  ▼
Level 3 ─── AI acts, human notified
  │         - Auto-merge low-risk PRs
  │         - Auto-deploy to staging
  │         - Human notified, can override within window
  │
  │  After: 500+ runs, proven safety record
  ▼
Level 4 ─── Full autonomy (low-risk only)
            - Fully autonomous for low-risk changes
            - Human gates remain for medium+ risk
            - Human gates ALWAYS remain for production deploys
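
A small sketch of how the ladder could gate autonomous actions at runtime. The action names mirror the levels above; MIN_LEVEL and isAllowed are illustrative, not the actual safety module API:

typescript
// Sketch: gate an autonomous action by the configured automation level.
type AutomationLevel = 0 | 1 | 2 | 3 | 4;

type AutonomousAction =
  | 'post_review_suggestion'
  | 'auto_fix_lint'
  | 'auto_approve_low_risk_review'
  | 'auto_merge_low_risk_pr'
  | 'auto_deploy_staging'
  | 'auto_deploy_production';

// Minimum level at which the system may take the action without asking first.
// null = never autonomous (production deploys always keep a human gate).
const MIN_LEVEL: Record<AutonomousAction, AutomationLevel | null> = {
  post_review_suggestion: 1,
  auto_fix_lint: 2,
  auto_approve_low_risk_review: 2,
  auto_merge_low_risk_pr: 3,
  auto_deploy_staging: 3,
  auto_deploy_production: null,
};

function isAllowed(action: AutonomousAction, level: AutomationLevel): boolean {
  const required = MIN_LEVEL[action];
  return required !== null && level >= required;
}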

9. Event Bus & Observability

typescript
// ─── bus.ts ───────────────────────────────────────────────
// Simple in-memory pub/sub. Events are also persisted to SQLite.
class EventBus {
  private handlers = new Map<string, Set<EventHandler>>();
  private db: DrizzleDB;

  async emit(event: Omit<ForgeEvent, 'id' | 'timestamp'>): Promise<void> {
    const full: ForgeEvent = {
      ...event,
      id: ulid(),
      timestamp: new Date(),
    };

    // Persist to SQLite (append-only)
    await this.db.insert(events).values(full);

    // Notify subscribers
    const typeHandlers = this.handlers.get(event.type) ?? new Set();
    const wildcardHandlers = this.handlers.get('*') ?? new Set();
    for (const handler of [...typeHandlers, ...wildcardHandlers]) {
      handler(full);
    }
  }

  on(type: string, handler: EventHandler): () => void {
    if (!this.handlers.has(type)) this.handlers.set(type, new Set());
    this.handlers.get(type)!.add(handler);
    // Return an unsubscribe function
    return () => this.handlers.get(type)?.delete(handler);
  }

  async replay(traceId: string): Promise<ForgeEvent[]> {
    return this.db
      .select()
      .from(events)
      .where(eq(events.traceId, traceId))
      .orderBy(events.timestamp);
  }
}

Key events emitted by the system:

| Event Type       | Source       | When                  |
|-------------------|--------------|-----------------------|
| run.started       | Orchestrator | Pipeline begins       |
| phase.entered     | Orchestrator | Each phase transition |
| agent.iteration   | Base Agent   | Each loop iteration   |
| tool.executed     | Tool Layer   | Each tool call        |
| finding.detected  | Reviewer     | Issue found in code   |
| test.failed       | Tester       | Test failure          |
| gate.requested    | Safety       | Human approval needed |
| gate.approved     | Safety       | Human approved        |
| breaker.tripped   | Safety       | Circuit breaker fired |
| memory.stored     | Memory       | New learning saved    |
| run.completed     | Orchestrator | Pipeline finished     |
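
A usage sketch against the bus above. It assumes bus is the orchestrator's shared EventBus instance; the wildcard logger and cost accumulator are illustrative handlers:

typescript
// Sketch: typical subscriptions to the event bus.
declare const bus: EventBus;

// Wildcard subscriber: one structured log line per event
const unsubscribeLog = bus.on('*', (e) => {
  console.log(`[${e.timestamp.toISOString()}] ${e.source} ${e.type}`);
});

// Targeted subscriber: accumulate spend as tool calls complete
let spentUsd = 0;
bus.on('tool.executed', (e) => {
  spentUsd += e.cost?.usd ?? 0;
});

// Later, when shutting down:
unsubscribeLog();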

10. Feedback Loops (The Nervous System)

Feedback loops are the P0 foundation — the first thing to build because everything else depends on information flowing back to where it's useful. Five distinct loops operate at different timescales.

10.1 The Five Loops

┌─────────────────────────────────────────────────────────────────────────┐
│                        FEEDBACK LOOP MAP                                 │
│                                                                          │
│  LOOP 1: INNER (ms)          Within one agent iteration                  │
│  ┌──────────────────────────────────────────────┐                       │
│  │  Tool call → Result → Reason → Adjust → Next │ ←── Tightest loop    │
│  └──────────────────────────────────────────────┘                       │
│                                                                          │
│  LOOP 2: PHASE (min)         Between pipeline phases                     │
│  ┌──────────────────────────────────────────────────────────┐           │
│  │  Implement → Review → Findings → Fix → Re-review         │           │
│  │  Implement → Test → Failures → Fix → Re-test             │           │
│  └──────────────────────────────────────────────────────────┘           │
│                                                                          │
│  LOOP 3: RUN (min-hr)        After a full pipeline completes             │
│  ┌──────────────────────────────────────────────────────────────┐       │
│  │  Run completes → Reflect → Extract patterns → Store memory   │       │
│  │  → Next run recalls memories → Better decisions              │       │
│  └──────────────────────────────────────────────────────────────┘       │
│                                                                          │
│  LOOP 4: HUMAN (hr-days)     Human feedback integration                  │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │  Agent suggests → Human dismisses → Confidence decreases         │   │
│  │  Agent suggests → Human approves with edits → Learn preferences  │   │
│  │  Agent misses issue → Human catches it → New pattern learned      │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  LOOP 5: PRODUCTION (days)   Deployed code feedback                      │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │  Deploy → Monitor → Anomaly detected → Correlate with change     │   │
│  │  → Generate bug report → Feed into planning as new task          │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

10.2 Loop 1: Inner Loop (already in agent design)

This is the perceive → reason → act → learn cycle inside every agent. Each tool call result feeds directly back into the next LLM reasoning step. No special infrastructure needed — it's the agent loop itself.

10.3 Loop 2: Phase Loop (the bounce-back)

When Review or Test finds issues, the pipeline doesn't just fail — it bounces back to the Implementer to fix things. This is the most important loop for code quality.

typescript
// ─── In the orchestrator pipeline ─────────────────────────
interface PhaseLoopConfig {
  maxBounces: number;   // How many review→fix→review cycles allowed
  phases: {
    // After review, if changes requested, bounce back to implement
    review: {
      onChangesRequested: 'implementation',   // Go back to this phase
      maxBounces: 3,
    },
    // After test, if failures found and auto-fixable, bounce back
    testing: {
      onFailure: 'implementation',
      maxBounces: 2,
    },
  };
}

// How the orchestrator handles bounces:
async function runPipelineWithBounces(task: string, ctx: PipelineContext) {
  const plan = await runPhase('planning', { task }, ctx);
  let code = await runPhase('implementation', plan, ctx);

  // Review loop: implement → review → fix → re-review (max 3x)
  let review;
  let reviewBounces = 0;
  while (reviewBounces < 3) {
    review = await runPhase('review', code, ctx);
    if (review.decision === 'approve') break;
    if (review.decision === 'require_human') {
      await ctx.gates.requestHumanReview(review);
      break;
    }

    // Bounce back: feed findings to implementer
    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'review',
      to: 'implementation',
      bounce: ++reviewBounces,
      findings: review.findings,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFindings: review.findings,   // ← Feedback flows here
    }, ctx);
  }

  // Test loop: implement → test → fix → re-test (max 2x)
  let tests;
  let testBounces = 0;
  while (testBounces < 2) {
    tests = await runPhase('testing', code, ctx);
    if (tests.summary.failed === 0) break;

    // Only auto-fix if failures are analyzable
    const fixable = tests.failures.filter(f => f.suggestedFix && f.confidence > 0.7);
    if (fixable.length === 0) {
      await ctx.gates.requestHumanHelp(tests.failures);
      break;
    }

    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'testing',
      to: 'implementation',
      bounce: ++testBounces,
      failures: fixable,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFailures: fixable,           // ← Feedback flows here
    }, ctx);
  }

  // Deploy only if all gates pass
  await runPhase('deployment', { code, review, tests }, ctx);
}

Key insight: The phase loop carries structured feedback — not just "it failed" but Finding[] and FailureAnalysis[] with specific file/line locations, root causes, and suggested fixes. This is what makes the fix cycle productive rather than a blind retry.

10.4 Loop 3: Run Loop (post-run reflection)

After an entire pipeline run completes (or fails), the system reflects on the whole execution to extract durable learnings.

typescript
// ─── Triggered automatically after every pipeline run ─────
interface RunReflection {
  // What happened in this run?
  summary: {
    task: string;
    outcome: 'success' | 'failure' | 'partial';
    phases: PhaseOutcome[];
    totalCost: number;
    totalDuration: number;
    bounces: { phase: string; count: number }[];
  };
  // What should we remember?
  learnings: Learning[];
}

interface Learning {
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;      // Human-readable insight
  context: string;      // When is this relevant?
  confidence: number;   // How sure are we?
  source: string;       // Which event triggered this?
}

async function reflectOnRun(traceId: string, ctx: Context): Promise<RunReflection> {
  // Replay all events from this run
  const events = await ctx.bus.replay(traceId);

  // Ask LLM to reflect
  const reflection = await ctx.llm.chat({
    system: RUN_REFLECTION_PROMPT,
    messages: [{
      role: 'user',
      content: `Here is the execution trace of a pipeline run.
Analyze what happened and extract learnings. Focus on:
- Patterns that could help future runs
- Mistakes to avoid
- Strategies that worked well
- Any surprises or anomalies

Events: ${JSON.stringify(summarizeEvents(events))}`,
    }],
  });

  // Store each learning in memory
  for (const learning of reflection.learnings) {
    await ctx.memory.store(learning);
  }

  return reflection;
}

Reflection triggers (not just at run completion):

| Trigger                     | When           | What to Reflect On          |
|-----------------------------|----------------|-----------------------------|
| Run completed               | Every run      | Full execution trace        |
| Phase bounced 2+ times      | During run     | Why are fixes not sticking? |
| Cost exceeded 50% of budget | During run     | Are we being inefficient?   |
| Error rate > 10% in a phase | During run     | What's going wrong?         |
| Human overrode a decision   | On human input | What did we get wrong?      |
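
A small sketch of how those mid-run triggers might be checked after each phase. PhaseSnapshot and shouldReflect are illustrative, not the real orchestrator API:

typescript
// Sketch: decide whether to trigger a mid-run reflection after a phase.
interface PhaseSnapshot {
  bounces: number;
  costUsd: number;
  budgetUsd: number;
  errorRate: number;
  humanOverrode: boolean;
}

function shouldReflect(s: PhaseSnapshot): string | null {
  if (s.bounces >= 2) return 'phase bounced 2+ times';
  if (s.costUsd > s.budgetUsd * 0.5) return 'cost exceeded 50% of budget';
  if (s.errorRate > 0.10) return 'error rate above 10%';
  if (s.humanOverrode) return 'human overrode a decision';
  return null;   // end-of-run reflection still happens regardless
}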

10.5 Loop 4: Human Feedback Loop

This is how the system learns from human behavior — not just explicit feedback, but implicit signals too.

typescript
// ─── Explicit human feedback ──────────────────────────────

// Human dismisses a review finding
async function onFindingDismissed(findingId: string, reason?: string) {
  const [finding] = await db.select().from(findings).where(eq(findings.id, findingId));

  // Mark as dismissed
  await db.update(findings)
    .set({ dismissed: true, dismissedBy: reason })
    .where(eq(findings.id, findingId));

  // Decrease confidence in the pattern that generated this finding
  const relatedPatterns = await memory.recall({
    context: `review finding: ${finding.category} in ${finding.file}`,
    type: 'semantic',
  });
  for (const pattern of relatedPatterns) {
    await memory.update(pattern.id, {
      confidence: pattern.confidence - 0.2,   // Significant penalty
    });
  }

  // Learn from the dismissal
  await memory.store({
    type: 'semantic',
    content: `Finding "${finding.message}" was dismissed by human. ${reason || 'No reason given.'}`,
    context: `reviewing ${finding.category} issues in ${finding.file}`,
    confidence: 0.7,   // Human-sourced = higher confidence
  });

  bus.emit({ type: 'feedback.human_dismissed', payload: { findingId, reason } });
}

// Human approves with modifications
async function onHumanApprovedWithEdits(gateId: string, edits: string) {
  await memory.store({
    type: 'procedural',
    content: `Human approved but made edits: ${edits}`,
    context: `gate ${gateId}`,
    confidence: 0.8,
  });
  bus.emit({ type: 'feedback.human_edited', payload: { gateId, edits } });
}

// ─── Implicit human signals ───────────────────────────────
// Track which suggestions humans actually apply vs ignore
interface ImplicitFeedback {
  // If human applies the suggested fix → boost confidence
  onSuggestedFixApplied(findingId: string): void;

  // If human rewrites the fix differently → learn their preference
  onSuggestedFixRewritten(findingId: string, humanVersion: string): void;

  // If human adds a comment the agent didn't catch → learn the gap
  onHumanAddedComment(pr: string, comment: string): void;

  // Time-to-dismiss: if dismissed within seconds, it was obviously wrong.
  // If dismissed after minutes, it was at least worth considering.
  onDismissalTiming(findingId: string, timeToDecisionMs: number): void;
}

10.6 Loop 5: Production Loop (post-MVP, but designed now)

After deployment, production metrics feed back into the system as new tasks or pattern updates.

typescript
// ─── Post-MVP but the interface is designed now ───────────
interface ProductionFeedback {
  // Monitor detects anomaly correlated with recent deploy
  onAnomalyDetected(anomaly: {
    metric: string;          // "error_rate", "latency_p95"
    baseline: number;
    current: number;
    deploymentId: string;    // Which deploy caused this?
  }): Promise<void>;

  // Error report from production maps back to a code change
  onProductionError(error: {
    stack: string;
    frequency: number;
    firstSeen: Date;
    affectedUsers: number;
    relatedCommit?: string;  // Git blame correlation
  }): Promise<void>;
}

// In MVP: these interfaces exist but the implementation is a no-op.
// Post-MVP: they connect to real monitoring and create new pipeline tasks.

10.7 Feedback Loop Metrics

How do we know the loops are working?

typescript
interface FeedbackMetrics {
  // Inner loop health
  avgIterationsPerPhase: number;     // Trending down = agents getting smarter
  toolSuccessRate: number;           // Trending up = better tool selection

  // Phase loop health
  avgBouncesPerRun: number;          // Trending down = better first-pass quality
  bounceResolutionRate: number;      // % of bounces that fix the issue

  // Run loop health
  learningsPerRun: number;           // Are we extracting value?
  learningApplicationRate: number;   // % of recalled memories that helped
  memoryPrecision: number;           // Recalled memories that were relevant

  // Human loop health
  findingDismissalRate: number;      // Trending down = fewer false positives
  humanOverrideRate: number;         // Trending down = better autonomous decisions
  timeToHumanResponse: number;       // How fast do humans respond to gates?

  // Cross-run improvement
  costPerRun: number;                // Trending down = efficiency improving
  successRateOverTime: number;       // Trending up = system is learning
  firstPassApprovalRate: number;     // % of reviews approved without bounces
}

10.8 Build Priority for Feedback Loops

This is what makes feedback loops the P0 foundation — they must exist before agents are useful:

Week 1 (build with core):
  ✓ Event bus (emit/subscribe) — enables all loops to capture signals
  ✓ Events table in SQLite — persist signals for later analysis
  ✓ Bus replay — reconstruct what happened in any run

Week 2 (build with memory):
  ✓ Run reflection — Loop 3 (post-run learning)
  ✓ Memory store + recall — the destination for all learnings
  ✓ Confidence scoring — weight learnings by source

Week 3 (build with reviewer):
  ✓ Phase bounce logic — Loop 2 (review→fix→re-review)
  ✓ Finding dismissal tracking — Loop 4 (human feedback)
  ✓ Dismissal → confidence decay — close the human loop

Week 4 (build with tester):
  ✓ Test failure → fix → retest bounce — Loop 2 extension
  ✓ Failure pattern memory — learn from repeated test failures

Week 6 (build with orchestrator):
  ✓ Full phase loop orchestration — Loop 2 with configurable bounces
  ✓ Reflection triggers (cost, error rate, bounces) — Loop 3 enrichment
  ✓ Feedback metrics dashboard — measure loop health

Post-MVP:
  ○ Production monitoring integration — Loop 5
  ○ Implicit human signal tracking — Loop 4 enrichment
  ○ Meta-reflection — reflect on reflection quality

11. Memory & Learning (Where Feedback Lands)

How memories flow through the system:

  Execution Events
       │
       ▼
  ┌─ CAPTURE ──────────────────┐
  │ Every tool result, error,   │
  │ human decision, and test    │
  │ outcome is captured as an   │
  │ event.                      │
  └──────────┬─────────────────┘
             ▼
  ┌─ REFLECT ──────────────────┐
  │ After each agent completes: │
  │ "What worked? What didn't?  │
  │  What should we remember?"  │
  │                              │
  │ LLM extracts learnings as   │
  │ structured memories.         │
  └──────────┬─────────────────┘
             ▼
  ┌─ STORE ────────────────────┐
  │ Episodic: "PR #42 review    │
  │   missed a null check"      │
  │ Semantic: "Auth endpoints    │
  │   in this repo use JWT"     │
  │ Procedural: "When tests     │
  │   fail with mock errors,    │
  │   add clearAllMocks()"      │
  └──────────┬─────────────────┘
             ▼
  ┌─ CONSOLIDATE ──────────────┐  (runs periodically)
  │ Merge similar memories       │
  │ Decay unused memories        │
  │ Promote high-frequency       │
  │   episodes to patterns       │
  │ Prune low-confidence entries │
  └──────────┬─────────────────┘
             ▼
  ┌─ RECALL ───────────────────┐
  │ When an agent starts:        │
  │ "What do I know about this   │
  │  kind of task?"              │
  │                              │
  │ Similarity search on         │
  │ context + tags returns       │
  │ relevant memories to inject  │
  │ into the agent's prompt.     │
  └────────────────────────────┘

Confidence & Decay

Confidence starts at:
  - 0.5  for LLM-extracted learnings (unvalidated)
  - 0.7  for human-confirmed learnings
  - 0.9  for learnings from production outcomes

Confidence changes:
  - +0.1  each time the pattern is successfully applied
  - +0.2  when a human confirms the learning
  - -0.05 per week without access (decay)
  - -0.2  when a human dismisses a suggestion based on it

Pruning:
  - Memories below 0.2 confidence are archived
  - Memories not accessed in 90 days are archived
  - Conflicting memories: keep highest confidence
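
A small sketch of those rules as they might run inside the consolidation job. ConfidenceEvent, updateConfidence, and shouldArchive are illustrative names, not the real consolidate.ts API:

typescript
// Sketch: apply the confidence and pruning rules above.
type ConfidenceEvent =
  | { kind: 'applied_successfully' }
  | { kind: 'human_confirmed' }
  | { kind: 'human_dismissed' }
  | { kind: 'weekly_decay'; weeksIdle: number };

function updateConfidence(current: number, event: ConfidenceEvent): number {
  let next = current;
  switch (event.kind) {
    case 'applied_successfully': next += 0.1; break;
    case 'human_confirmed':      next += 0.2; break;
    case 'human_dismissed':      next -= 0.2; break;
    case 'weekly_decay':         next -= 0.05 * event.weeksIdle; break;
  }
  return Math.min(1, Math.max(0, next));
}

// Pruning policy: archive below 0.2 confidence, or after 90 days without access.
function shouldArchive(confidence: number, lastAccessed: Date, now = new Date()): boolean {
  const daysIdle = (now.getTime() - lastAccessed.getTime()) / 86_400_000;
  return confidence < 0.2 || daysIdle > 90;
}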

12. LLM Provider Abstraction

typescript
// ─── llm.ts ───────────────────────────────────────────────
// Provider-agnostic. Swap Claude for OpenAI or local models.
interface LLMProvider {
  chat(request: ChatRequest): Promise<ChatResponse>;
  embed(text: string): Promise<Float32Array>;
}

interface ChatRequest {
  system: string;
  messages: Message[];
  tools?: ToolSchema[];
  temperature?: number;
  maxTokens?: number;
}

interface ChatResponse {
  content: string;
  toolCalls?: ToolCall[];
  done: boolean;
  result?: unknown;
  usage: { promptTokens: number; completionTokens: number };
  cost: number;   // USD, calculated from model pricing
}

// Model selection by task complexity + remaining budget:
//
//   Planning / Architecture  → Claude Sonnet (strong reasoning)
//   Implementation           → Claude Sonnet (code generation)
//   Review (AI layer)        → Claude Haiku  (fast, cheap, good enough)
//   Test analysis            → Claude Haiku
//   Reflection               → Claude Haiku
//   Embedding                → Local model or API (cheap, fast)
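
A minimal sketch of budget-aware model routing following that mapping. The 20% remaining-budget threshold and the pickModel helper are assumptions for illustration, not the real llm.ts:

typescript
// Sketch: route a task to the strong or fast model, falling back under budget pressure.
type ForgeTask = 'planning' | 'implementation' | 'review' | 'test_analysis' | 'reflection';

function pickModel(
  task: ForgeTask,
  budget: { remainingUsd: number; perRunUsd: number },
  models: { strong: string; fast: string },
): string {
  const needsStrongModel = task === 'planning' || task === 'implementation';
  // With less than 20% of the run budget left, prefer the cheap model everywhere.
  const lowBudget = budget.remainingUsd < budget.perRunUsd * 0.2;
  return needsStrongModel && !lowBudget ? models.strong : models.fast;
}

// Usage with the config from Section 14:
//   pickModel('review', { remainingUsd: 42, perRunUsd: 50 },
//             { strong: config.llm.model, fast: config.llm.fastModel })
//   → config.llm.fastModel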

13. Build Order

The research roadmap says "start with feedback loops." I agree, but with a twist: build the skeleton first, then fill in the organs.

Week 1 ── Core Skeleton
├── types.ts (core abstractions from Section 3)
├── bus.ts (event bus)
├── config.ts (safety defaults from Section 8)
├── errors.ts (error taxonomy)
├── schema.ts (Drizzle schema from Section 5)
├── llm.ts (provider abstraction)
└── base.ts (base agent with loop + reflection)

Week 2 ── Memory + Tools
├── store.ts (memory CRUD)
├── episodes.ts + patterns.ts (episodic + semantic memory)
├── registry.ts (tool registry)
├── git.ts, runner.ts, linter.ts (essential tools)
└── First integration test: agent loop with real LLM

Week 3 ── Reviewer Agent (first vertical slice)
├── reviewer.ts (3-layer review: static → security → AI)
├── github.ts (PR integration)
├── Risk scoring
└── Findings persistence + dismissal learning

Week 4 ── Tester Agent
├── tester.ts (test selection + execution + analysis)
├── test-runner.ts (Jest/Vitest integration)
├── Failure analysis with LLM
└── Test gap detection

Week 5 ── Planner + Implementer Agents
├── planner.ts (requirements → plan → tasks)
├── implementer.ts (tasks → code)
├── Self-validation loop (typecheck + test after each change)
└── Feedback from reviewer/tester flows back

Week 6 ── Orchestrator
├── pipeline.ts (state machine: plan → implement → review → test → deploy)
├── checkpoint.ts (save/resume between phases)
├── context.ts (shared state across agents)
├── gates.ts (human approval integration)
└── End-to-end flow: "forge run" works

Week 7 ── CLI + Polish
├── CLI commands: run, review, test, status, history
├── Terminal UI (progress, findings display)
├── forge.config.ts (per-project configuration)
└── Consolidation job (memory pruning + pattern extraction)

Week 8 ── Harden + Document
├── Error recovery (retry, fallback, checkpoint resume)
├── Cost tracking dashboard
├── Real-world testing on actual projects
└── Edge case handling

14. Configuration

One config file per project. Sensible defaults, override what you need.

typescript
// ─── forge.config.ts ──────────────────────────────────────
import { defineConfig } from 'forge';

export default defineConfig({
  // Project basics
  name: 'my-app',
  language: 'typescript',

  // LLM provider
  llm: {
    provider: 'anthropic',                    // 'anthropic' | 'openai' | 'ollama'
    model: 'claude-sonnet-4-5-20250929',
    fastModel: 'claude-haiku-4-5-20251001',   // For cheap tasks
  },

  // Tools
  tools: {
    testCommand: 'bun test',
    lintCommand: 'bun run lint',
    buildCommand: 'bun run build',
    typecheckCommand: 'bun run typecheck',
  },

  // Safety (override defaults from Section 8)
  safety: {
    costPerRun: 50,       // USD max
    costPerDay: 200,
    automationLevel: 1,   // 0-4, see Section 8.3
  },

  // GitHub integration
  github: {
    owner: 'myorg',
    repo: 'my-app',
    reviewOnPR: true,     // Auto-review new PRs
    postComments: true,   // Post findings as PR comments
  },

  // Memory
  memory: {
    dbPath: '.forge/memory.db',   // SQLite database location
    consolidateInterval: '1d',    // Run consolidation daily
    maxMemories: 10_000,          // Prune beyond this
  },
});

15. What This Design Explicitly Defers

These are real concerns acknowledged by the research but not in scope for MVP:

| Deferred                      | Why                                              | When                       |
|-------------------------------|--------------------------------------------------|----------------------------|
| Parallel agent execution      | Sequential is simpler, prove it works first      | Post-MVP                   |
| Kubernetes deployment         | Single Bun process is fine for a tool            | If scaling needed          |
| Vector database               | SQLite with manual similarity is enough to start | When memory > 100K entries |
| Multi-repo intelligence       | Focus on single repo first                       | Q3+                        |
| Autonomous deployment         | Always require human approval for now            | After Level 3 automation   |
| Natural language requirements | Start with structured task descriptions          | Q3+                        |
| Real-time dashboards          | CLI output + SQLite queries are enough           | When team size > 1         |
| ClickHouse / Kafka            | SQLite event table handles MVP observability     | At scale                   |

This design synthesizes research from 13 topics across agentic loops, feedback mechanisms, code review, testing, CI/CD, orchestration, evaluation, self-improvement, reflection, human-AI collaboration, context management, tool integration, and error recovery.