Forge: Agentic SDLC Orchestrator — System Design
An opinionated, buildable design for an AI-driven software development lifecycle tool. Distilled from 13 research topics, 4 synthesis documents, and 70+ interface specifications.
1. What We're Building
Forge is a CLI tool and TypeScript library that orchestrates AI agents through the software development lifecycle. You give it a task — a feature, bug fix, or refactor — and it plans, implements, reviews, tests, and deploys the change. Humans stay in the loop for high-stakes decisions. The system learns from every execution.
It is not a distributed microservice platform. It's a single Bun process backed by SQLite that coordinates LLM calls, tool executions, and human checkpoints through a pipeline.
Design Principles
- Start simple, earn complexity — Sequential pipeline first, parallel agents later
- Learn from everything — Every execution feeds the memory system
- Safe by default — Circuit breakers and human gates baked in, not bolted on
- Observable — Every decision logged with rationale, every action attributed
- Tool-agnostic — Swap LLM providers, CI systems, or git hosts without rewriting agents
2. Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ CLI / API │
│ forge run "add user auth" forge review PR#42 forge test │
└────────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR │
│ │
│ Pipeline: Plan → Implement → Review → Test → Deploy │
│ State Machine · Checkpoints · Human Gates │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ EVENT BUS │ │
│ │ emit() · on() · replay() · snapshot() │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────┬──────────┬──────────┬──────────┬──────────┬──────────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┐┌─────────┐┌─────────┐┌─────────┐┌─────────┐
│ Planner ││Implement││Reviewer ││ Tester ││Deployer │
│ Agent ││ Agent ││ Agent ││ Agent ││ Agent │
└────┬────┘└────┬────┘└────┬────┘└────┬────┘└────┬────┘
│ │ │ │ │
└──────────┴──────────┴──────────┴──────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ TOOL LAYER │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ LLM │ │ Git │ │GitHub│ │ Test │ │ Lint │ │ Shell│ │
│ │Client│ │ Ops │ │ API │ │Runner│ │/Fmt │ │ Exec │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ MEMORY LAYER │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Episodic │ │ Semantic │ │Procedural │ │ Events │ │
│ │ (what │ │ (patterns │ │ (how to │ │ (audit │ │
│ │ happened)│ │ & facts) │ │ do stuff)│ │ trail) │ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ SQLite via Drizzle ORM │
└──────────────────────────────────────────────────────────────────┘
3. Core Abstractions
Everything in the system is built on six core types; every other type composes them.
```typescript
// ─── The Agent Loop ───────────────────────────────────────
// Every agent follows the same cycle: perceive → reason → act → learn.
// The orchestrator runs agents. Agents run tools. Tools do work.
interface Agent {
  id: string;
  type: AgentType;
  /** Run one cycle of the agent loop */
  execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput>;
}

// ─── The Event ────────────────────────────────────────────
// Everything that happens is an event. Events are the source of truth.
// The memory system, audit trail, and observability all consume events.
interface ForgeEvent {
  id: string;
  traceId: string;   // Groups events in one pipeline run
  timestamp: Date;
  source: string;    // Which agent/component emitted this
  type: string;      // Dot-namespaced: "review.finding", "test.failed"
  payload: unknown;
  cost?: { tokens: number; usd: number };
}

// ─── The Tool ─────────────────────────────────────────────
// Tools are the hands of agents. An agent reasons about what to do,
// then executes a tool to do it. Tools are sandboxed and audited.
interface Tool<TInput = unknown, TOutput = unknown> {
  name: string;
  description: string;
  schema: { input: ZodSchema<TInput>; output: ZodSchema<TOutput> };
  execute(input: TInput, ctx: ToolContext): Promise<TOutput>;
}

// ─── The Phase ────────────────────────────────────────────
// The pipeline is a sequence of phases. Each phase has an agent,
// input/output types, and safety controls.
interface Phase {
  name: PhaseName;
  agent: Agent;
  guards: Guard[];            // Pre-conditions to enter this phase
  gates: HumanGate[];         // Human approval checkpoints
  breakers: CircuitBreaker[]; // Safety limits
  next: PhaseName | null;
}

// ─── The Memory ───────────────────────────────────────────
// Memories are what the system learned. They have types, relevance
// scores, and decay over time if not reinforced.
interface Memory {
  id: string;
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;
  embedding?: Float32Array;  // For similarity search
  confidence: number;        // 0-1, decays without reinforcement
  context: string;           // When is this memory relevant?
  createdAt: Date;
  lastAccessed: Date;
  accessCount: number;
}

// ─── The Checkpoint ───────────────────────────────────────
// Checkpoints save pipeline state between phases. If something fails,
// we can resume from the last checkpoint instead of starting over.
interface Checkpoint {
  id: string;
  traceId: string;
  phase: PhaseName;
  state: Record<string, unknown>; // Serialized phase outputs so far
  timestamp: Date;
}
```
4. Module Map
forge/
│
├── src/
│ ├── core/ # Foundation — build this first
│ │ ├── types.ts # Core types from Section 3
│ │ ├── bus.ts # In-memory event bus
│ │ ├── config.ts # Runtime configuration + defaults
│ │ └── errors.ts # Error taxonomy (source, severity, recoverability)
│ │
│ ├── safety/ # Guardrails — build alongside core
│ │ ├── breakers.ts # Circuit breakers (iteration, cost, time, error-rate)
│ │ ├── gates.ts # Human approval gates
│ │ └── budget.ts # Cost tracking and limits
│ │
│ ├── memory/ # Learning foundation — Week 1-2
│ │ ├── schema.ts # Drizzle SQLite schema
│ │ ├── store.ts # Memory CRUD + similarity search
│ │ ├── episodes.ts # Episodic memory (what happened)
│ │ ├── patterns.ts # Semantic memory (pattern extraction)
│ │ ├── procedures.ts # Procedural memory (strategies that work)
│ │ └── consolidate.ts # Knowledge consolidation + pruning
│ │
│ ├── tools/ # Tool layer — Week 1-2
│ │ ├── registry.ts # Tool registry + discovery
│ │ ├── sandbox.ts # Execution sandboxing
│ │ ├── llm.ts # LLM provider abstraction
│ │ ├── git.ts # Git operations
│ │ ├── github.ts # GitHub API (PRs, reviews, webhooks)
│ │ ├── runner.ts # Shell command execution
│ │ ├── linter.ts # ESLint/Biome integration
│ │ └── test-runner.ts # Jest/Vitest execution + parsing
│ │
│ ├── agents/ # Agent implementations — Week 3+
│ │ ├── base.ts # Base agent with loop, reflection, safety
│ │ ├── planner.ts # Requirements → architecture → tasks
│ │ ├── implementer.ts # Tasks → code
│ │ ├── reviewer.ts # Code → findings + risk score
│ │ ├── tester.ts # Code → tests → results → analysis
│ │ └── deployer.ts # Artifact → canary → rollout
│ │
│ ├── orchestrator/ # Pipeline coordination — Week 7-8
│ │ ├── pipeline.ts # Phase sequencing state machine
│ │ ├── checkpoint.ts # State persistence between phases
│ │ └── context.ts # Shared context across agents
│ │
│ └── cli/ # User interface
│ ├── index.ts # CLI entry point
│ ├── commands/ # run, review, test, status, etc.
│ └── ui.ts # Terminal output formatting
│
├── drizzle/ # Database migrations
├── forge.config.ts # Project-level configuration
└── package.json
5. Data Model (SQLite / Drizzle)
This is the ground truth. Everything the system knows lives here.
```typescript
// ─── schema.ts ────────────────────────────────────────────
import { sqliteTable, text, integer, real, blob } from 'drizzle-orm/sqlite-core';

// ─── Events (Append-only audit trail) ─────────────────────
export const events = sqliteTable('events', {
  id:         text('id').primaryKey(),                  // ulid
  traceId:    text('trace_id').notNull(),               // groups one pipeline run
  timestamp:  integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
  source:     text('source').notNull(),                 // agent or component id
  type:       text('type').notNull(),                   // "plan.started", "review.finding", etc.
  phase:      text('phase'),                            // current pipeline phase
  payload:    text('payload', { mode: 'json' }),        // event-specific data
  tokensUsed: integer('tokens_used'),
  costUsd:    real('cost_usd'),
  durationMs: integer('duration_ms'),
});

// ─── Memories (What the system has learned) ───────────────
export const memories = sqliteTable('memories', {
  id:           text('id').primaryKey(),
  type:         text('type').notNull(),                 // episodic | semantic | procedural
  content:      text('content').notNull(),              // Human-readable description
  context:      text('context').notNull(),              // When is this relevant?
  embedding:    blob('embedding'),                      // Float32Array for similarity search
  confidence:   real('confidence').notNull(),           // 0.0 - 1.0
  source:       text('source'),                         // What event created this?
  tags:         text('tags', { mode: 'json' }),         // ["typescript", "testing", "auth"]
  createdAt:    integer('created_at', { mode: 'timestamp_ms' }).notNull(),
  lastAccessed: integer('last_accessed', { mode: 'timestamp_ms' }).notNull(),
  accessCount:  integer('access_count').notNull().default(0),
});

// ─── Patterns (Extracted from episodes) ───────────────────
export const patterns = sqliteTable('patterns', {
  id:          text('id').primaryKey(),
  type:        text('type').notNull(),                  // success | failure | approach
  trigger:     text('trigger').notNull(),               // What situation activates this?
  pattern:     text('pattern').notNull(),               // The pattern itself
  resolution:  text('resolution'),                      // What to do when triggered
  frequency:   integer('frequency').notNull().default(1),
  successRate: real('success_rate'),                    // How often this works
  confidence:  real('confidence').notNull(),
  lastSeen:    integer('last_seen', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Checkpoints (Pipeline state snapshots) ───────────────
export const checkpoints = sqliteTable('checkpoints', {
  id:        text('id').primaryKey(),
  traceId:   text('trace_id').notNull(),
  phase:     text('phase').notNull(),
  state:     text('state', { mode: 'json' }).notNull(),
  timestamp: integer('timestamp', { mode: 'timestamp_ms' }).notNull(),
});

// ─── Runs (Pipeline execution history) ────────────────────
export const runs = sqliteTable('runs', {
  id:           text('id').primaryKey(),                // = traceId
  task:         text('task').notNull(),                 // Human description of what was requested
  status:       text('status').notNull(),               // pending | running | completed | failed | cancelled
  currentPhase: text('current_phase'),
  config:       text('config', { mode: 'json' }),       // Runtime config snapshot
  startedAt:    integer('started_at', { mode: 'timestamp_ms' }).notNull(),
  completedAt:  integer('completed_at', { mode: 'timestamp_ms' }),
  totalCostUsd: real('total_cost_usd').default(0),
  totalTokens:  integer('total_tokens').default(0),
  error:        text('error'),                          // If failed, why
});

// ─── Findings (Review/test issues) ────────────────────────
export const findings = sqliteTable('findings', {
  id:          text('id').primaryKey(),
  runId:       text('run_id').notNull(),
  phase:       text('phase').notNull(),                 // review | test
  severity:    text('severity').notNull(),              // info | warning | error | critical
  category:    text('category').notNull(),              // style | security | correctness | performance
  message:     text('message').notNull(),
  file:        text('file'),
  line:        integer('line'),
  confidence:  real('confidence'),
  fixable:     integer('fixable', { mode: 'boolean' }),
  fix:         text('fix'),                             // Suggested code change
  dismissed:   integer('dismissed', { mode: 'boolean' }).default(false),
  dismissedBy: text('dismissed_by'),                    // Who dismissed and why — for learning
});
```
6. The Agent Loop
Every agent runs the same core loop. The only thing that changes is the tools available and the reasoning prompt.
┌──────────────────────────────────────────────────┐
│ AGENT LOOP │
│ │
│ ┌──────────┐ │
│ │ PERCEIVE │ ← Gather context: │
│ │ │ - Task/phase input │
│ │ │ - Relevant memories │
│ │ │ - Previous iteration results │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ REASON │ ← LLM decides: │
│ │ │ - What tool to use next │
│ │ │ - Or: task is complete │
│ │ │ - Or: need human input │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ ACT │ ← Execute tool: │
│ │ │ - Validate input (Zod) │
│ │ │ - Run in sandbox │
│ │ │ - Capture result + metrics │
│ └────┬─────┘ │
│ ▼ │
│ ┌──────────┐ │
│ │ LEARN │ ← After each iteration: │
│ │ │ - Log event to bus │
│ │ │ - Check circuit breakers │
│ │ │ - Update working memory │
│ │ │ - Reflect if error occurred │
│ └────┬─────┘ │
│ │ │
│ ├── Continue? ──▶ Loop back to PERCEIVE │
│ ├── Done? ──────▶ Return PhaseOutput │
│ ├── Stuck? ─────▶ Escalate to human │
│ └── Breaker? ───▶ Halt with error │
└──────────────────────────────────────────────────┘
```typescript
// ─── base.ts ──────────────────────────────────────────────
abstract class BaseAgent implements Agent {
  abstract type: AgentType;
  abstract tools: Tool[];
  abstract systemPrompt: string;

  async execute(input: PhaseInput, ctx: AgentContext): Promise<PhaseOutput> {
    let iteration = 0;
    let workingMemory = await this.perceive(input, ctx);

    while (true) {
      iteration++;

      // ── Safety check ──
      const breakerResult = ctx.safety.check({ iteration, cost: ctx.cost, elapsed: ctx.elapsed });
      if (breakerResult.shouldBreak) {
        ctx.bus.emit({ type: `${this.type}.breaker_tripped`, payload: breakerResult });
        throw new CircuitBreakerError(breakerResult);
      }

      // ── Reason: ask LLM what to do ──
      const decision = await ctx.llm.chat({
        system: this.systemPrompt,
        messages: workingMemory.messages,
        tools: this.tools.map(t => t.schema),
      });

      // ── Done? ──
      if (decision.done) {
        const output = decision.result as PhaseOutput;
        ctx.bus.emit({ type: `${this.type}.completed`, payload: output });
        await this.reflect(ctx, 'success');
        return output;
      }

      // ── Act: execute the chosen tool ──
      const tool = this.tools.find(t => t.name === decision.toolCall.name);
      const result = await this.executeTool(tool, decision.toolCall.input, ctx);

      // ── Learn: update context ──
      workingMemory = this.updateWorkingMemory(workingMemory, decision, result);
      if (result.error) {
        await this.reflect(ctx, 'error', result.error);
      }
    }
  }

  private async perceive(input: PhaseInput, ctx: AgentContext): Promise<WorkingMemory> {
    const relevantMemories = await ctx.memory.recall({
      context: input.task,
      type: this.type,
      limit: 10,
    });
    return {
      messages: [
        { role: 'user', content: this.buildPrompt(input, relevantMemories) },
      ],
    };
  }

  private async reflect(ctx: AgentContext, outcome: string, error?: Error) {
    // Post-execution reflection: extract learnings
    const reflection = await ctx.llm.chat({
      system: REFLECTION_PROMPT,
      messages: [{
        role: 'user',
        content: `Outcome: ${outcome}. ${error ? `Error: ${error.message}` : ''}\nWhat should we remember for next time?`,
      }],
    });

    if (reflection.learnings) {
      for (const learning of reflection.learnings) {
        await ctx.memory.store({
          type: 'procedural',
          content: learning.content,
          context: learning.context,
          confidence: learning.confidence,
        });
      }
    }
  }
}
```
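A concrete agent only supplies its tools and prompt; the loop is inherited. A minimal sketch (the `AgentType` value, tool instances, and prompt wording are illustrative, not part of the spec):

```typescript
// Hypothetical concrete agent: tool instances are assumed to come from the tool registry.
class ReviewerAgent extends BaseAgent {
  type: AgentType = 'reviewer';
  tools = [runLinterTool, runSecurityScanTool, llmReviewTool, readFileTool];
  systemPrompt = `You are a code reviewer. Inspect the diff, report findings with
severity and file/line locations, and decide: approve, request_changes, or require_human.`;
}
```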
7. Agent Designs
7.1 Planner Agent
Input: Natural language task description
Output: Implementation plan with architecture, tasks, risk assessment
Tools: read_file, glob, grep, llm_analyze
"Add user authentication"
│
▼
┌─ PERCEIVE ─┐
│ Read existing codebase structure │
│ Recall patterns for "auth" from memory │
│ Check for existing auth utilities │
└──────┬──────┘
▼
┌── REASON ──┐
│ Decompose into tasks: │
│ 1. Design auth schema │
│ 2. Create login/register endpoints │
│ 3. Add session middleware │
│ 4. Protect routes │
│ Estimate risk: MEDIUM (new feature) │
│ Identify dependencies │
└──────┬──────┘
▼
OUTPUT: ImplementationPlan {
architecture: { components, interfaces, decisions }
tasks: Task[] // ordered, with dependencies
risk: RiskAssessment // determines review depth
estimates: { complexity, effort }
}
7.2 Implementer Agent
Input: Implementation plan + task list
Output: Code changes (files modified/created)
Tools: read_file, write_file, run_command, llm_generate, search_code
ImplementationPlan
│
▼
For each task (sequential MVP, parallel later):
┌─ PERCEIVE ─┐
│ Read target files │
│ Understand existing patterns │
│ Load procedural memories for this domain │
└──────┬──────┘
▼
┌── REASON ──┐
│ Generate code change │
│ Self-validate: does this match the spec? │
│ Check for obvious issues │
└──────┬──────┘
▼
┌─── ACT ────┐
│ Write files │
│ Run typecheck │
│ Run affected tests │
│ Fix issues if found, loop back │
└──────┬──────┘
▼
OUTPUT: CodeChanges {
files: FileChange[] // path, before, after
testsAdded: string[] // new test files
validated: boolean // typecheck + tests pass
}
7.3 Reviewer Agent
Input: Code changes (diff)
Output: Review with findings, risk score, gate decision
Tools: run_linter, run_security_scan, llm_review, read_file
CodeChanges
│
▼
┌─ Layer 1: Static Analysis ──┐ (fast, cheap, deterministic)
│ ESLint / Biome │
│ TypeScript strict check │
│ Formatting check │
└──────────┬───────────────────┘
▼
┌─ Layer 2: Security Scan ────┐ (fast, important)
│ Secret detection │
│ Dependency vulnerability │
│ Known insecure patterns │
└──────────┬───────────────────┘
▼
┌─ Layer 3: AI Review ────────┐ (slow, expensive — only if risk > low)
│ Logic correctness │
│ Edge cases │
│ Performance implications │
│ Architecture fit │
└──────────┬───────────────────┘
▼
┌─ Synthesis ─────────────────┐
│ Deduplicate findings │
│ Calculate risk score │
│ Determine gate decision │
└──────────┬───────────────────┘
▼
OUTPUT: ReviewResult {
findings: Finding[]
riskScore: { total, level: 'low'|'medium'|'high'|'critical' }
decision: 'approve' | 'request_changes' | 'require_human'
}
Risk-based review depth:
| Risk Level | Static | Security | AI Review | Human Required |
|---|---|---|---|---|
| Low | Yes | Yes | No | No |
| Medium | Yes | Yes | Yes | Optional |
| High | Yes | Yes | Yes | Yes |
| Critical | Yes | Yes | Yes | Yes + Architect |
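The same mapping as code, in rough sketch form (the `REVIEW_DEPTH` table and `ReviewDepth` shape are illustrative, not part of the interfaces above):

```typescript
type RiskLevel = 'low' | 'medium' | 'high' | 'critical';

interface ReviewDepth {
  static: boolean;
  security: boolean;
  aiReview: boolean;
  human: 'none' | 'optional' | 'required' | 'required_plus_architect';
}

// Mirrors the risk-based review depth table above.
const REVIEW_DEPTH: Record<RiskLevel, ReviewDepth> = {
  low:      { static: true, security: true, aiReview: false, human: 'none' },
  medium:   { static: true, security: true, aiReview: true,  human: 'optional' },
  high:     { static: true, security: true, aiReview: true,  human: 'required' },
  critical: { static: true, security: true, aiReview: true,  human: 'required_plus_architect' },
};
```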
7.4 Tester Agent
Input: Code changes + existing test suite
Output: Test results + failure analysis + coverage
Tools: run_tests, llm_analyze, read_file, write_file
CodeChanges
│
▼
┌─ Select ────────────────────┐
│ Which tests to run? │
│ - Tests covering changed files (always)
│ - Related integration tests (if medium+ risk)
│ - Full suite (if high+ risk)
└──────────┬───────────────────┘
▼
┌─ Execute ───────────────────┐
│ Run selected tests │
│ Retry failures once (flaky?) │
│ Collect coverage │
└──────────┬───────────────────┘
▼
┌─ Analyze ───────────────────┐
│ If failures: │
│ - Classify: real bug vs flaky vs env issue
│ - Root cause analysis (LLM)
│ - Suggest fix │
│ If low coverage: │
│ - Identify gaps │
│ - Generate missing tests │
└──────────┬───────────────────┘
▼
OUTPUT: TestResult {
summary: { total, passed, failed, skipped }
coverage: { line, branch, function, diff }
failures: FailureAnalysis[] // with root cause + fix
generatedTests: TestFile[] // new tests if gaps found
}
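A sketch of the failure classification step (real bug vs flaky vs environment issue). The `runSingleTest` helper is a hypothetical wrapper around the run_tests tool, and the heuristics are assumptions:

```typescript
type FailureClass = 'real_bug' | 'flaky' | 'env_issue';

async function classifyFailure(testId: string, firstRun: { error: string }): Promise<FailureClass> {
  const retry = await runSingleTest(testId);               // Re-run the failing test once
  if (retry.passed) return 'flaky';                        // Passed on retry → likely flaky
  if (/ECONNREFUSED|ENOENT|timeout/i.test(firstRun.error)) // Infrastructure-looking symptoms
    return 'env_issue';
  return 'real_bug';                                       // Deterministic failure → LLM root-cause analysis
}
```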
7.5 Deployer Agent
Input: Validated code + test results
Output: Deployment status
Tools: run_command, github_api, read_file
ValidatedCode + TestResults
│
▼
┌─ Build ─────────────────────┐
│ Run build command │
│ Verify artifact │
└──────────┬───────────────────┘
▼
┌─ Gate: Human Approval ──────┐ (always for production)
│ Show summary: │
│ - What changed │
│ - Risk score │
│ - Test results │
│ - Findings │
│ Wait for approval │
└──────────┬───────────────────┘
▼
┌─ Deploy ────────────────────┐
│ Strategy based on risk: │
│ - Low: direct deploy │
│ - Medium: canary (5%→25%→100%)
│ - High: canary (5%→10%→25%→50%→100%)
└──────────┬───────────────────┘
▼
┌─ Verify ────────────────────┐
│ Health check endpoints │
│ Error rate vs baseline │
│ Latency vs baseline │
│ Auto-rollback if unhealthy │
└──────────┬───────────────────┘
▼
OUTPUT: DeploymentResult {
status: 'healthy' | 'degraded' | 'rolled_back'
metrics: { errorRate, latency, throughput }
url: string
}
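The risk-to-rollout mapping in the Deploy step could look like this sketch (the stage percentages mirror the diagram; the exact shape is illustrative):

```typescript
// Traffic percentages per rollout stage, keyed by risk level.
const ROLLOUT_STAGES: Record<'low' | 'medium' | 'high', number[]> = {
  low:    [100],                 // Direct deploy
  medium: [5, 25, 100],          // Short canary
  high:   [5, 10, 25, 50, 100],  // Extended canary
};
```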
8. Safety System
8.1 Circuit Breakers
Four breakers run continuously. Any one can halt execution.
```typescript
interface SafetyConfig {
  breakers: {
    iteration: {
      default: 10,
      planning: 20,
      implementation: 50,
      testing: 5,
      deployment: 3,
      stagnationThreshold: 3,        // Consecutive iterations with no progress
    };
    cost: {
      perPhase: {                    // USD
        planning: 5,
        implementation: 10,
        review: 2,
        testing: 3,
        deployment: 2,
      },
      perRun: 50,
      perDay: 200,
    };
    time: {                          // milliseconds
      planning: 30 * 60_000,         // 30 min
      implementation: 60 * 60_000,   // 1 hour
      review: 30 * 60_000,           // 30 min
      testing: 20 * 60_000,          // 20 min
      deployment: 15 * 60_000,       // 15 min
      totalPipeline: 120 * 60_000,   // 2 hours
    };
    errorRate: {
      window: 5 * 60_000,            // 5 minute sliding window
      warning: 0.10,                 // 10%
      critical: 0.25,                // 25% → halt
    };
  };
}
```
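A minimal sketch of how a breaker check against these limits might look (the `BreakerCheck` shape and context field names are assumptions for illustration):

```typescript
interface BreakerCheck { shouldBreak: boolean; reason?: string }

function checkBreakers(
  limits: { maxIterations: number; maxCostUsd: number; maxDurationMs: number },
  ctx: { iteration: number; costUsd: number; elapsedMs: number },
): BreakerCheck {
  if (ctx.iteration > limits.maxIterations) return { shouldBreak: true, reason: 'iteration limit exceeded' };
  if (ctx.costUsd > limits.maxCostUsd)      return { shouldBreak: true, reason: 'cost budget exceeded' };
  if (ctx.elapsedMs > limits.maxDurationMs) return { shouldBreak: true, reason: 'time limit exceeded' };
  return { shouldBreak: false };
}
```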
8.2 Human Gates
```typescript
// Gates are checkpoints where execution pauses for human input.
// They fire based on conditions, not every time.
const GATES: HumanGate[] = [
  {
    id: 'architecture_approval',
    phase: 'planning',
    condition: (plan) => plan.risk.level === 'high' || plan.risk.level === 'critical',
    prompt: 'Review proposed architecture before implementation begins.',
    timeout: 24 * 60 * 60_000, // 24 hours
  },
  {
    id: 'production_deploy',
    phase: 'deployment',
    condition: (ctx) => ctx.environment === 'production',
    prompt: 'Approve production deployment.',
    timeout: 60 * 60_000, // 1 hour
  },
  {
    id: 'security_findings',
    phase: 'review',
    condition: (review) => review.findings.some(f => f.severity === 'critical' && f.category === 'security'),
    prompt: 'Critical security finding requires human review.',
    timeout: 12 * 60 * 60_000, // 12 hours
  },
  {
    id: 'cost_overrun',
    phase: '*',
    condition: (ctx) => ctx.cost.current > ctx.cost.budget * 0.8,
    prompt: 'Approaching cost budget. Continue?',
    timeout: 2 * 60 * 60_000, // 2 hours
  },
];
```
8.3 Automation Ladder
The system starts conservative and earns autonomy based on track record.
Level 0 ─── Human does everything (current state)
│
│ After: system deployed, basic metrics working
▼
Level 1 ─── AI suggests, human decides
│ - Review comments are suggestions only
│ - Test failures analyzed but human fixes
│ - Deploy requires explicit approval
│
│ After: false positive rate < 20%, 50+ successful runs
▼
Level 2 ─── AI acts, human reviews
│ - Auto-fix formatting and simple lint issues
│ - Auto-approve low-risk reviews
│ - Still requires human for medium+ risk
│
│ After: 200+ runs, <5% false positive rate, 0 missed critical bugs
▼
Level 3 ─── AI acts, human notified
│ - Auto-merge low-risk PRs
│ - Auto-deploy to staging
│ - Human notified, can override within window
│
│ After: 500+ runs, proven safety record
▼
Level 4 ─── Full autonomy (low-risk only)
- Fully autonomous for low-risk changes
- Human gates remain for medium+ risk
- Human gates ALWAYS remain for production deploys
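A sketch of how the ladder could gate autonomous actions (the action names and level thresholds mirror the ladder above; the function is illustrative, not a fixed API):

```typescript
function allowedWithoutHuman(
  level: 0 | 1 | 2 | 3 | 4,
  action:
    | 'suggest_review_comment'
    | 'auto_fix_lint'
    | 'auto_approve_low_risk'
    | 'auto_merge_low_risk_pr'
    | 'deploy_production',
): boolean {
  if (action === 'deploy_production') return false;           // Always human-gated
  if (action === 'suggest_review_comment') return level >= 1; // Level 1: AI suggests
  if (action === 'auto_fix_lint') return level >= 2;          // Level 2: AI acts, human reviews
  if (action === 'auto_approve_low_risk') return level >= 2;
  if (action === 'auto_merge_low_risk_pr') return level >= 3; // Level 3: AI acts, human notified
  return false;
}
```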
9. Event Bus & Observability
```typescript
// ─── bus.ts ───────────────────────────────────────────────
// Simple in-memory pub/sub. Events are also persisted to SQLite.
class EventBus {
  private handlers = new Map<string, Set<EventHandler>>();
  private db: DrizzleDB;

  async emit(event: Omit<ForgeEvent, 'id' | 'timestamp'>): Promise<void> {
    const full: ForgeEvent = {
      ...event,
      id: ulid(),
      timestamp: new Date(),
    };

    // Persist to SQLite (append-only)
    await this.db.insert(events).values(full);

    // Notify subscribers
    const typeHandlers = this.handlers.get(event.type) ?? new Set();
    const wildcardHandlers = this.handlers.get('*') ?? new Set();
    for (const handler of [...typeHandlers, ...wildcardHandlers]) {
      handler(full);
    }
  }

  on(type: string, handler: EventHandler): () => void { /* subscribe */ }

  async replay(traceId: string): Promise<ForgeEvent[]> {
    return this.db
      .select()
      .from(events)
      .where(eq(events.traceId, traceId))
      .orderBy(events.timestamp);
  }
}
```
Key events emitted by the system:
| Event Type | Source | When |
|---|---|---|
| run.started | Orchestrator | Pipeline begins |
| phase.entered | Orchestrator | Each phase transition |
| agent.iteration | Base Agent | Each loop iteration |
| tool.executed | Tool Layer | Each tool call |
| finding.detected | Reviewer | Issue found in code |
| test.failed | Tester | Test failure |
| gate.requested | Safety | Human approval needed |
| gate.approved | Safety | Human approved |
| breaker.tripped | Safety | Circuit breaker fired |
| memory.stored | Memory | New learning saved |
| run.completed | Orchestrator | Pipeline finished |
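A short usage sketch for wiring subscribers to these events (assumes a constructed `EventBus` instance from the snippet above):

```typescript
declare const bus: EventBus;

// React to tripped breakers, e.g. to notify the operator
bus.on('breaker.tripped', (e) => console.error('halted:', e.payload));

// Wildcard subscription drives a live CLI progress view
const unsubscribe = bus.on('*', (e) => process.stdout.write(`[${e.type}] from ${e.source}\n`));

// Later: stop listening
unsubscribe();
```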
10. Feedback Loops (The Nervous System)
Feedback loops are the P0 foundation — the first thing to build because everything else depends on information flowing back to where it's useful. Five distinct loops operate at different timescales.
10.1 The Five Loops
┌─────────────────────────────────────────────────────────────────────────┐
│ FEEDBACK LOOP MAP │
│ │
│ LOOP 1: INNER (ms) Within one agent iteration │
│ ┌──────────────────────────────────────────────┐ │
│ │ Tool call → Result → Reason → Adjust → Next │ ←── Tightest loop │
│ └──────────────────────────────────────────────┘ │
│ │
│ LOOP 2: PHASE (min) Between pipeline phases │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Implement → Review → Findings → Fix → Re-review │ │
│ │ Implement → Test → Failures → Fix → Re-test │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 3: RUN (min-hr) After a full pipeline completes │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Run completes → Reflect → Extract patterns → Store memory │ │
│ │ → Next run recalls memories → Better decisions │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 4: HUMAN (hr-days) Human feedback integration │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Agent suggests → Human dismisses → Confidence decreases │ │
│ │ Agent suggests → Human approves with edits → Learn preferences │ │
│ │ Agent misses issue → Human catches it → New pattern learned │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ LOOP 5: PRODUCTION (days) Deployed code feedback │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Deploy → Monitor → Anomaly detected → Correlate with change │ │
│ │ → Generate bug report → Feed into planning as new task │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
10.2 Loop 1: Inner Loop (already in agent design)
This is the perceive → reason → act → learn cycle inside every agent. Each tool call result feeds directly back into the next LLM reasoning step. No special infrastructure needed — it's the agent loop itself.
10.3 Loop 2: Phase Loop (the bounce-back)
When Review or Test finds issues, the pipeline doesn't just fail — it bounces back to the Implementer to fix things. This is the most important loop for code quality.
```typescript
// ─── In the orchestrator pipeline ─────────────────────────
interface PhaseLoopConfig {
  maxBounces: number; // How many review→fix→review cycles allowed
  phases: {
    // After review, if changes requested, bounce back to implement
    review: {
      onChangesRequested: 'implementation', // Go back to this phase
      maxBounces: 3,
    },
    // After test, if failures found and auto-fixable, bounce back
    testing: {
      onFailure: 'implementation',
      maxBounces: 2,
    },
  };
}

// How the orchestrator handles bounces:
async function runPipelineWithBounces(task: string, ctx: PipelineContext) {
  const plan = await runPhase('planning', { task }, ctx);
  let code = await runPhase('implementation', plan, ctx);

  // Review loop: implement → review → fix → re-review (max 3x)
  let review: ReviewResult | undefined;
  let reviewBounces = 0;
  while (reviewBounces < 3) {
    review = await runPhase('review', code, ctx);
    if (review.decision === 'approve') break;
    if (review.decision === 'require_human') {
      await ctx.gates.requestHumanReview(review);
      break;
    }

    // Bounce back: feed findings to implementer
    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'review',
      to: 'implementation',
      bounce: ++reviewBounces,
      findings: review.findings,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFindings: review.findings, // ← Feedback flows here
    }, ctx);
  }

  // Test loop: implement → test → fix → re-test (max 2x)
  let tests: TestResult | undefined;
  let testBounces = 0;
  while (testBounces < 2) {
    tests = await runPhase('testing', code, ctx);
    if (tests.summary.failed === 0) break;

    // Only auto-fix if failures are analyzable
    const fixable = tests.failures.filter(f => f.suggestedFix && f.confidence > 0.7);
    if (fixable.length === 0) {
      await ctx.gates.requestHumanHelp(tests.failures);
      break;
    }

    ctx.bus.emit({ type: 'loop.phase_bounce', payload: {
      from: 'testing',
      to: 'implementation',
      bounce: ++testBounces,
      failures: fixable,
    }});

    code = await runPhase('implementation', {
      ...plan,
      existingCode: code,
      fixFailures: fixable, // ← Feedback flows here
    }, ctx);
  }

  // Deploy only if all gates pass
  await runPhase('deployment', { code, review, tests }, ctx);
}
```
Key insight: The phase loop carries structured feedback — not just "it failed" but Finding[] and FailureAnalysis[] with specific file/line locations, root causes, and suggested fixes. This is what makes the fix cycle productive rather than a blind retry.
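A sketch of what that structured feedback could look like when handed back to the implementer (field names mirror the findings table and test output, but the exact shape is illustrative):

```typescript
interface BounceFeedback {
  // Review findings with precise locations and suggested fixes
  findings?: Array<{ file: string; line?: number; severity: string; message: string; fix?: string }>;
  // Analyzed test failures with root causes
  failures?: Array<{ test: string; rootCause: string; suggestedFix: string; confidence: number }>;
}
```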
10.4 Loop 3: Run Loop (post-run reflection)
After an entire pipeline run completes (or fails), the system reflects on the whole execution to extract durable learnings.
```typescript
// ─── Triggered automatically after every pipeline run ─────
interface RunReflection {
  // What happened in this run?
  summary: {
    task: string;
    outcome: 'success' | 'failure' | 'partial';
    phases: PhaseOutcome[];
    totalCost: number;
    totalDuration: number;
    bounces: { phase: string; count: number }[];
  };
  // What should we remember?
  learnings: Learning[];
}

interface Learning {
  type: 'episodic' | 'semantic' | 'procedural';
  content: string;    // Human-readable insight
  context: string;    // When is this relevant?
  confidence: number; // How sure are we?
  source: string;     // Which event triggered this?
}

async function reflectOnRun(traceId: string, ctx: Context): Promise<RunReflection> {
  // Replay all events from this run
  const events = await ctx.bus.replay(traceId);

  // Ask LLM to reflect
  const reflection = await ctx.llm.chat({
    system: RUN_REFLECTION_PROMPT,
    messages: [{
      role: 'user',
      content: `Here is the execution trace of a pipeline run.
Analyze what happened and extract learnings. Focus on:
- Patterns that could help future runs
- Mistakes to avoid
- Strategies that worked well
- Any surprises or anomalies

Events: ${JSON.stringify(summarizeEvents(events))}`,
    }],
  });

  // Store each learning in memory
  for (const learning of reflection.learnings) {
    await ctx.memory.store(learning);
  }

  return reflection;
}
```
Reflection triggers (not just at run completion):
| Trigger | When | What to Reflect On |
|---|---|---|
| Run completed | Every run | Full execution trace |
| Phase bounced 2+ times | During run | Why are fixes not sticking? |
| Cost exceeded 50% of budget | During run | Are we being inefficient? |
| Error rate > 10% in a phase | During run | What's going wrong? |
| Human overrode a decision | On human input | What did we get wrong? |
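A sketch of the mid-run trigger checks from the table above (field names on the state object are assumptions):

```typescript
function shouldReflectMidRun(s: {
  bouncesInPhase: number;
  costUsd: number;
  costBudgetUsd: number;
  phaseErrorRate: number;  // errors / tool calls in the current phase
  humanOverrode: boolean;
}): boolean {
  return (
    s.bouncesInPhase >= 2 ||                 // Fixes are not sticking
    s.costUsd > s.costBudgetUsd * 0.5 ||     // Burning budget too fast
    s.phaseErrorRate > 0.10 ||               // Something is going wrong
    s.humanOverrode                          // What did we get wrong?
  );
}
```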
10.5 Loop 4: Human Feedback Loop
This is how the system learns from human behavior — not just explicit feedback, but implicit signals too.
```typescript
// ─── Explicit human feedback ──────────────────────────────

// Human dismisses a review finding
async function onFindingDismissed(findingId: string, reason?: string) {
  const finding = await db.select().from(findings).where(eq(findings.id, findingId));

  // Mark as dismissed
  await db.update(findings)
    .set({ dismissed: true, dismissedBy: reason })
    .where(eq(findings.id, findingId));

  // Decrease confidence in the pattern that generated this finding
  const relatedPatterns = await memory.recall({
    context: `review finding: ${finding.category} in ${finding.file}`,
    type: 'semantic',
  });
  for (const pattern of relatedPatterns) {
    await memory.update(pattern.id, {
      confidence: pattern.confidence - 0.2, // Significant penalty
    });
  }

  // Learn from the dismissal
  await memory.store({
    type: 'semantic',
    content: `Finding "${finding.message}" was dismissed by human. ${reason || 'No reason given.'}`,
    context: `reviewing ${finding.category} issues in ${finding.file}`,
    confidence: 0.7, // Human-sourced = higher confidence
  });

  bus.emit({ type: 'feedback.human_dismissed', payload: { findingId, reason } });
}

// Human approves with modifications
async function onHumanApprovedWithEdits(gateId: string, edits: string) {
  await memory.store({
    type: 'procedural',
    content: `Human approved but made edits: ${edits}`,
    context: `gate ${gateId}`,
    confidence: 0.8,
  });
  bus.emit({ type: 'feedback.human_edited', payload: { gateId, edits } });
}

// ─── Implicit human signals ───────────────────────────────
// Track which suggestions humans actually apply vs ignore
interface ImplicitFeedback {
  // If human applies the suggested fix → boost confidence
  onSuggestedFixApplied(findingId: string): void;

  // If human rewrites the fix differently → learn their preference
  onSuggestedFixRewritten(findingId: string, humanVersion: string): void;

  // If human adds a comment the agent didn't catch → learn the gap
  onHumanAddedComment(pr: string, comment: string): void;

  // Time-to-dismiss: if dismissed within seconds, it was obviously wrong.
  // If dismissed after minutes, it was at least worth considering.
  onDismissalTiming(findingId: string, timeToDecisionMs: number): void;
}
```
10.6 Loop 5: Production Loop (post-MVP, but designed now)
After deployment, production metrics feed back into the system as new tasks or pattern updates.
```typescript
// ─── Post-MVP but the interface is designed now ───────────
interface ProductionFeedback {
  // Monitor detects anomaly correlated with recent deploy
  onAnomalyDetected(anomaly: {
    metric: string;        // "error_rate", "latency_p95"
    baseline: number;
    current: number;
    deploymentId: string;  // Which deploy caused this?
  }): Promise<void>;

  // Error report from production maps back to a code change
  onProductionError(error: {
    stack: string;
    frequency: number;
    firstSeen: Date;
    affectedUsers: number;
    relatedCommit?: string; // Git blame correlation
  }): Promise<void>;
}

// In MVP: these interfaces exist but the implementation is a no-op.
// Post-MVP: they connect to real monitoring and create new pipeline tasks.
```
10.7 Feedback Loop Metrics
How do we know the loops are working?
```typescript
interface FeedbackMetrics {
  // Inner loop health
  avgIterationsPerPhase: number;   // Trending down = agents getting smarter
  toolSuccessRate: number;         // Trending up = better tool selection

  // Phase loop health
  avgBouncesPerRun: number;        // Trending down = better first-pass quality
  bounceResolutionRate: number;    // % of bounces that fix the issue

  // Run loop health
  learningsPerRun: number;         // Are we extracting value?
  learningApplicationRate: number; // % of recalled memories that helped
  memoryPrecision: number;         // Recalled memories that were relevant

  // Human loop health
  findingDismissalRate: number;    // Trending down = fewer false positives
  humanOverrideRate: number;       // Trending down = better autonomous decisions
  timeToHumanResponse: number;     // How fast do humans respond to gates?

  // Cross-run improvement
  costPerRun: number;              // Trending down = efficiency improving
  successRateOverTime: number;     // Trending up = system is learning
  firstPassApprovalRate: number;   // % of reviews approved without bounces
}
```
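Because every signal is an event, most of these metrics fall out of the event trace. A sketch for two of them, derived from a single run's replayed events (event type strings match Sections 9 and 10.3; the helper itself is illustrative):

```typescript
function phaseLoopMetrics(events: ForgeEvent[]): { bounces: number; firstPassApproved: boolean } {
  const bounces = events.filter(e => e.type === 'loop.phase_bounce').length;
  const firstPassApproved = bounces === 0 && events.some(e => e.type === 'run.completed');
  return { bounces, firstPassApproved };
}
```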
10.8 Build Priority for Feedback Loops
This is what makes feedback loops the P0 foundation — they must exist before agents are useful:
Week 1 (build with core):
✓ Event bus (emit/subscribe) — enables all loops to capture signals
✓ Events table in SQLite — persist signals for later analysis
✓ Bus replay — reconstruct what happened in any run
Week 2 (build with memory):
✓ Run reflection — Loop 3 (post-run learning)
✓ Memory store + recall — the destination for all learnings
✓ Confidence scoring — weight learnings by source
Week 3 (build with reviewer):
✓ Phase bounce logic — Loop 2 (review→fix→re-review)
✓ Finding dismissal tracking — Loop 4 (human feedback)
✓ Dismissal → confidence decay — close the human loop
Week 4 (build with tester):
✓ Test failure → fix → retest bounce — Loop 2 extension
✓ Failure pattern memory — learn from repeated test failures
Week 6 (build with orchestrator):
✓ Full phase loop orchestration — Loop 2 with configurable bounces
✓ Reflection triggers (cost, error rate, bounces) — Loop 3 enrichment
✓ Feedback metrics dashboard — measure loop health
Post-MVP:
○ Production monitoring integration — Loop 5
○ Implicit human signal tracking — Loop 4 enrichment
○ Meta-reflection — reflect on reflection quality
11. Memory & Learning (Where Feedback Lands)
How memories flow through the system:
Execution Events
│
▼
┌─ CAPTURE ──────────────────┐
│ Every tool result, error, │
│ human decision, and test │
│ outcome is captured as an │
│ event. │
└──────────┬─────────────────┘
▼
┌─ REFLECT ──────────────────┐
│ After each agent completes: │
│ "What worked? What didn't? │
│ What should we remember?" │
│ │
│ LLM extracts learnings as │
│ structured memories. │
└──────────┬─────────────────┘
▼
┌─ STORE ────────────────────┐
│ Episodic: "PR #42 review │
│ missed a null check" │
│ Semantic: "Auth endpoints │
│ in this repo use JWT" │
│ Procedural: "When tests │
│ fail with mock errors, │
│ add clearAllMocks()" │
└──────────┬─────────────────┘
▼
┌─ CONSOLIDATE ──────────────┐ (runs periodically)
│ Merge similar memories │
│ Decay unused memories │
│ Promote high-frequency │
│ episodes to patterns │
│ Prune low-confidence entries │
└──────────┬─────────────────┘
▼
┌─ RECALL ───────────────────┐
│ When an agent starts: │
│ "What do I know about this │
│ kind of task?" │
│ │
│ Similarity search on │
│ context + tags returns │
│ relevant memories to inject │
│ into the agent's prompt. │
└────────────────────────────┘
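A minimal sketch of the RECALL step, assuming SQLite-stored embeddings and a manual cosine-similarity ranking weighted by confidence (helper names are illustrative):

```typescript
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function recall(query: string, memories: Memory[], llm: LLMProvider, limit = 10): Promise<Memory[]> {
  const q = await llm.embed(query);                 // Embed the agent's context query
  return memories
    .filter(m => m.embedding)                       // Only memories with embeddings
    .map(m => ({ m, score: cosine(q, m.embedding!) * m.confidence })) // Similarity × confidence
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(x => x.m);
}
```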
Confidence & Decay
Confidence starts at:
- 0.5 for LLM-extracted learnings (unvalidated)
- 0.7 for human-confirmed learnings
- 0.9 for learnings from production outcomes
Confidence changes:
- +0.1 each time the pattern is successfully applied
- +0.2 when a human confirms the learning
- -0.05 per week without access (decay)
- -0.2 when a human dismisses a suggestion based on it
Pruning:
- Memories below 0.2 confidence are archived
- Memories not accessed in 90 days are archived
- Conflicting memories: keep highest confidence
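The update rules above as a sketch, applied by the consolidation job (the signal names are assumptions; the deltas match the list):

```typescript
function updateConfidence(
  m: Memory,
  signal: 'applied_successfully' | 'human_confirmed' | 'human_dismissed' | { weeksIdle: number },
): number {
  let c = m.confidence;
  if (signal === 'applied_successfully') c += 0.1;
  else if (signal === 'human_confirmed') c += 0.2;
  else if (signal === 'human_dismissed') c -= 0.2;
  else c -= 0.05 * signal.weeksIdle;        // Time decay without access
  return Math.min(1, Math.max(0, c));       // Clamp to [0, 1]; archive below 0.2
}
```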
12. LLM Provider Abstraction
```typescript
// ─── llm.ts ───────────────────────────────────────────────
// Provider-agnostic. Swap Claude for OpenAI or local models.
interface LLMProvider {
  chat(request: ChatRequest): Promise<ChatResponse>;
  embed(text: string): Promise<Float32Array>;
}

interface ChatRequest {
  system: string;
  messages: Message[];
  tools?: ToolSchema[];
  temperature?: number;
  maxTokens?: number;
}

interface ChatResponse {
  content: string;
  toolCalls?: ToolCall[];
  done: boolean;
  result?: unknown;
  usage: { promptTokens: number; completionTokens: number };
  cost: number; // USD, calculated from model pricing
}

// Model selection by task complexity + remaining budget:
//
//   Planning / Architecture → Claude Sonnet (strong reasoning)
//   Implementation          → Claude Sonnet (code generation)
//   Review (AI layer)       → Claude Haiku  (fast, cheap, good enough)
//   Test analysis           → Claude Haiku
//   Reflection              → Claude Haiku
//   Embedding               → Local model or API (cheap, fast)
```
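The routing table above as a small sketch, with model ids taken from forge.config.ts (the task union and helper are illustrative):

```typescript
type LLMTask = 'planning' | 'implementation' | 'review' | 'test_analysis' | 'reflection';

function pickModel(task: LLMTask, cfg: { model: string; fastModel: string }): string {
  return task === 'planning' || task === 'implementation'
    ? cfg.model      // Strong reasoning / code generation
    : cfg.fastModel; // Fast, cheap: review layer, test analysis, reflection
}
```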
13. Build Order
The research roadmap says "start with feedback loops." I agree, but with a twist: build the skeleton first, then fill in the organs.
Week 1 ── Core Skeleton
├── types.ts (core abstractions from Section 3)
├── bus.ts (event bus)
├── config.ts (safety defaults from Section 8)
├── errors.ts (error taxonomy)
├── schema.ts (Drizzle schema from Section 5)
├── llm.ts (provider abstraction)
└── base.ts (base agent with loop + reflection)
Week 2 ── Memory + Tools
├── store.ts (memory CRUD)
├── episodes.ts + patterns.ts (episodic + semantic memory)
├── registry.ts (tool registry)
├── git.ts, runner.ts, linter.ts (essential tools)
└── First integration test: agent loop with real LLM
Week 3 ── Reviewer Agent (first vertical slice)
├── reviewer.ts (3-layer review: static → security → AI)
├── github.ts (PR integration)
├── Risk scoring
└── Findings persistence + dismissal learning
Week 4 ── Tester Agent
├── tester.ts (test selection + execution + analysis)
├── test-runner.ts (Jest/Vitest integration)
├── Failure analysis with LLM
└── Test gap detection
Week 5 ── Planner + Implementer Agents
├── planner.ts (requirements → plan → tasks)
├── implementer.ts (tasks → code)
├── Self-validation loop (typecheck + test after each change)
└── Feedback from reviewer/tester flows back
Week 6 ── Orchestrator
├── pipeline.ts (state machine: plan → implement → review → test → deploy)
├── checkpoint.ts (save/resume between phases)
├── context.ts (shared state across agents)
├── gates.ts (human approval integration)
└── End-to-end flow: "forge run" works
Week 7 ── CLI + Polish
├── CLI commands: run, review, test, status, history
├── Terminal UI (progress, findings display)
├── forge.config.ts (per-project configuration)
└── Consolidation job (memory pruning + pattern extraction)
Week 8 ── Harden + Document
├── Error recovery (retry, fallback, checkpoint resume)
├── Cost tracking dashboard
├── Real-world testing on actual projects
└── Edge case handling
14. Configuration
One config file per project. Sensible defaults, override what you need.
```typescript
// ─── forge.config.ts ──────────────────────────────────────
import { defineConfig } from 'forge';

export default defineConfig({
  // Project basics
  name: 'my-app',
  language: 'typescript',

  // LLM provider
  llm: {
    provider: 'anthropic',                  // 'anthropic' | 'openai' | 'ollama'
    model: 'claude-sonnet-4-5-20250929',
    fastModel: 'claude-haiku-4-5-20251001', // For cheap tasks
  },

  // Tools
  tools: {
    testCommand: 'bun test',
    lintCommand: 'bun run lint',
    buildCommand: 'bun run build',
    typecheckCommand: 'bun run typecheck',
  },

  // Safety (override defaults from Section 8)
  safety: {
    costPerRun: 50,     // USD max
    costPerDay: 200,
    automationLevel: 1, // 0-4, see Section 8.3
  },

  // GitHub integration
  github: {
    owner: 'myorg',
    repo: 'my-app',
    reviewOnPR: true,   // Auto-review new PRs
    postComments: true, // Post findings as PR comments
  },

  // Memory
  memory: {
    dbPath: '.forge/memory.db', // SQLite database location
    consolidateInterval: '1d',  // Run consolidation daily
    maxMemories: 10_000,        // Prune beyond this
  },
});
```
15. What This Design Explicitly Defers
These are real concerns acknowledged by the research but not in scope for MVP:
| Deferred | Why | When |
|---|---|---|
| Parallel agent execution | Sequential is simpler, prove it works first | Post-MVP |
| Kubernetes deployment | Single Bun process is fine for a tool | If scaling needed |
| Vector database | SQLite with manual similarity is enough to start | When memory > 100K entries |
| Multi-repo intelligence | Focus on single repo first | Q3+ |
| Autonomous deployment | Always require human approval for now | After Level 3 automation |
| Natural language requirements | Start with structured task descriptions | Q3+ |
| Real-time dashboards | CLI output + SQLite queries are enough | When team size > 1 |
| ClickHouse / Kafka | SQLite event table handles MVP observability | At scale |
This design synthesizes research from 13 topics across agentic loops, feedback mechanisms, code review, testing, CI/CD, orchestration, evaluation, self-improvement, reflection, human-AI collaboration, context management, tool integration, and error recovery.