Implementation Sub-Plan: Build Order (8-Week Detailed Breakdown)
Document: 13-build-order.md
Date: 2026-02-07
Status: Detailed implementation plan
Source: SYSTEM-DESIGN.md Section 13, 02-roadmap.md, 01-architecture.md
Overview
This document breaks down the 8-week build order from the system design into specific day-by-day tasks. Each day is scoped to 4-6 hours of focused work. The plan follows a "skeleton first, organs second" approach: build the foundational infrastructure early, then fill in specialized agent logic.
Core Philosophy:
- Week 1-2: Foundation (types, bus, memory, tools)
- Week 3-4: First vertical slice (Reviewer)
- Week 5: Second agent (Tester)
- Week 6: Complete agent set (Planner + Implementer)
- Week 7: Orchestration layer
- Week 8: Polish and harden
Week 1: Core Skeleton
Goal: Build the foundational types, event system, configuration, and database schema that everything else depends on. By end of week, the project compiles and has passing tests for all core modules.
Day 1: Project Initialization and Schema Design
Time Estimate: 5 hours Dependencies: None
Tasks:
- Initialize Bun project with TypeScript (`bun init`)
  - Configure `tsconfig.json` (strict mode, path aliases)
  - Add dependencies: `drizzle-orm`, `better-sqlite3`, `zod`, `ulid`
- Set up directory structure

```
forge/
├── src/
│   ├── core/
│   ├── safety/
│   ├── memory/
│   ├── tools/
│   ├── agents/
│   ├── orchestrator/
│   └── cli/
├── drizzle/
├── tests/
└── package.json
```

- Create Drizzle schema (`src/memory/schema.ts`)
  - Events table
  - Memories table
  - Patterns table
  - Checkpoints table
  - Runs table
  - Findings table
  - (Full schema from SYSTEM-DESIGN.md Section 5)
- Set up Drizzle migrations (`drizzle.config.ts`)
  - Generate initial migration
  - Create seed data script
Files to Create:
- `package.json`
- `tsconfig.json`
- `src/memory/schema.ts`
- `drizzle.config.ts`
- `drizzle/0000_initial.sql`
- `scripts/seed.ts`
Acceptance Criteria:
- ✓ `bun run typecheck` passes
- ✓ `bun run drizzle-kit migrate` creates database
- ✓ Seed script populates test data
Day 2: Core Types and Error Taxonomy
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Implement core abstractions (`src/core/types.ts`)
  - `Agent` interface
  - `ForgeEvent` interface
  - `Tool` interface
  - `Phase` interface
  - `Memory` interface
  - `Checkpoint` interface
  - All supporting types from SYSTEM-DESIGN.md Section 3
- Create error taxonomy (`src/core/errors.ts`)
  - Base `ForgeError` class
  - Specialized errors: `CircuitBreakerError`, `ConfigurationError`, `AgentError`, `ToolExecutionError`, `ValidationError`
  - Error classification helpers
  - Severity and recoverability enums
- Create Zod schemas for validation (`src/core/schemas.ts`)
  - Validation schemas for all core types
  - Runtime type guards
Files to Create:
- `src/core/types.ts`
- `src/core/errors.ts`
- `src/core/schemas.ts`
- `tests/core/types.test.ts`
- `tests/core/errors.test.ts`
Acceptance Criteria:
- ✓ All types export correctly
- ✓ Error classes have proper inheritance
- ✓ Zod schemas validate example data
- ✓ Test coverage > 80%
Day 3: Event Bus Implementation
Time Estimate: 5 hours Dependencies: Day 2 complete
Tasks:
- Implement in-memory event bus (`src/core/bus.ts`)
  - `EventBus` class with emit/on/off
  - Wildcard subscriptions
  - Event persistence to SQLite
  - Replay functionality
  - Subscription cleanup
- Add event tracing
  - Generate `traceId` (ulid)
  - Chain events with trace IDs
  - Event timestamps and ordering
- Create event helpers
  - Event builder pattern
  - Common event factories
  - Event filtering utilities
Files to Create:
- `src/core/bus.ts`
- `tests/core/bus.test.ts`
Acceptance Criteria:
- ✓ Can emit events
- ✓ Subscribers receive events
- ✓ Events persist to database
- ✓ Replay reconstructs event history
- ✓ Wildcard subscriptions work
- ✓ No memory leaks in subscriptions
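The emit/on/off surface with wildcard matching, replay, and subscription cleanup can be sketched as below. SQLite persistence is stubbed with an in-memory log, and the pattern syntax (`"review.*"`) is an assumption about how wildcards will look.

```typescript
// Minimal EventBus sketch; the in-memory `log` stands in for the SQLite
// events table, and the dotted wildcard syntax is illustrative.
type ForgeEvent = { type: string; traceId?: string; payload?: unknown };
type Handler = (e: ForgeEvent) => void;

class EventBus {
  private handlers = new Map<string, Set<Handler>>();
  private log: ForgeEvent[] = [];

  // Returns an unsubscribe function so callers can clean up easily.
  on(pattern: string, fn: Handler): () => void {
    if (!this.handlers.has(pattern)) this.handlers.set(pattern, new Set());
    this.handlers.get(pattern)!.add(fn);
    return () => this.off(pattern, fn);
  }

  off(pattern: string, fn: Handler): void {
    this.handlers.get(pattern)?.delete(fn);
  }

  emit(event: ForgeEvent): void {
    this.log.push(event); // persist before fan-out
    for (const [pattern, fns] of this.handlers) {
      if (this.matches(pattern, event.type)) fns.forEach((fn) => fn(event));
    }
  }

  // "review.*" matches "review.started"; "*" matches everything.
  private matches(pattern: string, type: string): boolean {
    if (pattern === "*" || pattern === type) return true;
    return pattern.endsWith(".*") && type.startsWith(pattern.slice(0, -1));
  }

  // Replay reconstructs event history in insertion order.
  replay(handler: Handler): void {
    this.log.forEach(handler);
  }
}
```

Returning the unsubscribe closure from `on` is one way to satisfy the "no memory leaks in subscriptions" criterion: callers hold a handle they must invoke, which tests can verify.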
Day 4: Configuration and Safety Defaults
Time Estimate: 4 hours Dependencies: Day 2 complete
Tasks:
- Implement configuration system (`src/core/config.ts`)
  - Default configuration from SYSTEM-DESIGN.md Section 8
  - Per-project config loading (`forge.config.ts`)
  - Environment variable overrides
  - Config validation with Zod
  - Config merge strategy
- Create safety control structures (`src/safety/breakers.ts`)
  - `CircuitBreaker` class
  - Iteration counter breaker
  - Cost tracker breaker
  - Time limit breaker
  - Error rate breaker
  - Breaker state machine (closed/open/half-open)
- Implement safety budget (`src/safety/budget.ts`)
  - Cost tracking per phase
  - Cost tracking per run
  - Budget exhaustion handling
Files to Create:
- `src/core/config.ts`
- `src/safety/breakers.ts`
- `src/safety/budget.ts`
- `forge.config.example.ts`
- `tests/core/config.test.ts`
- `tests/safety/breakers.test.ts`
Acceptance Criteria:
- ✓ Default config loads
- ✓ Can override with project config
- ✓ Circuit breakers trip at thresholds
- ✓ Budget tracking is accurate
Day 5: LLM Provider Abstraction
Time Estimate: 6 hours Dependencies: Day 2, Day 4 complete
Tasks:
- Create LLM provider interface (`src/tools/llm.ts`)
  - `LLMProvider` interface
  - `ChatRequest` / `ChatResponse` types
  - Token counting utilities
  - Cost calculation per model
- Implement Anthropic provider
  - Claude API integration
  - Streaming support
  - Tool use protocol
  - Error handling and retry
  - Rate limiting
- Add prompt management
  - System prompt templates
  - Message formatting
  - Token limit enforcement
- Create LLM cost tracker
  - Track tokens per call
  - Calculate USD cost
  - Integrate with budget system
Files to Create:
- `src/tools/llm.ts`
- `src/tools/llm-anthropic.ts`
- `src/tools/prompts.ts`
- `tests/tools/llm.test.ts` (with mocks)
Acceptance Criteria:
- ✓ Can make chat completions
- ✓ Tool use protocol works
- ✓ Cost tracking is accurate
- ✓ Retries on transient errors
- ✓ Integration test with real API passes
Week 1 Deliverable:
- Project skeleton that compiles
- Core types defined
- Event bus working
- Config and safety defaults in place
- LLM abstraction ready
- All unit tests passing
Week 2: Memory + Tools
Goal: Build the memory system (store, recall, consolidation) and essential tool layer (git, runner, linter). By end of week, an agent loop can execute against a real LLM, call tools, and store memories.
Day 1: Memory Store CRUD
Time Estimate: 5 hours Dependencies: Week 1 complete
Tasks:
- Implement memory store (`src/memory/store.ts`)
  - CRUD operations for memories
  - Query by type (episodic/semantic/procedural)
  - Query by confidence threshold
  - Query by recency
  - Update confidence scores
  - Archive low-confidence memories
- Add memory indexing
  - Tag-based search
  - Context string matching
  - Access tracking (lastAccessed, accessCount)
- Implement confidence decay
  - Weekly decay formula: `confidence -= 0.05`
  - Reinforcement on access: `confidence += 0.1`
  - Pruning threshold: `confidence < 0.2`
Files to Create:
- `src/memory/store.ts`
- `tests/memory/store.test.ts`
Acceptance Criteria:
- ✓ Can store and retrieve memories
- ✓ Confidence decay works correctly
- ✓ Can query by multiple criteria
- ✓ Access tracking increments
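The decay, reinforcement, and pruning rules above are simple enough to express as pure functions; the clamp to the [0, 1] range is an assumption, while the constants come directly from the plan.

```typescript
// Confidence lifecycle sketch using the plan's constants.
const DECAY = 0.05;      // subtracted per week
const REINFORCE = 0.1;   // added on each access
const PRUNE_BELOW = 0.2; // archive threshold

// Assumption: confidence is kept in [0, 1].
const clamp = (x: number) => Math.min(1, Math.max(0, x));

function decayWeekly(confidence: number, weeks: number): number {
  return clamp(confidence - DECAY * weeks);
}

function reinforce(confidence: number): number {
  return clamp(confidence + REINFORCE);
}

function shouldPrune(confidence: number): boolean {
  return confidence < PRUNE_BELOW;
}
```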
Day 2: Similarity Search and Embeddings
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Add embedding support to memory store
  - Store embeddings as `Float32Array` in blob column
  - Generate embeddings via LLM provider
  - Cosine similarity calculation
- Implement similarity search
  - Embed query string
  - Calculate similarity scores
  - Return top-k memories
  - Cache embeddings to reduce API calls
- Create memory recall function
  - `recall(context, type, limit)` → `Memory[]`
  - Combine similarity + recency + confidence
  - Ranking algorithm
Files to Create:
- `src/memory/similarity.ts`
- `tests/memory/similarity.test.ts`
Acceptance Criteria:
- ✓ Embeddings stored correctly
- ✓ Similarity search returns relevant memories
- ✓ Recall function uses multi-factor ranking
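A minimal sketch of the cosine similarity over `Float32Array` embeddings and the multi-factor recall ranking. The 0.6/0.2/0.2 weights are an assumption standing in for whatever the final ranking algorithm chooses.

```typescript
// Cosine similarity for embeddings stored as Float32Array blobs.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface ScoredMemory {
  similarity: number; // cosine vs. the query embedding
  recency: number;    // 1 = just accessed, 0 = very old
  confidence: number; // current confidence score
}

// Assumed weights: similarity dominates, recency and confidence break ties.
function recallScore(m: ScoredMemory): number {
  return 0.6 * m.similarity + 0.2 * m.recency + 0.2 * m.confidence;
}
```

`recall(context, type, limit)` would then embed the context string, score candidate memories with `recallScore`, and return the top `limit` results.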
Day 3: Episodic and Semantic Memory
Time Estimate: 5 hours Dependencies: Day 1-2 complete
Tasks:
- Implement episodic memory (`src/memory/episodes.ts`)
  - Store execution events as episodic memories
  - Link to trace ID
  - Query episodes by time range
  - Query episodes by outcome (success/failure)
- Implement semantic memory (`src/memory/patterns.ts`)
  - Extract patterns from episodic memories
  - Pattern frequency tracking
  - Success rate calculation
  - Pattern trigger matching
- Create pattern extraction logic
  - LLM-based pattern extraction from events
  - Pattern deduplication
  - Pattern confidence scoring
Files to Create:
- `src/memory/episodes.ts`
- `src/memory/patterns.ts`
- `tests/memory/episodes.test.ts`
- `tests/memory/patterns.test.ts`
Acceptance Criteria:
- ✓ Episodes stored with trace IDs
- ✓ Patterns extracted from episodes
- ✓ Pattern matching works
Day 4: Tool Registry and Core Tools
Time Estimate: 6 hours Dependencies: Week 1 complete
Tasks:
- Create tool registry (`src/tools/registry.ts`)
  - Tool registration
  - Tool discovery
  - Tool execution with validation
  - Tool sandboxing (execution timeout, resource limits)
- Implement git tool (`src/tools/git.ts`)
  - `git status`, `git diff`, `git add`, `git commit`, `git push`
  - Parse git output
- Implement shell runner tool (`src/tools/runner.ts`)
  - Execute shell commands
  - Capture stdout/stderr
  - Timeout enforcement
  - Working directory management
- Implement linter tool (`src/tools/linter.ts`)
  - Run ESLint or Biome
  - Parse linter output
  - Categorize findings
Files to Create:
- `src/tools/registry.ts`
- `src/tools/git.ts`
- `src/tools/runner.ts`
- `src/tools/linter.ts`
- `tests/tools/registry.test.ts`
- `tests/tools/git.test.ts`
- `tests/tools/runner.test.ts`
Acceptance Criteria:
- ✓ Tools can be registered
- ✓ Tool input/output validated
- ✓ Git operations work
- ✓ Shell commands execute with timeout
- ✓ Linter integration works
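The registry's "execution with validation" and timeout sandboxing can be sketched as follows. Zod validation is stood in for by a plain `validate` function, and the `Promise.race` timeout is one possible sandboxing mechanism, not necessarily the design's.

```typescript
// Tool registry sketch: register by name, validate input before
// execution, and race execution against a timeout.
interface Tool<I, O> {
  name: string;
  validate(input: unknown): I; // throws on bad input (Zod parse in practice)
  execute(input: I): Promise<O>;
}

class ToolRegistry {
  private tools = new Map<string, Tool<any, any>>();

  register(tool: Tool<any, any>): void {
    this.tools.set(tool.name, tool);
  }

  async execute(name: string, input: unknown, timeoutMs = 30_000): Promise<unknown> {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    const parsed = tool.validate(input); // reject bad input before running
    const timeout = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`${name} timed out`)), timeoutMs)
    );
    return Promise.race([tool.execute(parsed), timeout]);
  }
}
```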
Day 5: Base Agent with Loop + First Integration Test
Time Estimate: 6 hours Dependencies: Week 2 Days 1-4 complete
Tasks:
- Implement base agent (`src/agents/base.ts`)
  - Agent loop: perceive → reason → act → learn
  - Integration with LLM provider
  - Integration with tool registry
  - Integration with memory store
  - Circuit breaker integration
  - Reflection after execution
- Create working memory structure
  - Message history
  - Tool results
  - Iteration tracking
- Implement reflection mechanism
  - Post-execution reflection prompt
  - Extract learnings
  - Store learnings in memory
- Write first integration test
  - Create simple agent that uses a tool
  - Execute against real LLM
  - Verify tool execution
  - Verify memory storage
Files to Create:
- `src/agents/base.ts`
- `tests/agents/base.integration.test.ts`
Acceptance Criteria:
- ✓ Agent loop executes
- ✓ LLM decides which tool to use
- ✓ Tool executes successfully
- ✓ Memory stores execution
- ✓ Circuit breakers work
- ✓ Reflection extracts learnings
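The perceive → reason → act → learn loop with an iteration breaker can be sketched with the LLM, tools, and memory abstracted behind callbacks. All names here (`reason`, `act`, `learn`, `runAgentLoop`) are illustrative, not the `BaseAgent` API.

```typescript
// Agent loop skeleton: the LLM decides (reason), a tool runs (act),
// results join working memory, and reflection runs at the end (learn).
interface AgentDeps {
  reason(history: string[]): Promise<{ tool?: string; done: boolean }>;
  act(tool: string): Promise<string>;
  learn(history: string[]): Promise<void>; // reflection + memory write
}

async function runAgentLoop(
  task: string,
  deps: AgentDeps,
  maxIterations = 10
): Promise<string[]> {
  const history: string[] = [task]; // working memory: messages + tool results
  for (let i = 0; i < maxIterations; i++) {
    const decision = await deps.reason(history);
    if (decision.done) {
      await deps.learn(history); // reflect before returning
      return history;
    }
    if (decision.tool) history.push(await deps.act(decision.tool));
  }
  throw new Error("iteration breaker tripped"); // circuit breaker integration
}
```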
Week 2 Deliverable:
- Memory system fully functional
- Tool layer operational
- Base agent can execute with real LLM
- First integration test passing
- Feedback loop foundation in place
Week 3: Reviewer Agent (First Vertical Slice)
Goal: Build a complete vertical slice: the Reviewer agent that reviews code changes, posts findings, learns from dismissals. This proves the full stack works end-to-end.
Day 1: Reviewer Agent Skeleton
Time Estimate: 5 hours Dependencies: Week 2 complete
Tasks:
- Create reviewer agent (`src/agents/reviewer.ts`)
  - Extend `BaseAgent`
  - Define agent type: `reviewer`
  - Three-layer review structure:
    - Layer 1: Static analysis
    - Layer 2: Security scan
    - Layer 3: AI review
- Define reviewer input/output types
  - `ReviewInput`: code changes, context
  - `ReviewOutput`: findings, risk score, decision
- Implement risk scoring algorithm
  - Complexity component
  - Change size component
  - Criticality component
  - Calculate overall risk level (low/medium/high/critical)
- Define review decision logic
  - Approve (low risk, no critical findings)
  - Request changes (fixable issues)
  - Require human (high risk or critical findings)
Files to Create:
- `src/agents/reviewer.ts`
- `src/agents/reviewer-types.ts`
- `tests/agents/reviewer.test.ts`
Acceptance Criteria:
- ✓ Reviewer extends BaseAgent
- ✓ Risk scoring formula implemented
- ✓ Decision logic implemented
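One way to combine the three risk components into the four-band risk level. The weights and band thresholds below are assumptions for illustration; the design's exact formula lives in SYSTEM-DESIGN.md.

```typescript
// Risk scoring sketch: weighted sum of normalized components, then
// banded into the low/medium/high/critical levels from the plan.
type RiskLevel = "low" | "medium" | "high" | "critical";

interface RiskInput {
  complexity: number;  // 0..1, e.g. normalized cyclomatic complexity
  changeSize: number;  // 0..1, e.g. lines changed / 500, capped at 1
  criticality: number; // 0..1, e.g. touches auth/payments => 1
}

// Assumed weights; criticality counts slightly more than the others.
function riskScore(r: RiskInput): number {
  return 0.3 * r.complexity + 0.3 * r.changeSize + 0.4 * r.criticality;
}

// Assumed band thresholds.
function riskLevel(score: number): RiskLevel {
  if (score >= 0.8) return "critical";
  if (score >= 0.6) return "high";
  if (score >= 0.3) return "medium";
  return "low";
}
```

The same score then drives decision logic: low maps toward approve, high and critical toward requiring a human.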
Day 2: Static Analysis Integration
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Create static analysis layer
  - Integrate linter tool (ESLint/Biome)
  - Run TypeScript strict check
  - Run formatting check (Prettier/Biome)
  - Parse and categorize findings
- Map linter output to `Finding` type
  - Extract file, line, column
  - Map severity
  - Categorize (style/correctness/etc.)
  - Determine fixability
- Deduplicate findings
  - Hash-based deduplication
  - Merge similar findings
Files to Create:
- `src/agents/reviewer-static.ts`
- `tests/agents/reviewer-static.test.ts`
Acceptance Criteria:
- ✓ Static analysis runs on code changes
- ✓ Findings extracted correctly
- ✓ Deduplication works
Day 3: AI Review Layer
Time Estimate: 6 hours Dependencies: Day 1-2 complete
Tasks:
- Create AI review layer
  - Construct review prompt with code diff
  - Include relevant patterns from memory
  - Request LLM to review for:
    - Logic correctness
    - Edge cases
    - Performance implications
    - Architecture fit
- Parse LLM review output
  - Extract findings from structured output
  - Map to `Finding` type
  - Confidence scoring
- Implement risk-based review depth
  - Low risk: skip AI review (static only)
  - Medium risk: AI review with fast model
  - High risk: AI review with strong model
  - Critical risk: AI review + human required
Files to Create:
- `src/agents/reviewer-ai.ts`
- `src/agents/reviewer-prompts.ts`
- `tests/agents/reviewer-ai.test.ts`
Acceptance Criteria:
- ✓ AI review runs for medium+ risk
- ✓ Findings extracted from LLM
- ✓ Risk-based depth selection works
Day 4: GitHub Integration
Time Estimate: 5 hours Dependencies: Day 1-3 complete
Tasks:
- Create GitHub tool (`src/tools/github.ts`)
  - Fetch PR details
  - Fetch PR diff
  - Post review comments
  - Update PR status check
  - Dismiss comments
  - React to comment events
- Integrate with reviewer agent
  - Fetch PR on review request
  - Post findings as PR comments
  - Update check status based on decision
  - Group findings by file
- Format review comments
  - Markdown formatting
  - Code snippets
  - Severity badges
  - Fix suggestions
Files to Create:
- `src/tools/github.ts`
- `src/agents/reviewer-github.ts`
- `tests/tools/github.test.ts` (mocked)
Acceptance Criteria:
- ✓ Can fetch PR details
- ✓ Can post review comments
- ✓ Comments formatted nicely
- ✓ Status checks update
Day 5: Risk Scoring, Finding Persistence, Phase Bounce
Time Estimate: 5 hours Dependencies: Day 1-4 complete
Tasks:
- Persist findings to database
  - Store findings in `findings` table
  - Link to run ID
  - Track dismissals
- Implement finding dismissal learning
  - Track when humans dismiss findings
  - Decrease confidence in related patterns
  - Store dismissal reason as learning
- Implement phase bounce logic (review → implement → re-review)
  - Detect "changes requested" decision
  - Package findings for implementer
  - Track bounce count
  - Max bounces: 3
- Create review metrics
  - False positive rate (dismissals / total)
  - Review duration
  - Findings per review
Files to Create:
- `src/agents/reviewer-persistence.ts`
- `src/agents/reviewer-learning.ts`
- `tests/agents/reviewer-learning.test.ts`
Acceptance Criteria:
- ✓ Findings persist to database
- ✓ Dismissals decrease pattern confidence
- ✓ Bounce logic works
- ✓ Metrics tracked
Week 3 Deliverable:
- Complete Reviewer agent operational
- Three-layer review working
- GitHub PR integration functional
- Findings persisted and learned from
- Phase bounce logic implemented
Week 4: Tester Agent
Goal: Build the Tester agent that selects tests, executes them, analyzes failures, and suggests fixes. Integrate with phase bounce (test → implement → retest).
Day 1: Tester Agent Skeleton + Test Selection
Time Estimate: 5 hours Dependencies: Week 3 complete
Tasks:
- Create tester agent (`src/agents/tester.ts`)
  - Extend `BaseAgent`
  - Define agent type: `tester`
- Define tester input/output types
  - `TestInput`: code changes, existing test suite
  - `TestOutput`: test results, failures, coverage, generated tests
- Implement risk-based test selection
  - Changed files → unit tests covering those files
  - Medium+ risk → integration tests
  - High+ risk → full suite
  - Parse test files to build dependency graph
- Create test selector algorithm
  - Static analysis to find test files for modules
  - Impact analysis for integration tests
  - Prioritize by risk and recency
Files to Create:
- `src/agents/tester.ts`
- `src/agents/tester-selection.ts`
- `tests/agents/tester.test.ts`
Acceptance Criteria:
- ✓ Tester extends BaseAgent
- ✓ Test selection based on changes works
- ✓ Risk-based selection works
Day 2: Test Runner Integration
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Create test runner tool (`src/tools/test-runner.ts`)
  - Execute Jest/Vitest via shell
  - Parse JSON output
  - Capture stdout/stderr
  - Timeout enforcement
- Parse test results
  - Extract passed/failed/skipped counts
  - Extract failure messages
  - Extract stack traces
  - Extract coverage data
- Implement retry logic for flaky tests
  - Retry failed tests once
  - Mark as flaky if inconsistent
  - Track flakiness over time
Files to Create:
- `src/tools/test-runner.ts`
- `tests/tools/test-runner.test.ts`
Acceptance Criteria:
- ✓ Can execute tests
- ✓ Results parsed correctly
- ✓ Retry logic works
- ✓ Flaky tests detected
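The single-retry flaky-test policy reduces to a small function: a failure that passes on retry is inconsistent, so it is marked flaky rather than failed. The `run` callback is an illustrative stand-in for one test execution.

```typescript
// Flaky-test policy sketch: retry a failed test once; an inconsistent
// result (fail then pass) means flaky, two failures mean failed.
type TestVerdict = "passed" | "failed" | "flaky";

async function runWithRetry(run: () => Promise<boolean>): Promise<TestVerdict> {
  if (await run()) return "passed";
  return (await run()) ? "flaky" : "failed";
}
```

Tracking flakiness over time would then increment a per-test counter whenever the verdict is `"flaky"`.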
Day 3: Failure Analysis
Time Estimate: 6 hours Dependencies: Day 1-2 complete
Tasks:
- Implement failure analyzer (`src/agents/tester-analysis.ts`)
  - Classify failure type:
    - Real bug
    - Flaky test
    - Environment issue
    - Outdated snapshot
  - Use LLM for root cause analysis
  - Extract relevant code context
- Create failure analysis prompt
  - Include test code
  - Include failure message
  - Include relevant source code
  - Request root cause and fix suggestion
- Confidence scoring for suggested fixes
  - High confidence (>0.7) → auto-fixable
  - Low confidence → escalate to human
Files to Create:
- `src/agents/tester-analysis.ts`
- `src/agents/tester-prompts.ts`
- `tests/agents/tester-analysis.test.ts`
Acceptance Criteria:
- ✓ Failures classified correctly
- ✓ LLM provides root cause analysis
- ✓ Fix suggestions generated
- ✓ Confidence scores assigned
Day 4: Test Gap Detection + Generation
Time Estimate: 5 hours Dependencies: Day 1-3 complete
Tasks:
- Implement test gap detection
  - Parse coverage report
  - Identify uncovered lines in changed files
  - Identify uncovered branches
  - Identify functions without tests
- Create test generator
  - Use LLM to generate test cases
  - Input: function signature, implementation, examples
  - Output: test code
  - Validate generated tests (parse, typecheck)
- Format generated tests
  - Match existing test style
  - Follow project conventions
  - Add descriptive test names
Files to Create:
- `src/agents/tester-gaps.ts`
- `src/agents/tester-generator.ts`
- `tests/agents/tester-generator.test.ts`
Acceptance Criteria:
- ✓ Uncovered code detected
- ✓ Tests generated for gaps
- ✓ Generated tests are valid
Day 5: Phase Bounce Integration + Metrics
Time Estimate: 5 hours Dependencies: Day 1-4 complete
Tasks:
- Implement test → implement → retest bounce
  - Detect test failures
  - Package failure analysis for implementer
  - Track bounce count
  - Max bounces: 2
- Create test metrics
  - Test duration
  - Pass rate over time
  - Flaky test rate
  - Coverage delta (before/after)
  - Test gap count
- Integrate with memory system
  - Store failure patterns
  - Store successful fixes
  - Recall similar failures from memory
Files to Create:
- `src/agents/tester-bounce.ts`
- `src/agents/tester-metrics.ts`
- `tests/agents/tester-bounce.test.ts`
Acceptance Criteria:
- ✓ Test failures bounce to implementer
- ✓ Bounce count tracked
- ✓ Metrics tracked
- ✓ Failure patterns stored
Week 4 Deliverable:
- Complete Tester agent operational
- Test selection and execution working
- Failure analysis with suggested fixes
- Test generation for gaps
- Phase bounce (test → fix → retest) working
Week 5: Planner + Implementer Agents
Goal: Build the Planner agent (requirements → plan) and Implementer agent (plan → code). Integrate self-validation loop (write → typecheck → test → fix). End-to-end flow works: plan → implement → review → test.
Day 1: Planner Agent - Requirements Analysis
Time Estimate: 5 hours Dependencies: Week 4 complete
Tasks:
- Create planner agent (`src/agents/planner.ts`)
  - Extend `BaseAgent`
  - Define agent type: `planner`
- Define planner input/output types
  - `PlanInput`: requirements (natural language)
  - `PlanOutput`: analysis, architecture, tasks, risk assessment
- Implement requirements analysis
  - Use LLM to parse requirements
  - Extract acceptance criteria
  - Identify constraints
  - Decompose into stories
- Create planning prompts
  - Requirements analysis prompt
  - Include codebase context (file structure)
  - Include relevant patterns from memory
Files to Create:
- `src/agents/planner.ts`
- `src/agents/planner-analysis.ts`
- `src/agents/planner-prompts.ts`
- `tests/agents/planner.test.ts`
Acceptance Criteria:
- ✓ Requirements parsed into structured format
- ✓ Stories extracted
- ✓ Constraints identified
Day 2: Planner Agent - Architecture Design + Task Decomposition
Time Estimate: 6 hours Dependencies: Day 1 complete
Tasks:
- Implement architecture design
  - Use LLM to design system architecture
  - Identify components
  - Define interfaces
  - Document architecture decisions
- Implement task decomposition
  - Break architecture into implementation tasks
  - Order tasks by dependencies
  - Estimate complexity per task
  - Assign priorities
- Implement risk assessment
  - Calculate risk score (same formula as reviewer)
  - Determine review depth requirement
  - Determine test coverage requirement
- Create implementation plan
  - Combine architecture + tasks + risk
  - Package for implementer
Files to Create:
- `src/agents/planner-design.ts`
- `src/agents/planner-tasks.ts`
- `src/agents/planner-risk.ts`
- `tests/agents/planner-design.test.ts`
Acceptance Criteria:
- ✓ Architecture designed
- ✓ Tasks decomposed and ordered
- ✓ Risk assessed
- ✓ Plan packaged for implementer
Day 3: Implementer Agent - Code Generation
Time Estimate: 6 hours Dependencies: Day 1-2 complete
Tasks:
- Create implementer agent (`src/agents/implementer.ts`)
  - Extend `BaseAgent`
  - Define agent type: `implementer`
- Define implementer input/output types
  - `ImplementInput`: plan, task
  - `ImplementOutput`: code changes, tests added, validated
- Implement code generation
  - Use LLM to generate code from task spec
  - Include architecture context
  - Include existing code patterns
  - Generate tests alongside code
- Create file operations
  - Read existing files
  - Write new files
  - Modify existing files (diff-based edits)
Files to Create:
- `src/agents/implementer.ts`
- `src/agents/implementer-generation.ts`
- `src/agents/implementer-prompts.ts`
- `tests/agents/implementer.test.ts`
Acceptance Criteria:
- ✓ Code generated from task spec
- ✓ Tests generated alongside code
- ✓ Files written to disk
Day 4: Implementer Agent - Self-Validation Loop
Time Estimate: 6 hours Dependencies: Day 1-3 complete
Tasks:
- Implement self-validation loop
  - After code generation:
    - Run typecheck
    - Run linter
    - Run affected tests
  - If issues found:
    - Analyze issues
    - Fix issues
    - Loop back (max 3 iterations)
- Create validation tools
  - Typecheck runner (`tsc --noEmit`)
  - Linter runner (reuse from Week 3)
  - Test runner (reuse from Week 4)
- Implement fix logic
  - Parse validation errors
  - Use LLM to suggest fixes
  - Apply fixes
  - Re-validate
- Track validation metrics
  - Iterations to valid code
  - Issues fixed per iteration
  - Final validation status
Files to Create:
- `src/agents/implementer-validation.ts`
- `src/agents/implementer-fix.ts`
- `tests/agents/implementer-validation.test.ts`
Acceptance Criteria:
- ✓ Typecheck runs after code generation
- ✓ Linter runs after code generation
- ✓ Tests run after code generation
- ✓ Issues auto-fixed (up to max iterations)
- ✓ Final code typechecks and passes tests
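The write → validate → fix loop with its max-3-iteration bound can be sketched with the validators and fixer injected as callbacks; the `Validation` shape and `selfValidate` name are illustrative, not the module's API.

```typescript
// Self-validation loop sketch: run every validator, and if any issues
// remain, ask the fixer to apply fixes and re-validate (max 3 rounds).
interface Validation {
  name: string;             // e.g. "typecheck", "lint", "test"
  run(): Promise<string[]>; // list of issues; empty = clean
}

async function selfValidate(
  validators: Validation[],
  fix: (issues: string[]) => Promise<void>,
  maxIterations = 3
): Promise<{ valid: boolean; iterations: number }> {
  for (let i = 1; i <= maxIterations; i++) {
    const issues: string[] = [];
    for (const v of validators) issues.push(...(await v.run()));
    if (issues.length === 0) return { valid: true, iterations: i };
    await fix(issues); // LLM-suggested fixes applied, then loop back
  }
  return { valid: false, iterations: maxIterations };
}
```

The returned `iterations` count feeds the "iterations to valid code" metric directly.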
Day 5: End-to-End Integration Test
Time Estimate: 5 hours Dependencies: Day 1-4 complete
Tasks:
- Create integration test for full flow
  - Input: simple feature requirement
  - Plan → Implement → Review → Test
  - Verify each phase output
  - Verify phase transitions
  - Verify bounce logic works
- Test phase bounces
  - Review finds issue → bounce to implement → fix → re-review
  - Test fails → bounce to implement → fix → re-test
- Verify memory integration
  - Learnings stored at each phase
  - Patterns extracted after run
  - Memories recalled in next run
- Create test fixtures
  - Sample requirements
  - Sample codebase
  - Expected outputs
Files to Create:
- `tests/integration/full-flow.test.ts`
- `tests/fixtures/sample-requirements.ts`
- `tests/fixtures/sample-codebase/`
Acceptance Criteria:
- ✓ Full flow executes without errors
- ✓ Phase bounces work
- ✓ Memories stored and recalled
- ✓ Final code is valid and tested
Week 5 Deliverable:
- Planner agent functional (requirements → plan)
- Implementer agent functional (plan → code)
- Self-validation loop working
- End-to-end flow tested (plan → implement → review → test)
- Phase bounces working across all agents
Week 6: Orchestrator
Goal: Build the orchestration layer that coordinates agents through the pipeline. State machine, checkpoints, shared context, human gates. "forge run" executes the complete pipeline.
Day 1: Pipeline State Machine + Phase Sequencing
Time Estimate: 6 hours Dependencies: Week 5 complete
Tasks:
- Create pipeline orchestrator (`src/orchestrator/pipeline.ts`)
  - State machine: idle → planning → implementing → reviewing → testing → deploying → monitoring → completed
  - Phase transitions
  - Phase input/output validation
  - Phase execution logic
- Define phase configuration
  - Each phase has:
    - Agent assignment
    - Guards (pre-conditions)
    - Gates (human approval checkpoints)
    - Breakers (circuit breakers)
    - Next phase
- Implement phase execution
  - Load phase config
  - Check guards
  - Execute agent
  - Check breakers
  - Check gates
  - Transition to next phase
- Handle phase failures
  - Capture errors
  - Determine recoverability
  - Retry logic (if transient)
  - Checkpoint before retry
Files to Create:
- `src/orchestrator/pipeline.ts`
- `src/orchestrator/phases.ts`
- `tests/orchestrator/pipeline.test.ts`
Acceptance Criteria:
- ✓ State machine transitions correctly
- ✓ Phases execute in sequence
- ✓ Phase failures handled
- ✓ Guards and breakers enforced
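The phase sequence above can be sketched as a transition table, with the review and test bounces modeled as explicit extra edges. The table shape is illustrative; the real phase config also carries guards, gates, and breakers.

```typescript
// Pipeline state machine sketch: legal next-phases per phase, with
// bounce edges back to "implementing" from reviewing and testing.
type PhaseName =
  | "idle" | "planning" | "implementing" | "reviewing"
  | "testing" | "deploying" | "monitoring" | "completed";

const TRANSITIONS: Record<PhaseName, PhaseName[]> = {
  idle: ["planning"],
  planning: ["implementing"],
  implementing: ["reviewing"],
  reviewing: ["testing", "implementing"], // bounce on "changes requested"
  testing: ["deploying", "implementing"], // bounce on test failures
  deploying: ["monitoring"],
  monitoring: ["completed"],
  completed: [],
};

function transition(from: PhaseName, to: PhaseName): PhaseName {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```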
Day 2: Checkpoint System
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Create checkpoint system (`src/orchestrator/checkpoint.ts`)
  - Create checkpoint after each phase
  - Store phase state to database
  - List checkpoints for a run
  - Restore from checkpoint
  - Compare checkpoints (diff)
- Define checkpoint schema
  - Checkpoint ID (ulid)
  - Trace ID (links to run)
  - Phase name
  - State (JSON serialized)
  - Timestamp
- Implement state serialization
  - Serialize complex types (Dates, Maps, etc.)
  - Deserialize on restore
  - Validate restored state
- Integrate with pipeline
  - Auto-checkpoint after each phase
  - Resume from last checkpoint on failure
  - Clean up old checkpoints (retention policy)
Files to Create:
- `src/orchestrator/checkpoint.ts`
- `tests/orchestrator/checkpoint.test.ts`
Acceptance Criteria:
- ✓ Checkpoints created after each phase
- ✓ State serialized correctly
- ✓ Can restore from checkpoint
- ✓ Pipeline resumes from checkpoint
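Dates and Maps are the tricky part of "State (JSON serialized)": neither round-trips through plain `JSON.stringify`. One approach is to tag them on the way out and revive them on the way in. The `__type` tagging convention here is an assumption for illustration.

```typescript
// Checkpoint state serialization sketch. Note: in a stringify replacer
// a Date has already been converted to an ISO string, so the raw value
// must be read off `this[key]` to detect it.
function serializeState(state: unknown): string {
  return JSON.stringify(state, function (this: any, key: string, value: any) {
    const raw = this[key]; // pre-toJSON value
    if (raw instanceof Date) return { __type: "Date", iso: raw.toISOString() };
    if (value instanceof Map) return { __type: "Map", entries: [...value] };
    return value;
  });
}

function deserializeState(json: string): unknown {
  return JSON.parse(json, (_key, value: any) => {
    if (value && value.__type === "Date") return new Date(value.iso);
    if (value && value.__type === "Map") return new Map(value.entries);
    return value;
  });
}
```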
Day 3: Context Bus (Shared State)
Time Estimate: 5 hours Dependencies: Day 1-2 complete
Tasks:
- Create context bus (`src/orchestrator/context.ts`)
  - Shared state store (in-memory + persistent)
  - Key-value interface
  - Get/set/update/subscribe
  - Snapshot and restore
- Implement context scoping
  - Global context (across all runs)
  - Run context (scoped to trace ID)
  - Phase context (scoped to phase)
- Integrate with agents
  - Agents read from context
  - Agents write to context
  - Context persisted at checkpoints
- Create context helpers
  - Context builders
  - Type-safe context accessors
  - Context validation
Files to Create:
- `src/orchestrator/context.ts`
- `tests/orchestrator/context.test.ts`
Acceptance Criteria:
- ✓ Context shared across agents
- ✓ Context persisted at checkpoints
- ✓ Context scoping works
- ✓ Type safety maintained
Day 4: Human Approval Gates
Time Estimate: 5 hours Dependencies: Day 1-3 complete
Tasks:
- Create gate system (`src/safety/gates.ts`)
  - Define gate types:
    - Architecture approval (high-risk plans)
    - Production deploy (always)
    - Security findings (critical severity)
    - Cost overrun (approaching budget)
  - Gate condition evaluation
  - Gate timeout handling
- Implement gate workflow
  - Pause pipeline at gate
  - Notify human (CLI prompt, webhook, etc.)
  - Wait for approval/rejection
  - Resume pipeline on approval
  - Halt pipeline on rejection
  - Timeout fallback (default: reject)
- Create CLI prompts for gates
  - Display gate context
  - Display relevant information
  - Yes/no/defer prompt
  - Optional feedback input
- Integrate with pipeline
  - Check gates after phase execution
  - Store gate decisions as events
Files to Create:
- `src/safety/gates.ts`
- `src/cli/gates.ts`
- `tests/safety/gates.test.ts`
Acceptance Criteria:
- ✓ Gates evaluated correctly
- ✓ Pipeline pauses at gate
- ✓ CLI prompts display
- ✓ Approval resumes pipeline
- ✓ Rejection halts pipeline
- ✓ Timeout handled
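The wait-for-human-with-timeout workflow reduces to racing the human's decision against a timer that resolves to the default ("reject"). The `askHuman` callback is an illustrative stand-in for the CLI prompt or webhook.

```typescript
// Gate wait sketch: whichever settles first wins; an unanswered gate
// falls back to "rejected" per the default above.
type GateDecision = "approved" | "rejected";

async function awaitGate(
  askHuman: () => Promise<GateDecision>,
  timeoutMs: number
): Promise<GateDecision> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<GateDecision>((resolve) => {
    timer = setTimeout(() => resolve("rejected"), timeoutMs); // default: reject
  });
  const decision = await Promise.race([askHuman(), timeout]);
  clearTimeout(timer!); // don't keep the process alive after a decision
  return decision;
}
```

Defaulting to reject on timeout is the safe direction: an unattended gate halts the pipeline rather than letting a deploy or high-risk plan through.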
Day 5: Full Pipeline Integration Test + "forge run"
Time Estimate: 5 hours Dependencies: Day 1-4 complete
Tasks:
- Wire up complete pipeline
  - Integrate all agents
  - Integrate all phases
  - Integrate checkpoints
  - Integrate context bus
  - Integrate gates
  - Integrate breakers
- Create "forge run" CLI command (`src/cli/commands/run.ts`)
  - Parse requirements from CLI
  - Initialize pipeline
  - Execute pipeline
  - Display progress
  - Handle interruptions (Ctrl+C)
  - Display final status
- Create full integration test
  - Run complete pipeline with real LLM
  - Verify all phases execute
  - Verify checkpoints created
  - Verify gates triggered
  - Verify final output
Files to Create:
- `src/cli/commands/run.ts`
- `tests/integration/pipeline-full.test.ts`
Acceptance Criteria:
- ✓ "forge run" command works
- ✓ Complete pipeline executes
- ✓ All agents coordinate correctly
- ✓ Checkpoints work
- ✓ Gates work
- ✓ Breakers work
Week 6 Deliverable:
- Pipeline orchestrator functional
- State machine working
- Checkpoint system operational
- Context bus sharing state
- Human approval gates working
- "forge run" executes complete pipeline
- Full integration test passing
Week 7: CLI + Polish
Goal: Build a polished CLI with commands for running, reviewing, testing, status checks, and history. Terminal UI for progress and findings. Per-project configuration. Memory consolidation job.
Day 1: CLI Framework + Core Commands (Part 1)
Time Estimate: 5 hours Dependencies: Week 6 complete
Tasks:
- Set up CLI framework
  - Use `commander` or similar for CLI parsing
  - Define command structure
  - Add help text
  - Add version flag
- Implement "forge run" (refine from Week 6 Day 5)
  - Accept requirements as argument or file
  - Accept config overrides via flags
  - Display progress indicators
  - Handle errors gracefully
- Implement "forge review"
  - Manually trigger review of PR or local changes
  - Display review findings
  - Option to auto-fix findings
- Create CLI utilities
  - Spinner/progress indicators
  - Error formatting
  - Success messages
Files to Create:
- `src/cli/index.ts`
- `src/cli/commands/run.ts` (refine)
- `src/cli/commands/review.ts`
- `src/cli/ui/spinner.ts`
- `src/cli/ui/format.ts`
Acceptance Criteria:
- ✓ CLI parses commands
- ✓ Help text displays
- ✓ "forge run" works
- ✓ "forge review" works
- ✓ Progress indicators display
Day 2: CLI Core Commands (Part 2)
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
- Implement "forge test"
  - Manually trigger test execution
  - Display test results
  - Display failure analysis
  - Option to generate missing tests
- Implement "forge status"
  - Display current pipeline status
  - Display recent runs
  - Display cost usage
  - Display agent health
- Implement "forge history"
  - List past runs
  - Filter by status/date
  - Display run details
  - Replay events from a run
- Create status display
  - Table formatting
  - Color coding by status
  - Duration formatting
  - Cost formatting
Files to Create:
- `src/cli/commands/test.ts`
- `src/cli/commands/status.ts`
- `src/cli/commands/history.ts`
- `src/cli/ui/table.ts`
Acceptance Criteria:
- ✓ "forge test" works
- ✓ "forge status" displays current state
- ✓ "forge history" lists runs
- ✓ Tables formatted nicely
Day 3: Terminal UI (Progress + Findings Display)
Time Estimate: 5 hours Dependencies: Day 1-2 complete
Tasks:
-
Create progress display for pipeline
- Show current phase
- Show phase progress (iteration count, time elapsed)
- Show cost so far
- Update in real-time
-
Create findings display
- Group findings by file
- Color code by severity
- Show code snippets
- Show fix suggestions
- Collapsible sections
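The grouping step can be sketched as follows; the `Finding` shape and severity names are assumptions for illustration, not the real schema:

```typescript
// Sketch: group findings by file, most severe first within each file.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  file: string;
  line: number;
  severity: Severity;
  message: string;
}

const severityRank: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

function groupFindings(findings: Finding[]): Map<string, Finding[]> {
  const groups = new Map<string, Finding[]>();
  for (const f of findings) {
    const bucket = groups.get(f.file) ?? [];
    bucket.push(f);
    groups.set(f.file, bucket);
  }
  // Within each file, surface the most severe findings first.
  for (const bucket of groups.values()) {
    bucket.sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);
  }
  return groups;
}
```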
-
Create event stream display
- Show events as they occur
- Filter by event type
- Expandable details
-
Create dashboard view
- Overview of system state
- Recent activity
- Cost dashboard
- Agent status
Files to Create:
- `src/cli/ui/progress.ts`
- `src/cli/ui/findings.ts`
- `src/cli/ui/events.ts`
- `src/cli/ui/dashboard.ts`
Acceptance Criteria:
- ✓ Progress updates in real-time
- ✓ Findings display is readable
- ✓ Events stream displays
- ✓ Dashboard shows system state
Day 4: Per-Project Configuration
Time Estimate: 4 hours Dependencies: Week 6 complete
Tasks:
-
Create configuration loader
- Load `forge.config.ts` from project root
- Merge with defaults
- Validate configuration
-
Define configuration schema
- Project metadata (name, language)
- LLM provider settings
- Tool commands (test, lint, build, typecheck)
- Safety settings (cost limits, iteration limits)
- GitHub integration settings
- Memory settings
-
Implement "forge init" command
- Generate `forge.config.ts` template
- Interactive prompts for key settings
- Detect project type (package.json, etc.)
-
Create configuration validation
- Validate all required fields
- Validate types
- Provide helpful error messages
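The merge-and-validate flow can be sketched like this. The `ForgeConfig` fields shown are placeholders, not the real schema (which will likely be a zod schema per the Week 1 dependencies):

```typescript
// Sketch of config loading: merge user config over defaults, then validate,
// collecting every error so the user sees all problems at once.
interface ForgeConfig {
  projectName: string;     // assumed fields for illustration only
  language: string;
  maxIterations: number;
  costLimitUsd: number;
}

const defaults: ForgeConfig = {
  projectName: "",
  language: "typescript",
  maxIterations: 3,
  costLimitUsd: 10,
};

function loadConfig(userConfig: Partial<ForgeConfig>): ForgeConfig {
  const merged: ForgeConfig = { ...defaults, ...userConfig };
  const errors: string[] = [];
  if (!merged.projectName) errors.push("projectName is required");
  if (merged.maxIterations < 1) errors.push("maxIterations must be >= 1");
  if (merged.costLimitUsd <= 0) errors.push("costLimitUsd must be positive");
  if (errors.length > 0) {
    throw new Error(`Invalid forge.config.ts:\n  - ${errors.join("\n  - ")}`);
  }
  return merged;
}
```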
Files to Create:
- `src/cli/commands/init.ts`
- `src/core/config-loader.ts`
- `forge.config.template.ts`
- `tests/core/config-loader.test.ts`
Acceptance Criteria:
- ✓ "forge init" creates config file
- ✓ Config loads from project root
- ✓ Config merges with defaults
- ✓ Config validation works
Day 5: Memory Consolidation Job
Time Estimate: 6 hours Dependencies: Week 2 complete
Tasks:
-
Create consolidation job (`src/memory/consolidate.ts`)
- Run periodically (nightly or on-demand)
- Extract patterns from episodic memories
- Merge similar patterns
- Promote high-frequency episodes to patterns
- Decay confidence on unused memories
- Prune low-confidence memories
- Archive old memories
-
Implement pattern extraction
- Query recent episodic memories
- Use LLM to extract patterns
- Deduplicate patterns
- Calculate frequency and success rate
-
Implement memory pruning
- Identify low-confidence memories (< 0.2)
- Identify stale memories (not accessed in 90 days)
- Archive or delete
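The pruning rules above reduce to a simple predicate. The thresholds (confidence < 0.2, 90-day staleness) come from this plan; the `Memory` shape is an assumption:

```typescript
// Sketch of the memory pruning predicate.
interface Memory {
  id: string;
  confidence: number;     // 0..1
  lastAccessedAt: number; // epoch ms
}

const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

function shouldPrune(memory: Memory, now: number): boolean {
  const lowConfidence = memory.confidence < 0.2;
  const stale = now - memory.lastAccessedAt > NINETY_DAYS_MS;
  return lowConfidence || stale;
}
```

The consolidation job would run this over all memories and archive (or delete) the matches.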
-
Implement "forge consolidate" command
- Manually trigger consolidation
- Display consolidation results
- Show patterns extracted
- Show memories pruned
Files to Create:
- `src/memory/consolidate.ts`
- `src/cli/commands/consolidate.ts`
- `tests/memory/consolidate.test.ts`
Acceptance Criteria:
- ✓ Consolidation job runs
- ✓ Patterns extracted from episodes
- ✓ Similar patterns merged
- ✓ Low-confidence memories pruned
- ✓ "forge consolidate" command works
Week 7 Deliverable:
- Polished CLI with all core commands
- Terminal UI with progress and findings display
- Per-project configuration system
- "forge init" bootstraps new projects
- Memory consolidation job operational
- User experience polished
Week 8: Harden + Document
Goal: Error recovery, cost tracking dashboard, real-world testing on actual projects, edge case handling, documentation. Production-ready v0.1.
Day 1: Error Recovery (Part 1)
Time Estimate: 5 hours Dependencies: Week 7 complete
Tasks:
-
Implement retry logic
- Retry transient errors (network, rate limit)
- Exponential backoff
- Max retries per operation
- Log all retries
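A sketch of the retry policy, with the backoff schedule factored out as a pure function and the sleep injectable so tests don't wait (function names are assumptions, not the final `src/core/retry.ts` API):

```typescript
// Exponential backoff with a cap: base, 2*base, 4*base, ... up to capMs.
function backoffSchedule(maxRetries: number, baseMs: number, capMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => Math.min(baseMs * 2 ** i, capMs));
}

async function retry<T>(
  op: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  delays: number[],
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Permanent errors and exhausted retry budgets fail immediately.
      if (!isTransient(err) || attempt >= delays.length) throw err;
      await sleep(delays[attempt]);
    }
  }
}
```

Logging each retry (attempt number, delay, error) would hook into the event bus inside the catch branch.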
-
Implement fallback strategies
- If LLM fails: fallback to cached response or simpler prompt
- If tool fails: fallback to alternative tool or manual mode
- If agent fails: fallback to previous checkpoint
-
Classify errors by recoverability
- Transient (retry)
- Permanent (fail)
- User error (prompt for correction)
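Classification can dispatch on the error taxonomy from Week 1. The class names below are placeholders; the real hierarchy lives in `src/core/errors.ts`:

```typescript
// Sketch of error classification by recoverability.
type Recoverability = "transient" | "permanent" | "user-error";

class RateLimitError extends Error {}
class NetworkError extends Error {}
class ConfigError extends Error {}

function classify(err: unknown): Recoverability {
  if (err instanceof RateLimitError || err instanceof NetworkError) return "transient";
  if (err instanceof ConfigError) return "user-error";
  // Anything unrecognized is treated as permanent: fail fast rather than retry blindly.
  return "permanent";
}
```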
-
Enhance error messages
- Actionable error messages
- Suggest recovery steps
- Link to documentation
Files to Create:
- `src/core/retry.ts`
- `src/core/fallback.ts`
- `tests/core/retry.test.ts`
Acceptance Criteria:
- ✓ Transient errors retried
- ✓ Fallback strategies work
- ✓ Error messages are helpful
Day 2: Error Recovery (Part 2) + Checkpoint Resume
Time Estimate: 5 hours Dependencies: Day 1 complete
Tasks:
-
Implement checkpoint resume
- "forge resume" command
- List resumable runs (failed or interrupted)
- Resume from last checkpoint
- Replay events up to checkpoint
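The replay step can be sketched as filtering the event log by the checkpoint's position and re-applying in order. Event and checkpoint shapes here are assumptions for illustration:

```typescript
// Sketch: select the events to re-apply when resuming from a checkpoint.
interface PipelineEvent {
  seq: number;  // monotonically increasing sequence number from the event log
  type: string;
}

interface Checkpoint {
  lastSeq: number; // last event sequence number captured by this checkpoint
  phase: string;
}

function eventsToReplay(events: PipelineEvent[], checkpoint: Checkpoint): PipelineEvent[] {
  return events
    .filter((e) => e.seq <= checkpoint.lastSeq)
    .sort((a, b) => a.seq - b.seq);
}
```

The orchestrator would then rebuild in-memory state by folding these events, and continue the pipeline from `checkpoint.phase`.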
-
Implement error recovery workflow
- When pipeline fails:
- Create checkpoint
- Log error
- Notify user
- Offer to resume or rollback
- When user resumes:
- Restore from checkpoint
- Attempt to fix error
- Continue pipeline
-
Test recovery scenarios
- Network failure during LLM call
- Tool execution timeout
- Circuit breaker trip
- User interrupt (Ctrl+C)
Files to Create:
- `src/cli/commands/resume.ts`
- `src/orchestrator/recovery.ts`
- `tests/orchestrator/recovery.test.ts`
Acceptance Criteria:
- ✓ Failed runs can be resumed
- ✓ Checkpoint restore works
- ✓ Recovery workflow works
- ✓ All recovery scenarios tested
Day 3: Cost Tracking Dashboard
Time Estimate: 5 hours Dependencies: Week 7 complete
Tasks:
-
Create cost tracking system
- Track costs per run
- Track costs per phase
- Track costs per agent
- Track costs per day/week/month
- Store in database
-
Create cost dashboard
- Display current spend
- Display budget remaining
- Display cost breakdown (by agent, by phase)
- Display cost trends over time
- Warn on approaching limits
-
Implement "forge costs" command
- Display cost dashboard
- Filter by date range
- Export cost data (CSV)
-
Add cost estimates
- Before pipeline execution, estimate cost
- Display estimate to user
- Prompt for confirmation if > threshold
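A token-based estimate is one way to implement this. The per-token rates below are placeholder numbers, not real provider pricing, and the phase-estimate shape is an assumption:

```typescript
// Sketch of pre-run cost estimation with a confirmation threshold.
interface PhaseEstimate {
  phase: string;
  inputTokens: number;
  outputTokens: number;
}

const INPUT_RATE_PER_MTOK = 3;   // assumed $/million input tokens (placeholder)
const OUTPUT_RATE_PER_MTOK = 15; // assumed $/million output tokens (placeholder)

function estimateCostUsd(phases: PhaseEstimate[]): number {
  return phases.reduce(
    (sum, p) =>
      sum +
      (p.inputTokens / 1_000_000) * INPUT_RATE_PER_MTOK +
      (p.outputTokens / 1_000_000) * OUTPUT_RATE_PER_MTOK,
    0,
  );
}

// The CLI prompts for confirmation when the estimate exceeds the user's threshold.
function needsConfirmation(estimateUsd: number, thresholdUsd: number): boolean {
  return estimateUsd > thresholdUsd;
}
```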
Files to Create:
- `src/safety/cost-tracker.ts`
- `src/cli/commands/costs.ts`
- `src/cli/ui/cost-dashboard.ts`
- `tests/safety/cost-tracker.test.ts`
Acceptance Criteria:
- ✓ Costs tracked accurately
- ✓ Dashboard displays cost breakdown
- ✓ "forge costs" command works
- ✓ Cost estimates before execution
- ✓ Warnings on approaching limits
Day 4: Real-World Testing + Bug Fixes
Time Estimate: 6 hours Dependencies: Week 7 complete, Days 1-3 complete
Tasks:
-
Test on real projects
- Select 2-3 real codebases
- Run "forge review" on recent PRs
- Run "forge test" on test suites
- Run "forge run" with simple requirements
- Document all issues
-
Fix discovered bugs
- Prioritize critical bugs
- Fix issues from real-world testing
- Add regression tests
-
Performance profiling
- Profile slow operations
- Optimize database queries
- Add caching where appropriate
- Reduce LLM calls where possible
-
Improve robustness
- Add null checks
- Add input validation
- Handle edge cases
- Improve error handling
Files to Create:
- `tests/real-world/` (test logs and results)
- Various bug fix commits
Acceptance Criteria:
- ✓ Tested on real projects
- ✓ Critical bugs fixed
- ✓ Performance acceptable
- ✓ Regression tests added
Day 5: Edge Cases + Documentation + Release Prep
Time Estimate: 6 hours Dependencies: Week 8 Days 1-4 complete
Tasks:
-
Handle edge cases
- Empty codebase
- Very large codebase
- Binary files
- Permission issues
- Network offline
- Out of disk space
- Invalid configuration
-
Write documentation
- README.md (installation, quick start)
- ARCHITECTURE.md (system overview)
- CONFIGURATION.md (config options)
- CLI.md (command reference)
- DEVELOPMENT.md (contributing guide)
-
Create examples
- Example configurations
- Example workflows
- Tutorial: first feature with Forge
-
Release preparation
- Update version to v0.1.0
- Write CHANGELOG.md
- Create GitHub release
- Tag release commit
Files to Create:
- `README.md`
- `docs/ARCHITECTURE.md`
- `docs/CONFIGURATION.md`
- `docs/CLI.md`
- `docs/DEVELOPMENT.md`
- `CHANGELOG.md`
- `examples/` (various examples)
Acceptance Criteria:
- ✓ Edge cases handled
- ✓ Documentation complete
- ✓ Examples working
- ✓ Release tagged
Week 8 Deliverable:
- Error recovery working
- Checkpoint resume functional
- Cost tracking dashboard
- Real-world testing complete
- Bugs fixed
- Edge cases handled
- Documentation complete
- Production-ready v0.1 released
Dependencies Between Weeks
Critical Path
The following must be completed in sequence (cannot parallelize):
Week 1 → Week 2 → Week 3 → Week 4 → Week 5 → Week 6 → Week 7 → Week 8
Dependency Details
| Week | Must Complete Before | Why |
|---|---|---|
| Week 1 | Week 2 | Memory and tools need core types, bus, config |
| Week 2 | Week 3 | Reviewer needs memory system and tools |
| Week 3 | Week 4 | Tester uses same patterns as Reviewer |
| Week 4 | Week 5 | Planner/Implementer need Reviewer and Tester to exist for bounces |
| Week 5 | Week 6 | Orchestrator coordinates existing agents |
| Week 6 | Week 7 | CLI commands need working orchestrator |
| Week 7 | Week 8 | Can't harden what doesn't exist yet |
Parallelization Opportunities
Within each week, some tasks can be parallelized:
Week 1:
- Day 3 (Bus) and Day 4 (Config) can be parallelized
Week 2:
- Day 1-2 (Memory) and Day 4 (Tools) can be partially parallelized
Week 3:
- Day 2 (Static) and Day 3 (AI review) can be developed in parallel
Week 5:
- Day 1-2 (Planner) and Day 3-4 (Implementer) can be parallelized by 2 developers
Week 7:
- Day 1-2 (CLI commands) and Day 3 (UI) can be parallelized
Risk per Week
Week 1 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Schema design wrong | Medium | High | Review against all agent requirements before Day 2 |
| Drizzle setup issues | Low | Medium | Use Drizzle docs, test migrations early |
| Type system too complex | Medium | Medium | Start simple, iterate; avoid premature abstraction |
Week 2 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Similarity search too slow | Medium | Medium | Start with simple implementation, optimize later |
| LLM API rate limits | Medium | Medium | Implement exponential backoff, use caching |
| Memory bloat | Medium | High | Implement pruning early; test with large datasets |
Week 3 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| AI review too expensive | High | High | Use cheap model (Haiku) for reviews; cache aggressively |
| GitHub API rate limits | Medium | Medium | Implement backoff; use GraphQL for efficiency |
| False positive rate too high | High | Medium | Tune confidence thresholds; learn from dismissals |
Week 4 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Test selection misses bugs | Medium | Critical | Always run baseline smoke tests; increase coverage for high-risk |
| Flaky test detection fails | Medium | Medium | Require 2+ inconsistent results to mark as flaky |
| Test generation low quality | High | Low | Human review required before merging generated tests |
Week 5 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Code generation quality poor | High | Critical | Self-validation loop catches many issues; reviewer catches the rest |
| Self-validation loop infinite | Medium | High | Max iterations (3); circuit breakers |
| Integration test takes too long | Medium | Medium | Use small test fixture; mock LLM for speed tests |
Week 6 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| State machine bugs | Medium | High | Thorough testing of all state transitions |
| Checkpoint serialization fails | Medium | High | Test with complex state; validate on restore |
| Phase bounce infinite loop | Medium | High | Max bounce count (3); circuit breaker |
Week 7 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| CLI UX confusing | Medium | Medium | User testing; iterate on feedback |
| Configuration too complex | Medium | Medium | Sensible defaults; "forge init" generates valid config |
| Memory consolidation too slow | Low | Medium | Run async; user can continue working |
Week 8 Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Real-world testing uncovers major bugs | High | High | Budget extra time; prioritize critical bugs |
| Recovery logic flawed | Medium | Critical | Test all recovery scenarios; manual testing |
| Not enough time for polish | High | Medium | Cut scope if needed; v0.1 is MVP, not perfect |
Acceptance Criteria per Week
Week 1: Core Skeleton
- Project initializes and compiles with `bun run typecheck`
- Database schema created and migrations run
- All core types exported from `src/core/types.ts`
- Error taxonomy has 5+ specialized error classes
- Event bus can emit, subscribe, persist, and replay events
- Configuration loads from defaults and project config
- Circuit breakers trip at configured thresholds
- LLM abstraction can make chat completions with Anthropic
- All unit tests pass with >80% coverage
- At least one integration test with real LLM passes
Week 2: Memory + Tools
- Memory store can create, read, update, delete memories
- Similarity search returns relevant memories
- Episodic memories stored and queried by trace ID
- Patterns extracted from episodic memories
- Tool registry can register and execute tools
- Git tool can run basic git commands
- Shell runner tool executes commands with timeout
- Linter tool integrates with ESLint or Biome
- Base agent loop executes with real LLM
- Base agent can call tools and store memories
- All unit tests pass with >80% coverage
Week 3: Reviewer Agent
- Reviewer agent extends BaseAgent
- Static analysis layer runs and parses lint output
- AI review layer runs for medium+ risk changes
- Risk scoring algorithm implemented
- GitHub tool can fetch PR and post comments
- Findings persist to database
- Finding dismissals decrease pattern confidence
- Phase bounce logic (review → implement → re-review) works
- Review metrics tracked (false positive rate, duration)
- Integration test: review a real PR
Week 4: Tester Agent
- Tester agent extends BaseAgent
- Risk-based test selection works
- Test runner tool executes tests and parses output
- Flaky test detection (retry logic) works
- Failure analysis classifies failures correctly
- LLM provides root cause analysis for failures
- Test gap detection identifies uncovered code
- Test generator creates valid tests for gaps
- Phase bounce logic (test → implement → re-test) works
- Test metrics tracked (pass rate, flaky rate, coverage)
Week 5: Planner + Implementer
- Planner agent analyzes requirements and extracts stories
- Planner agent designs architecture
- Planner agent decomposes into ordered tasks
- Planner agent assesses risk
- Implementer agent generates code from task spec
- Implementer agent generates tests alongside code
- Self-validation loop runs typecheck, linter, tests
- Self-validation loop auto-fixes issues (max 3 iterations)
- End-to-end integration test: plan → implement → review → test
- Phase bounces work across all agents
Week 6: Orchestrator
- Pipeline state machine transitions through all phases
- Checkpoint system creates checkpoints after each phase
- Checkpoint restore works correctly
- Context bus shares state across agents
- Human approval gates pause pipeline and wait for approval
- Circuit breakers halt pipeline when thresholds exceeded
- "forge run" command executes complete pipeline
- Full integration test with all phases passes
Week 7: CLI + Polish
- CLI framework parses commands and displays help
- "forge run" command works (refined from Week 6)
- "forge review" command works
- "forge test" command works
- "forge status" displays current pipeline status
- "forge history" lists past runs
- "forge init" generates project configuration
- Terminal UI displays progress in real-time
- Terminal UI displays findings nicely formatted
- Memory consolidation job extracts patterns and prunes memories
- "forge consolidate" command works
Week 8: Harden + Document
- Retry logic handles transient errors
- Fallback strategies implemented for LLM and tool failures
- "forge resume" command resumes failed runs from checkpoint
- Error recovery workflow tested (all scenarios)
- Cost tracking accurate per run/phase/agent
- "forge costs" command displays cost dashboard
- Cost estimates shown before pipeline execution
- Tested on 2-3 real projects
- Critical bugs from real-world testing fixed
- Performance profiled and optimized
- Edge cases handled (empty codebase, large codebase, etc.)
- Documentation complete (README, architecture, CLI reference, configuration, development guide)
- Examples created and working
- v0.1.0 tagged and released
Integration Test Plan
Week Boundary Integration Tests
End of Week 1:
- Test: Initialize project, load config, emit event, persist to database, replay events
- Test: LLM chat completion with real API
- Expected: All pass, no errors
End of Week 2:
- Test: Store memory, query by similarity, recall relevant memories
- Test: Register tool, execute tool, capture result
- Test: Base agent loop with LLM + tool execution
- Expected: Agent executes tool based on LLM decision, stores memory
End of Week 3:
- Test: Review a small code change (mock PR)
- Test: Reviewer runs static analysis, AI review, posts findings
- Test: Finding dismissed, confidence decreases
- Test: Phase bounce (review → implement → re-review)
- Expected: Complete review with findings, bounce works
End of Week 4:
- Test: Tester selects tests based on code changes
- Test: Tester executes tests, analyzes failures
- Test: Tester generates tests for gaps
- Test: Phase bounce (test → implement → re-test)
- Expected: Test failures analyzed, gaps filled, bounce works
End of Week 5:
- Test: Full flow: requirements → plan → implement → review → test
- Test: Self-validation loop fixes typecheck/lint/test issues
- Test: Phase bounces work across all agents
- Expected: End-to-end flow produces valid, tested code
End of Week 6:
- Test: "forge run" executes complete pipeline
- Test: Checkpoints created and restored correctly
- Test: Human approval gates pause and resume pipeline
- Test: Circuit breakers halt pipeline when thresholds exceeded
- Expected: Complete pipeline execution with all safety controls
End of Week 7:
- Test: All CLI commands work ("forge run", "forge review", "forge test", "forge status", "forge history", "forge init", "forge consolidate")
- Test: Terminal UI displays correctly
- Test: Memory consolidation extracts patterns
- Expected: Polished user experience, all commands functional
End of Week 8:
- Test: Recovery from all failure scenarios (network error, tool timeout, circuit breaker, user interrupt)
- Test: Cost tracking accurate across multiple runs
- Test: Real-world project testing (run full pipeline on actual codebase)
- Expected: Production-ready system, all edge cases handled
File Creation Checklist by Week
Week 1 Files
src/memory/schema.ts
src/core/types.ts
src/core/errors.ts
src/core/schemas.ts
src/core/bus.ts
src/core/config.ts
src/safety/breakers.ts
src/safety/budget.ts
src/tools/llm.ts
src/tools/llm-anthropic.ts
src/tools/prompts.ts
tests/core/types.test.ts
tests/core/errors.test.ts
tests/core/bus.test.ts
tests/core/config.test.ts
tests/safety/breakers.test.ts
tests/tools/llm.test.ts
drizzle.config.ts
drizzle/0000_initial.sql
scripts/seed.ts
forge.config.example.ts
Week 2 Files
src/memory/store.ts
src/memory/similarity.ts
src/memory/episodes.ts
src/memory/patterns.ts
src/tools/registry.ts
src/tools/git.ts
src/tools/runner.ts
src/tools/linter.ts
src/agents/base.ts
tests/memory/store.test.ts
tests/memory/similarity.test.ts
tests/memory/episodes.test.ts
tests/memory/patterns.test.ts
tests/tools/registry.test.ts
tests/tools/git.test.ts
tests/tools/runner.test.ts
tests/agents/base.integration.test.ts
Week 3 Files
src/agents/reviewer.ts
src/agents/reviewer-types.ts
src/agents/reviewer-static.ts
src/agents/reviewer-ai.ts
src/agents/reviewer-prompts.ts
src/agents/reviewer-github.ts
src/agents/reviewer-persistence.ts
src/agents/reviewer-learning.ts
src/tools/github.ts
tests/agents/reviewer.test.ts
tests/agents/reviewer-static.test.ts
tests/agents/reviewer-ai.test.ts
tests/agents/reviewer-learning.test.ts
tests/tools/github.test.ts
Week 4 Files
src/agents/tester.ts
src/agents/tester-selection.ts
src/agents/tester-analysis.ts
src/agents/tester-prompts.ts
src/agents/tester-gaps.ts
src/agents/tester-generator.ts
src/agents/tester-bounce.ts
src/agents/tester-metrics.ts
src/tools/test-runner.ts
tests/agents/tester.test.ts
tests/agents/tester-analysis.test.ts
tests/agents/tester-generator.test.ts
tests/agents/tester-bounce.test.ts
tests/tools/test-runner.test.ts
Week 5 Files
src/agents/planner.ts
src/agents/planner-analysis.ts
src/agents/planner-design.ts
src/agents/planner-tasks.ts
src/agents/planner-risk.ts
src/agents/planner-prompts.ts
src/agents/implementer.ts
src/agents/implementer-generation.ts
src/agents/implementer-validation.ts
src/agents/implementer-fix.ts
src/agents/implementer-prompts.ts
tests/agents/planner.test.ts
tests/agents/planner-design.test.ts
tests/agents/implementer.test.ts
tests/agents/implementer-validation.test.ts
tests/integration/full-flow.test.ts
tests/fixtures/sample-requirements.ts
tests/fixtures/sample-codebase/
Week 6 Files
src/orchestrator/pipeline.ts
src/orchestrator/phases.ts
src/orchestrator/checkpoint.ts
src/orchestrator/context.ts
src/safety/gates.ts
src/cli/gates.ts
src/cli/commands/run.ts
tests/orchestrator/pipeline.test.ts
tests/orchestrator/checkpoint.test.ts
tests/orchestrator/context.test.ts
tests/safety/gates.test.ts
tests/integration/pipeline-full.test.ts
Week 7 Files
src/cli/index.ts
src/cli/commands/run.ts (refine)
src/cli/commands/review.ts
src/cli/commands/test.ts
src/cli/commands/status.ts
src/cli/commands/history.ts
src/cli/commands/init.ts
src/cli/commands/consolidate.ts
src/cli/ui/spinner.ts
src/cli/ui/format.ts
src/cli/ui/table.ts
src/cli/ui/progress.ts
src/cli/ui/findings.ts
src/cli/ui/events.ts
src/cli/ui/dashboard.ts
src/core/config-loader.ts
src/memory/consolidate.ts
forge.config.template.ts
tests/core/config-loader.test.ts
tests/memory/consolidate.test.ts
Week 8 Files
src/core/retry.ts
src/core/fallback.ts
src/cli/commands/resume.ts
src/cli/commands/costs.ts
src/orchestrator/recovery.ts
src/safety/cost-tracker.ts
src/cli/ui/cost-dashboard.ts
tests/core/retry.test.ts
tests/orchestrator/recovery.test.ts
tests/safety/cost-tracker.test.ts
tests/real-world/
README.md
docs/ARCHITECTURE.md
docs/CONFIGURATION.md
docs/CLI.md
docs/DEVELOPMENT.md
CHANGELOG.md
examples/
Total Files: ~150+ (core implementation + tests + docs)
Daily Time Estimates Summary
| Week | Total Hours | Avg Hours/Day |
|---|---|---|
| Week 1 | 25 hours | 5 hours |
| Week 2 | 27 hours | 5.4 hours |
| Week 3 | 26 hours | 5.2 hours |
| Week 4 | 26 hours | 5.2 hours |
| Week 5 | 28 hours | 5.6 hours |
| Week 6 | 26 hours | 5.2 hours |
| Week 7 | 25 hours | 5 hours |
| Week 8 | 27 hours | 5.4 hours |
| Total | 210 hours | 5.25 hours/day |
Note: Each day is scoped to 4-6 hours of focused work, leaving buffer time for unexpected issues, learning, and context switching.
Success Criteria for MVP (v0.1)
At the end of 8 weeks, the system must meet these criteria to be considered production-ready:
Functional Criteria
- "forge run" executes a complete pipeline from requirements to tested code
- All five agents (Planner, Implementer, Reviewer, Tester, Deployer stub) operational
- Memory system stores and recalls learnings across runs
- Phase bounces work (review → fix, test → fix)
- Human approval gates work for high-risk decisions
- Circuit breakers halt runaway execution
- Checkpoint system allows resume from failure
- CLI provides commands for all core workflows
- Real-world testing on 2+ projects successful
Quality Criteria
- Test coverage > 80% for core modules
- All integration tests pass
- No critical bugs in issue tracker
- Performance acceptable (pipeline completes in <30 min for simple features)
- Cost per run < $10 for typical feature
Documentation Criteria
- README with installation and quick start
- Architecture documentation
- CLI reference
- Configuration guide
- At least 2 working examples
Safety Criteria
- No unintended code execution
- No secret leakage
- Human approval required for production deploys
- Cost limits enforced
- Error recovery tested for all failure modes
Post-MVP Roadmap (Weeks 9-12+)
After v0.1 release, these are the next priorities:
Week 9-10: Deployer Agent (Full Implementation)
Currently just a stub. Implement:
- Canary deployment strategy
- Health check monitoring
- Auto-rollback on failure
- Feature flag integration
Week 11: Multi-Agent Parallelization
- Implement agent swarm for parallel implementation
- Coordinate multiple implementers working on different modules
- Merge strategy for parallel changes
Week 12: Advanced Memory
- Vector database integration (replace naive similarity search)
- Knowledge graph for relationships between patterns
- Cross-run learning improvements
- Meta-reflection (reflect on reflection quality)
Beyond Week 12
- Natural language requirements interface
- Multi-repo intelligence
- Self-improving prompts (GEP protocol full implementation)
- Visual design-to-code pipeline
- Integration with issue trackers (Jira, Linear)
Conclusion
This 8-week build plan provides a detailed, day-by-day roadmap for implementing the Forge agentic SDLC orchestration system. Each day's work is scoped to 4-6 hours of focused development, with clear deliverables and acceptance criteria.
Key Success Factors:
- Follow the order strictly - Dependencies are real; skipping ahead will cause rework
- Test continuously - Each day has tests; don't accumulate testing debt
- Commit frequently - Small, atomic commits make debugging easier
- Review the system design - Keep SYSTEM-DESIGN.md open; refer to it constantly
- Adapt as needed - This is a plan, not a contract; adjust based on learnings
By end of Week 8, you will have a production-ready v0.1 of Forge that can orchestrate the full SDLC with AI agents, learn from every execution, and provide a polished developer experience.
Now go build it.