Implementation Plan: Agent Designs (Section 7)

Plan: 07-agent-designs.md
Date: 2026-02-07
Scope: Detailed implementation specifications for all five specialized agents in the Forge system
Dependencies: Core abstractions (types.ts, base.ts), Tool Layer, Memory Layer


Overview

This plan provides complete implementation specifications for the five core agents in the Forge pipeline. Each agent extends BaseAgent and follows the perceive → reason → act → learn loop, with specialized prompts, tools, and decision logic.
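
The plan references Tool, AgentContext, Memory, and BaseAgent from the core abstractions without restating them. As a reading aid, here is a minimal sketch of the assumed shapes — names and fields are inferred from how the agents below use them, not taken from the actual types.ts/base.ts:

```typescript
// Hypothetical sketch of the core abstractions this plan builds on.
// Field names are inferred from usage in the agent specs below.
import { z } from 'zod';

interface Memory {
  content: string;
  confidence: number; // 0-1
}

interface Tool {
  name: string;
  description: string;
  schema: { input: z.ZodTypeAny; output: z.ZodTypeAny };
  execute(input: any, ctx: AgentContext): Promise<any>;
}

// Partial: later sections also use ctx.tools, ctx.gates, and ctx.config.
interface AgentContext {
  llm: {
    chat(req: {
      system: string;
      messages: { role: 'user' | 'assistant'; content: string }[];
      temperature?: number;
    }): Promise<{ content: string }>;
  };
  shell: { exec(cmd: string): Promise<{ exitCode: number; stdout: string; stderr: string }> };
  memory: { recall(q: { type: string; context: string; limit: number }): Promise<Memory[]> };
  bus: { emit(event: { type: string; payload: unknown }): void };
}

abstract class BaseAgent {
  abstract type: string;
  abstract tools: Tool[];
  abstract systemPrompt: string;

  // The shared loop: recall memories (perceive), buildPrompt + LLM turns
  // (reason), dispatch tool calls (act), reflect on completion (learn).
  async execute(input: unknown, ctx: AgentContext): Promise<unknown> {
    throw new Error('sketch only — see base.ts');
  }

  protected abstract buildPrompt(input: unknown, memories: Memory[]): string;
}
```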

Agents covered:

  1. Planner Agent — Requirements → Architecture + Task List
  2. Implementer Agent — Tasks → Code + Tests
  3. Reviewer Agent — Code → Findings + Risk Score
  4. Tester Agent — Code → Test Results + Analysis
  5. Deployer Agent — Validated Code → Deployment

Each agent specification includes:

  • Full system prompt text
  • Available tools
  • Input/output types
  • Decision flow logic
  • Class implementation extending BaseAgent

1. Planner Agent

1.1 Purpose

Transforms natural language requirements into structured implementation plans with architecture decisions, task decomposition, and risk assessment.

1.2 System Prompt

````markdown
# ROLE
You are a senior software architect creating implementation plans from requirements.

# TASK
Given a task description and codebase context, you will:
1. Analyze the requirements and decompose them into implementable units
2. Design system architecture and component interfaces
3. Create an ordered task list with dependencies
4. Assess implementation risk and complexity
5. Estimate effort and identify blockers

# AVAILABLE TOOLS
- read_file(path): Read file contents
- glob(pattern): Find files matching pattern
- grep(pattern, path?): Search for text in files
- llm_analyze(text, question): Ask follow-up questions to an LLM

# ARCHITECTURE DECISIONS
When designing architecture:
- Prefer existing patterns in the codebase
- Minimize new dependencies
- Consider testability and maintainability
- Document trade-offs explicitly
- Flag breaking changes

# RISK ASSESSMENT
Risk level is determined by:
- **Low**: Bug fix, UI tweak, refactor of non-critical code
- **Medium**: New feature, moderate complexity, some integration needed
- **High**: Core system changes, breaking changes, security-sensitive
- **Critical**: Authentication, payment processing, data integrity

Risk score components:
- Complexity (cyclomatic, cognitive)
- Criticality (affects core business logic)
- Change size (lines of code)
- Test coverage (existing vs. new)
- Uncertainty (novel patterns, unclear requirements)

# OUTPUT FORMAT
Return a structured ImplementationPlan with:
```typescript
{
  architecture: {
    components: Component[],     // What needs to be built/changed
    interfaces: InterfaceSpec[], // APIs, types, contracts
    decisions: ArchDecision[],   // Key decisions with rationale
    dependencies: string[]       // New packages needed
  },
  tasks: Task[],                 // Ordered list with dependencies
  risk: {
    level: 'low' | 'medium' | 'high' | 'critical',
    score: number,               // 0-100
    factors: { complexity, criticality, changeSize, coverage, uncertainty },
    mitigations: string[]        // How to reduce risk
  },
  estimates: {
    complexity: number,          // Story points or T-shirt size
    effort: number,              // Hours estimate
    uncertainty: 'low' | 'medium' | 'high'
  }
}
```

# MEMORY INTEGRATION
Before planning, query memory for:
- Similar tasks (context: task description)
- Architecture patterns (context: feature area)
- Past failures (context: similar requirements)

Apply relevant learnings to this plan.

# ITERATION
You may use tools multiple times to:
- Explore the codebase structure
- Understand existing patterns
- Validate assumptions
- Refine the plan

When done, return `{ done: true, result: implementationPlan }`.
````

1.3 Tools Available

```typescript
import { z } from 'zod';
import fs from 'node:fs/promises';
import { glob } from 'glob';
// Note: searchFiles (used by the grep tool below) is a project helper, not shown here.

const plannerTools: Tool[] = [
  {
    name: 'read_file',
    description: 'Read the contents of a file',
    schema: {
      input: z.object({ path: z.string() }),
      output: z.object({ content: z.string() })
    },
    execute: async ({ path }) => {
      return { content: await fs.readFile(path, 'utf-8') };
    }
  },
  {
    name: 'glob',
    description: 'Find files matching a glob pattern',
    schema: {
      input: z.object({ pattern: z.string() }),
      output: z.object({ files: z.array(z.string()) })
    },
    execute: async ({ pattern }) => {
      return { files: await glob(pattern) };
    }
  },
  {
    name: 'grep',
    description: 'Search for text in files',
    schema: {
      input: z.object({
        pattern: z.string(),
        path: z.string().optional()
      }),
      output: z.object({
        matches: z.array(z.object({
          file: z.string(),
          line: z.number(),
          content: z.string()
        }))
      })
    },
    execute: async ({ pattern, path }) => {
      return { matches: await searchFiles(pattern, path) };
    }
  },
  {
    name: 'llm_analyze',
    description: 'Ask a follow-up question to analyze context',
    schema: {
      input: z.object({ text: z.string(), question: z.string() }),
      output: z.object({ answer: z.string() })
    },
    execute: async ({ text, question }, ctx) => {
      const response = await ctx.llm.chat({
        system: 'You are a helpful assistant analyzing code and requirements.',
        messages: [{ role: 'user', content: `${text}\n\nQuestion: ${question}` }],
        temperature: 0.2
      });
      return { answer: response.content };
    }
  }
];
```

1.4 Input/Output Types

```typescript
interface PlannerInput {
  task: string; // Natural language description
  codebaseContext?: {
    rootPath: string;
    language: string;
    frameworks: string[];
    existingPatterns?: string[];
  };
}

interface ImplementationPlan {
  architecture: {
    components: Component[];
    interfaces: InterfaceSpec[];
    decisions: ArchDecision[];
    dependencies: string[];
  };
  tasks: Task[];
  risk: RiskAssessment;
  estimates: {
    complexity: number;
    effort: number;
    uncertainty: 'low' | 'medium' | 'high';
  };
}

interface Component {
  id: string;
  name: string;
  type: 'service' | 'library' | 'function' | 'module';
  responsibilities: string[];
  interfaces: string[];
  dependencies: string[];
}

interface InterfaceSpec {
  id: string;
  name: string;
  type: 'api' | 'function' | 'type' | 'event';
  signature: string;
  documentation: string;
}

interface ArchDecision {
  id: string;
  title: string;
  context: string;
  decision: string;
  consequences: string[];
  alternatives: string[];
}

interface Task {
  id: string;
  title: string;
  description: string;
  type: 'implement' | 'test' | 'document' | 'refactor';
  dependencies: string[];  // Task IDs that must complete first
  estimatedEffort: number; // Hours
  files: string[];         // Files expected to change
}

interface RiskAssessment {
  level: 'low' | 'medium' | 'high' | 'critical';
  score: number; // 0-100
  factors: {
    complexity: number;  // 0-20
    criticality: number; // 0-20
    changeSize: number;  // 0-20
    coverage: number;    // 0-20
    uncertainty: number; // 0-20
  };
  mitigations: string[];
}
```

1.5 Decision Flow

┌─────────────────────────────────────────┐
│ 1. PERCEIVE                             │
│   - Load task description               │
│   - Query memory for similar tasks      │
│   - Explore codebase structure          │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 2. REASON - Decompose Requirements      │
│   - Break down task into components     │
│   - Identify what exists vs. new        │
│   - Map to existing patterns            │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 3. REASON - Design Architecture         │
│   - Define component boundaries         │
│   - Specify interfaces                  │
│   - Make technology decisions           │
│   - Document trade-offs                 │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 4. REASON - Create Task List            │
│   - Order tasks by dependencies         │
│   - Estimate effort per task            │
│   - Identify critical path              │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 5. REASON - Assess Risk                 │
│   - Calculate risk score                │
│   - Identify risk factors               │
│   - Suggest mitigations                 │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 6. OUTPUT - Return ImplementationPlan   │
└─────────────────────────────────────────┘

1.6 Risk Scoring Logic

```typescript
function calculateRiskScore(plan: ImplementationPlan): number {
  // Each factor scored 0-20
  const factors = {
    complexity: scoreComplexity(plan),   // Cyclomatic + cognitive
    criticality: scoreCriticality(plan), // Business impact
    changeSize: scoreChangeSize(plan),   // Lines of code
    coverage: scoreCoverage(plan),       // Test coverage delta
    uncertainty: scoreUncertainty(plan)  // Novel patterns, unclear reqs
  };

  return Object.values(factors).reduce((a, b) => a + b, 0);
}

function scoreComplexity(plan: ImplementationPlan): number {
  // Estimate cyclomatic complexity from task descriptions
  const estimatedComplexity = plan.tasks.reduce((acc, task) => {
    if (task.description.includes('conditional') ||
        task.description.includes('loop') ||
        task.description.includes('switch')) {
      return acc + 5;
    }
    return acc + 1;
  }, 0);

  // Normalize to 0-20
  return Math.min(20, estimatedComplexity / plan.tasks.length);
}

function scoreCriticality(plan: ImplementationPlan): number {
  const criticalKeywords = [
    'auth', 'payment', 'security', 'data integrity',
    'user data', 'credentials', 'encryption'
  ];
  const taskText = plan.tasks.map(t => t.description).join(' ').toLowerCase();
  const matches = criticalKeywords.filter(kw => taskText.includes(kw)).length;
  return Math.min(20, matches * 5);
}

function scoreChangeSize(plan: ImplementationPlan): number {
  const totalFiles = new Set(plan.tasks.flatMap(t => t.files)).size;
  if (totalFiles < 3) return 2;
  if (totalFiles < 10) return 8;
  if (totalFiles < 20) return 14;
  return 20;
}

function scoreCoverage(plan: ImplementationPlan): number {
  // How much new code vs. existing code with tests
  const hasTests = plan.tasks.some(t => t.type === 'test');
  const hasNewCode = plan.tasks.some(t => t.type === 'implement');
  if (!hasNewCode) return 0;
  if (hasTests) return 5;
  return 15; // High risk if no tests planned
}

function scoreUncertainty(plan: ImplementationPlan): number {
  return plan.estimates.uncertainty === 'high' ? 15 :
         plan.estimates.uncertainty === 'medium' ? 8 : 3;
}

function determineRiskLevel(score: number): RiskLevel {
  if (score < 20) return 'low';
  if (score < 50) return 'medium';
  if (score < 70) return 'high';
  return 'critical';
}
```
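
As a quick sanity check, here is how the scoring plays out on a small hypothetical two-task plan (the plan object is illustrative only):

```typescript
// Illustrative only: a hypothetical plan exercising the scoring functions above.
const examplePlan = {
  tasks: [
    { id: 't1', title: 'Add auth middleware', description: 'Add conditional auth check',
      type: 'implement', dependencies: [], estimatedEffort: 4, files: ['src/auth.ts'] },
    { id: 't2', title: 'Middleware tests', description: 'Unit tests for middleware',
      type: 'test', dependencies: ['t1'], estimatedEffort: 2, files: ['src/auth.test.ts'] }
  ],
  estimates: { complexity: 3, effort: 6, uncertainty: 'medium' }
} as unknown as ImplementationPlan;

// complexity = min(20, (5 + 1) / 2) = 3, criticality = 5 ('auth' keyword),
// changeSize = 2 (fewer than 3 files), coverage = 5 (tests planned),
// uncertainty = 8 ('medium') → total = 23
const score = calculateRiskScore(examplePlan); // 23
const level = determineRiskLevel(score);       // 'medium' (20 ≤ 23 < 50)
```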

1.7 Memory Queries

Before planning, the agent queries memory for relevant patterns:

```typescript
async function queryRelevantMemories(
  task: string,
  ctx: AgentContext
): Promise<Memory[]> {
  const queries = [
    { type: 'episodic' as const, context: `planning similar task: ${task}`, limit: 5 },
    { type: 'semantic' as const, context: `architecture patterns for ${extractFeatureArea(task)}`, limit: 5 },
    { type: 'procedural' as const, context: `planning strategy`, limit: 3 }
  ];

  const results = await Promise.all(
    queries.map(q => ctx.memory.recall(q))
  );

  return results.flat();
}

function extractFeatureArea(task: string): string {
  // Simple keyword extraction - could use LLM for better results
  const keywords = ['auth', 'payment', 'search', 'notification', 'analytics'];
  return keywords.find(k => task.toLowerCase().includes(k)) || 'general';
}
```

1.8 Class Implementation

```typescript
import { BaseAgent } from './base';
import type { Agent, PhaseInput, PhaseOutput, AgentContext } from '../core/types';

export class PlannerAgent extends BaseAgent {
  type = 'planner' as const;
  tools = plannerTools;
  systemPrompt = PLANNER_SYSTEM_PROMPT; // From section 1.2

  async execute(
    input: PlannerInput,
    ctx: AgentContext
  ): Promise<ImplementationPlan> {
    // This calls the base agent loop which handles:
    // - Iteration tracking
    // - Circuit breaker checks
    // - Tool execution
    // - Reflection on completion
    return super.execute(input, ctx) as Promise<ImplementationPlan>;
  }

  protected buildPrompt(
    input: PlannerInput,
    memories: Memory[]
  ): string {
    let prompt = `# Task\n${input.task}\n\n`;

    if (input.codebaseContext) {
      prompt += `# Codebase Context\n`;
      prompt += `- Language: ${input.codebaseContext.language}\n`;
      prompt += `- Frameworks: ${input.codebaseContext.frameworks.join(', ')}\n`;
      prompt += `- Root: ${input.codebaseContext.rootPath}\n\n`;
    }

    if (memories.length > 0) {
      prompt += `# Relevant Learnings\n`;
      memories.forEach(m => {
        prompt += `- ${m.content} (confidence: ${m.confidence.toFixed(2)})\n`;
      });
      prompt += '\n';
    }

    prompt += `Create an implementation plan following the OUTPUT FORMAT.`;
    return prompt;
  }
}
```

2. Implementer Agent

2.1 Purpose

Takes an implementation plan and generates code + tests, with self-validation and refinement based on feedback from the Reviewer.

2.2 System Prompt

````markdown
# ROLE
You are a senior software engineer implementing features from specifications.

# TASK
Given an implementation plan (or a plan + fix findings from review), you will:
1. Read relevant files to understand existing patterns
2. Generate code following the task specifications
3. Create comprehensive tests for new code
4. Self-validate: run typecheck and tests
5. Fix issues if validation fails
6. If this is a bounce-back from review, address specific findings

# AVAILABLE TOOLS
- read_file(path): Read file contents
- write_file(path, content): Write/overwrite a file
- run_command(cmd): Execute a shell command (typecheck, test, lint)
- search_code(query): Search for code patterns
- llm_generate(prompt, context): Generate code with LLM assistance

# IMPLEMENTATION STRATEGY
**File-by-File Approach (MVP):**
- Process tasks sequentially
- For each task, identify target files
- Read existing code, understand patterns
- Generate changes, write files
- Validate changes (typecheck, test)
- Fix issues before moving to next task

**Future: Feature-by-Feature with Swarm:**
- Spawn sub-agents for parallel implementation
- Each sub-agent handles one feature/module
- Coordinate integration

# SELF-VALIDATION LOOP
After writing code:
1. Run typecheck → fix type errors
2. Run affected tests → fix failures
3. Run lint → auto-fix formatting
4. Loop until clean or max iterations (5)

# HANDLING REVIEW FEEDBACK
If input includes `fixFindings`:
- Read each finding (file, line, message)
- Understand the issue
- Apply fix
- Validate fix doesn't break tests
- Re-run self-validation

# OUTPUT FORMAT
```typescript
{
  files: FileChange[],
  testsAdded: string[],
  validated: boolean,
  validationResults: {
    typecheck: { passed: boolean, errors: string[] },
    tests: { passed: boolean, failures: string[] },
    lint: { passed: boolean, issues: string[] }
  }
}
```

# MEMORY INTEGRATION
Query memory for:
- Implementation patterns (context: language + feature)
- Common pitfalls (context: similar tasks)
- Test strategies (context: file type)

# ITERATION
Use tools iteratively to:
- Read code to understand patterns
- Generate code
- Validate code
- Fix issues

When done and validated, return `{ done: true, result: codeChanges }`.
````

2.3 Tools Available

```typescript
const implementerTools: Tool[] = [
  {
    name: 'read_file',
    description: 'Read file contents',
    schema: {
      input: z.object({ path: z.string() }),
      output: z.object({ content: z.string() })
    },
    execute: async ({ path }) => {
      return { content: await fs.readFile(path, 'utf-8') };
    }
  },
  {
    name: 'write_file',
    description: 'Write or overwrite a file',
    schema: {
      input: z.object({ path: z.string(), content: z.string() }),
      output: z.object({ success: z.boolean() })
    },
    execute: async ({ path, content }) => {
      await fs.writeFile(path, content, 'utf-8');
      return { success: true };
    }
  },
  {
    name: 'run_command',
    description: 'Execute a shell command',
    schema: {
      input: z.object({ cmd: z.string() }),
      output: z.object({
        exitCode: z.number(),
        stdout: z.string(),
        stderr: z.string()
      })
    },
    execute: async ({ cmd }, ctx) => {
      return await ctx.shell.exec(cmd);
    }
  },
  {
    name: 'search_code',
    description: 'Search for code patterns',
    schema: {
      input: z.object({ query: z.string(), filePattern: z.string().optional() }),
      output: z.object({
        results: z.array(z.object({
          file: z.string(),
          line: z.number(),
          snippet: z.string()
        }))
      })
    },
    execute: async ({ query, filePattern }) => {
      return { results: await searchCodebase(query, filePattern) };
    }
  },
  {
    name: 'llm_generate',
    description: 'Generate code using LLM',
    schema: {
      input: z.object({ prompt: z.string(), context: z.string().optional() }),
      output: z.object({ code: z.string() })
    },
    execute: async ({ prompt, context }, ctx) => {
      const response = await ctx.llm.chat({
        system: 'You are a code generation assistant. Output only code, no explanations.',
        messages: [{
          role: 'user',
          content: context ? `${context}\n\n${prompt}` : prompt
        }],
        temperature: 0.2
      });
      return { code: response.content };
    }
  }
];
```

2.4 Input/Output Types

```typescript
interface ImplementerInput {
  plan: ImplementationPlan;

  // Bounce-back from reviewer
  existingCode?: CodeChanges;
  fixFindings?: Finding[];

  // Bounce-back from tester
  fixFailures?: FailureAnalysis[];
}

interface CodeChanges {
  files: FileChange[];
  testsAdded: string[];
  validated: boolean;
  validationResults: {
    typecheck: ValidationResult;
    tests: ValidationResult;
    lint: ValidationResult;
  };
}

interface FileChange {
  path: string;
  action: 'created' | 'modified' | 'deleted';
  before?: string;
  after: string;
  diff?: string;
}

interface ValidationResult {
  passed: boolean;
  errors?: string[];
  warnings?: string[];
}
```

2.5 Implementation Strategy

┌─────────────────────────────────────────┐
│ 1. PERCEIVE                             │
│   - Load implementation plan            │
│   - Query memory for patterns           │
│   - If bounce-back, load findings       │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 2. For each task in plan.tasks:         │
│   ┌─────────────────────────────────┐   │
│   │ a. Read target files            │   │
│   │ b. Understand existing patterns │   │
│   │ c. Generate code                │   │
│   │ d. Write files                  │   │
│   │ e. Generate tests               │   │
│   │ f. Write test files             │   │
│   └──────────────┬──────────────────┘   │
│                  │                       │
└──────────────────┼───────────────────────┘
                   ▼
┌─────────────────────────────────────────┐
│ 3. SELF-VALIDATION LOOP                 │
│   (max 5 iterations)                    │
│   ┌─────────────────────────────────┐   │
│   │ a. Run typecheck                │   │
│   │ b. If errors → fix → retry      │   │
│   │ c. Run tests                    │   │
│   │ d. If failures → fix → retry    │   │
│   │ e. Run lint                     │   │
│   │ f. If issues → auto-fix         │   │
│   └─────────────────────────────────┘   │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 4. OUTPUT - Return CodeChanges          │
└─────────────────────────────────────────┘

2.6 Self-Validation Loop

```typescript
async function selfValidate(
  files: FileChange[],
  ctx: AgentContext
): Promise<ValidationResults> {
  let iteration = 0;
  const maxIterations = 5;

  while (iteration < maxIterations) {
    iteration++;

    // Typecheck
    const typecheckResult = await ctx.shell.exec('bun run typecheck');
    if (typecheckResult.exitCode !== 0) {
      // Try to fix type errors
      const fixes = await generateTypeFixes(typecheckResult.stderr, ctx);
      if (fixes) {
        await applyFixes(fixes);
        continue; // Retry
      } else {
        // Can't auto-fix, return error
        return {
          typecheck: { passed: false, errors: [typecheckResult.stderr] },
          tests: { passed: false },
          lint: { passed: false }
        };
      }
    }

    // Test
    const testResult = await ctx.shell.exec('bun test');
    if (testResult.exitCode !== 0) {
      const fixes = await generateTestFixes(testResult.stderr, ctx);
      if (fixes) {
        await applyFixes(fixes);
        continue;
      } else {
        return {
          typecheck: { passed: true },
          tests: { passed: false, errors: [testResult.stderr] },
          lint: { passed: false }
        };
      }
    }

    // Lint
    const lintResult = await ctx.shell.exec('bun run lint --fix');

    // All passed
    return {
      typecheck: { passed: true },
      tests: { passed: true },
      lint: { passed: lintResult.exitCode === 0 }
    };
  }

  // Max iterations reached
  throw new Error('Self-validation failed after max iterations');
}
```
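
The helpers generateTypeFixes, generateTestFixes, and applyFixes are referenced above but not specified in this plan. A minimal sketch of one of them, assuming fixes come back from the LLM as whole-file replacements (generateTestFixes would follow the same shape):

```typescript
// Hypothetical sketch of the fix helpers referenced in selfValidate. Assumes
// the LLM returns whole-file replacements as JSON; the real helpers may
// operate on diffs instead.
import fs from 'node:fs/promises';

interface FileFix { path: string; content: string }

async function generateTypeFixes(
  typecheckStderr: string,
  ctx: AgentContext
): Promise<FileFix[] | null> {
  const response = await ctx.llm.chat({
    system: 'You fix TypeScript compiler errors. Respond with JSON: ' +
      '[{ "path": "...", "content": "<full updated file>" }]',
    messages: [{ role: 'user', content: `Typecheck output:\n${typecheckStderr}` }],
    temperature: 0.1
  });
  try {
    return JSON.parse(response.content) as FileFix[];
  } catch {
    return null; // Unparseable → caller reports the raw error instead
  }
}

async function applyFixes(fixes: FileFix[]): Promise<void> {
  for (const fix of fixes) {
    await fs.writeFile(fix.path, fix.content, 'utf-8');
  }
}
```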

2.7 Handling Bounce-Back from Reviewer

When the reviewer sends back findings:

```typescript
async function handleReviewFindings(
  findings: Finding[],
  existingCode: CodeChanges,
  ctx: AgentContext
): Promise<CodeChanges> {
  for (const finding of findings) {
    // Read the file with the issue
    const file = await ctx.tools.read_file({ path: finding.file });

    // Generate fix
    const fix = await ctx.llm.chat({
      system: 'You are fixing code review findings.',
      messages: [{
        role: 'user',
        content: `
File: ${finding.file}
Line: ${finding.line}
Issue: ${finding.message}
Suggested fix: ${finding.suggestedFix || 'none'}

Current code:
${file.content}

Fix the issue and return the complete updated file.
`
      }],
      temperature: 0.1
    });

    // Write fixed file
    await ctx.tools.write_file({
      path: finding.file,
      content: fix.content
    });
  }

  // Re-validate
  const validationResults = await selfValidate(existingCode.files, ctx);

  return {
    ...existingCode,
    validated: validationResults.typecheck.passed && validationResults.tests.passed,
    validationResults
  };
}
```

2.8 Class Implementation

```typescript
export class ImplementerAgent extends BaseAgent {
  type = 'implementer' as const;
  tools = implementerTools;
  systemPrompt = IMPLEMENTER_SYSTEM_PROMPT;

  async execute(
    input: ImplementerInput,
    ctx: AgentContext
  ): Promise<CodeChanges> {
    // If this is a bounce-back, handle findings first
    if (input.fixFindings && input.existingCode) {
      return this.handleReviewFindings(input.fixFindings, input.existingCode, ctx);
    }

    if (input.fixFailures && input.existingCode) {
      return this.handleTestFailures(input.fixFailures, input.existingCode, ctx);
    }

    // Normal implementation flow
    return super.execute(input, ctx) as Promise<CodeChanges>;
  }

  protected buildPrompt(
    input: ImplementerInput,
    memories: Memory[]
  ): string {
    let prompt = `# Implementation Plan\n`;
    input.plan.tasks.forEach((task, i) => {
      prompt += `\n## Task ${i + 1}: ${task.title}\n`;
      prompt += `${task.description}\n`;
      prompt += `Files: ${task.files.join(', ')}\n`;
    });

    if (memories.length > 0) {
      prompt += `\n# Relevant Patterns\n`;
      memories.forEach(m => {
        prompt += `- ${m.content}\n`;
      });
    }

    prompt += `\nImplement the tasks following the strategy in the system prompt.`;
    return prompt;
  }

  private async handleReviewFindings(
    findings: Finding[],
    existingCode: CodeChanges,
    ctx: AgentContext
  ): Promise<CodeChanges> {
    // Implementation from section 2.7
    return handleReviewFindings(findings, existingCode, ctx);
  }

  private async handleTestFailures(
    failures: FailureAnalysis[],
    existingCode: CodeChanges,
    ctx: AgentContext
  ): Promise<CodeChanges> {
    // Similar to handleReviewFindings but for test failures
    // TODO: Implement
    return existingCode;
  }
}
```
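
The handleTestFailures path above is left as a TODO. A minimal sketch of how it could mirror handleReviewFindings, driven by the tester's FailureAnalysis output — note that mapping a failure to the changed source file via its stack trace is an assumption of this sketch, not part of the plan:

```typescript
// Hypothetical sketch for the TODO above. Mirrors handleReviewFindings, but
// drives fixes from the tester's root-cause analysis. Locating the changed
// file through the stack trace is an assumption of this sketch.
async function handleTestFailures(
  failures: FailureAnalysis[],
  existingCode: CodeChanges,
  ctx: AgentContext
): Promise<CodeChanges> {
  for (const failure of failures) {
    // Only act on failures the tester classified as real bugs with confidence
    if (failure.classification !== 'bug' || failure.confidence < 0.7) continue;

    // Find the changed file implicated by the stack trace
    const target = existingCode.files.find(f => failure.stackTrace.includes(f.path));
    if (!target) continue;

    const file = await ctx.tools.read_file({ path: target.path });
    const fix = await ctx.llm.chat({
      system: 'You are fixing a bug that causes a test failure.',
      messages: [{
        role: 'user',
        content:
          `Failing test: ${failure.testName}\n` +
          `Root cause: ${failure.rootCause}\n` +
          `Suggested fix: ${failure.suggestedFix || 'none'}\n\n` +
          `Error:\n${failure.errorMessage}\n\n` +
          `Current code (${target.path}):\n${file.content}\n\n` +
          `Fix the bug and return the complete updated file.`
      }],
      temperature: 0.1
    });
    await ctx.tools.write_file({ path: target.path, content: fix.content });
  }

  // Re-validate, as in handleReviewFindings
  const validationResults = await selfValidate(existingCode.files, ctx);
  return {
    ...existingCode,
    validated: validationResults.typecheck.passed && validationResults.tests.passed,
    validationResults
  };
}
```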

3. Reviewer Agent

3.1 Purpose

Performs multi-layer code review (static → security → AI) and calculates risk scores to determine gate decisions.

3.2 System Prompt

````markdown
# ROLE
You are a senior code reviewer performing AI-assisted code review.

# TASK
You are the **third layer** of a three-layer review pipeline:
1. Static Analysis (already run)
2. Security Scan (already run)
3. **AI Review** (your task)

Given:
- Code changes (diff)
- Static analysis results
- Security scan results

You will:
1. Review code for logic correctness
2. Identify edge cases not covered
3. Assess performance implications
4. Evaluate architecture fit
5. Check for maintainability issues

# AVAILABLE TOOLS
- run_linter(files): Run ESLint/Biome
- run_security_scan(files): Run SAST + secret detection
- llm_review(diff, context): Deep review with LLM
- read_file(path): Read file contents

# REVIEW FOCUS AREAS
**Logic Correctness:**
- Are edge cases handled?
- Are error conditions properly managed?
- Is the happy path correct?
- Are race conditions possible?

**Architecture Fit:**
- Does this follow existing patterns?
- Are abstractions appropriate?
- Is coupling minimized?
- Are responsibilities clear?

**Performance:**
- Are there obvious performance issues?
- Could this cause memory leaks?
- Are queries optimized?
- Is caching appropriate?

**Maintainability:**
- Is the code readable?
- Are comments helpful (not redundant)?
- Is complexity reasonable?
- Would a new developer understand this?

# FINDING DEDUPLICATION
Before outputting findings:
1. Merge duplicate findings (same file + line)
2. If static analysis already caught it, don't repeat
3. Prioritize: security > correctness > performance > style

# RISK SCORE CALCULATION
Risk score = weighted sum of:
- Complexity: 30%
- Criticality: 25%
- Change size: 20%
- Test coverage: 15%
- Findings severity: 10%

# GATE DECISION LOGIC
- **approve**: score < 20 AND no critical findings
- **request_changes**: 20 ≤ score < 50 OR has error-level findings
- **require_human**: score ≥ 50 OR has critical security findings

# OUTPUT FORMAT
```typescript
{
  findings: Finding[],
  riskScore: {
    total: number,
    level: 'low' | 'medium' | 'high' | 'critical',
    breakdown: { complexity, criticality, changeSize, coverage, findings }
  },
  decision: 'approve' | 'request_changes' | 'require_human',
  reasoning: string
}
```

# SKIP LOGIC
If risk.level === 'low' (from planner), you MAY skip AI review and rely only on static + security.
````

3.3 Tools Available

```typescript
const reviewerTools: Tool[] = [
  {
    name: 'run_linter',
    description: 'Run static analysis (ESLint/Biome)',
    schema: {
      input: z.object({ files: z.array(z.string()) }),
      output: z.object({
        issues: z.array(z.object({
          file: z.string(),
          line: z.number(),
          severity: z.enum(['info', 'warning', 'error']),
          message: z.string(),
          rule: z.string()
        }))
      })
    },
    execute: async ({ files }, ctx) => {
      const result = await ctx.shell.exec(`eslint ${files.join(' ')} --format json`);
      return { issues: JSON.parse(result.stdout) };
    }
  },
  {
    name: 'run_security_scan',
    description: 'Run security scans (SAST + secrets)',
    schema: {
      input: z.object({ files: z.array(z.string()) }),
      output: z.object({
        vulnerabilities: z.array(z.object({
          severity: z.enum(['low', 'medium', 'high', 'critical']),
          category: z.string(),
          file: z.string(),
          line: z.number(),
          description: z.string()
        }))
      })
    },
    execute: async ({ files }, ctx) => {
      // Run multiple security tools
      const sastResult = await ctx.shell.exec(`semgrep scan ${files.join(' ')} --json`);
      const secretResult = await ctx.shell.exec(`gitleaks detect --source . --verbose`);

      return {
        vulnerabilities: [
          ...parseSemgrepOutput(sastResult.stdout),
          ...parseGitleaksOutput(secretResult.stdout)
        ]
      };
    }
  },
  {
    name: 'llm_review',
    description: 'Perform deep AI code review',
    schema: {
      input: z.object({ diff: z.string(), context: z.string().optional() }),
      output: z.object({
        comments: z.array(z.object({
          file: z.string(),
          line: z.number(),
          severity: z.enum(['info', 'warning', 'error', 'critical']),
          category: z.enum(['logic', 'performance', 'maintainability', 'architecture']),
          message: z.string(),
          suggestedFix: z.string().optional()
        }))
      })
    },
    execute: async ({ diff, context }, ctx) => {
      const response = await ctx.llm.chat({
        system: `You are a code reviewer. Focus on logic, edge cases, performance, and maintainability.
Output a JSON array of review comments.`,
        messages: [{
          role: 'user',
          content: `${context || ''}\n\nDiff:\n${diff}\n\nProvide review comments as JSON.`
        }],
        temperature: 0.2
      });

      return { comments: JSON.parse(response.content) };
    }
  },
  {
    name: 'read_file',
    description: 'Read file for context',
    schema: {
      input: z.object({ path: z.string() }),
      output: z.object({ content: z.string() })
    },
    execute: async ({ path }) => {
      return { content: await fs.readFile(path, 'utf-8') };
    }
  }
];
```

3.4 Input/Output Types

```typescript
interface ReviewerInput {
  codeChanges: CodeChanges;
  planRisk?: RiskAssessment; // From planner, used for skip logic
}

interface ReviewResult {
  findings: Finding[];
  riskScore: RiskScore;
  decision: 'approve' | 'request_changes' | 'require_human';
  reasoning: string;
}

interface Finding {
  id: string;
  source: 'static' | 'security' | 'ai';
  file: string;
  line: number;
  severity: 'info' | 'warning' | 'error' | 'critical';
  category: 'style' | 'security' | 'correctness' | 'performance' | 'maintainability';
  message: string;
  confidence: number; // 0-1
  fixable: boolean;
  suggestedFix?: string;
}

interface RiskScore {
  total: number; // 0-100
  level: 'low' | 'medium' | 'high' | 'critical';
  breakdown: {
    complexity: number;
    criticality: number;
    changeSize: number;
    coverage: number;
    findings: number;
  };
}
```

3.5 Three-Layer Pipeline

┌─────────────────────────────────────────┐
│ LAYER 1: Static Analysis                │
│  - ESLint / Biome                       │
│  - TypeScript strict check              │
│  - Formatting check                     │
│  ⏱ Fast (seconds)                       │
│  💰 Cheap (free)                        │
│  ✓ Deterministic                        │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ LAYER 2: Security Scan                  │
│  - Secret detection (gitleaks)          │
│  - SAST (semgrep)                       │
│  - Dependency vulnerabilities (npm audit)│
│  ⏱ Fast (seconds)                       │
│  💰 Cheap (mostly free)                 │
│  ✓ High signal                          │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ LAYER 3: AI Review                      │
│  - Logic correctness                    │
│  - Edge cases                           │
│  - Performance implications             │
│  - Architecture fit                     │
│  ⏱ Slow (10-30 seconds)                 │
│  💰 Expensive ($0.01-0.10 per review)   │
│  ~ Variable quality                     │
│                                         │
│  SKIP if risk.level === 'low'           │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ Synthesis & Deduplication               │
│  - Merge duplicate findings             │
│  - Calculate final risk score           │
│  - Determine gate decision              │
└──────────────┬──────────────────────────┘
               ▼
         ReviewResult

3.6 Finding Deduplication

```typescript
function deduplicateFindings(
  staticFindings: Finding[],
  securityFindings: Finding[],
  aiFindings: Finding[]
): Finding[] {
  const allFindings = [
    ...staticFindings.map(f => ({ ...f, source: 'static' as const })),
    ...securityFindings.map(f => ({ ...f, source: 'security' as const })),
    ...aiFindings.map(f => ({ ...f, source: 'ai' as const }))
  ];

  // Group by file + line
  const grouped = new Map<string, Finding[]>();
  allFindings.forEach(f => {
    const key = `${f.file}:${f.line}`;
    if (!grouped.has(key)) {
      grouped.set(key, []);
    }
    grouped.get(key)!.push(f);
  });

  // Deduplicate: prefer security > static > ai
  const deduplicated: Finding[] = [];
  grouped.forEach(findings => {
    const security = findings.find(f => f.source === 'security');
    if (security) {
      deduplicated.push(security);
      return;
    }

    const static_ = findings.find(f => f.source === 'static');
    if (static_) {
      deduplicated.push(static_);
      return;
    }

    // Take AI finding with highest confidence
    const ai = findings
      .filter(f => f.source === 'ai')
      .sort((a, b) => b.confidence - a.confidence)[0];
    if (ai) {
      deduplicated.push(ai);
    }
  });

  return deduplicated;
}
```

3.7 Risk Score Calculation

```typescript
function calculateReviewRiskScore(
  changes: CodeChanges,
  findings: Finding[]
): RiskScore {
  // Complexity (0-30)
  const complexity = Math.min(30,
    changes.files.length * 2 + // More files = higher complexity
    findings.filter(f => f.category === 'maintainability').length * 5
  );

  // Criticality (0-25)
  const criticalityKeywords = ['auth', 'security', 'payment', 'data'];
  const criticalFiles = changes.files.filter(f =>
    criticalityKeywords.some(kw => f.path.includes(kw))
  );
  const criticality = Math.min(25, criticalFiles.length * 10);

  // Change size (0-20)
  const totalLines = changes.files.reduce((acc, f) => {
    const lines = f.diff?.split('\n')
      .filter(l => l.startsWith('+') || l.startsWith('-')).length || 0;
    return acc + lines;
  }, 0);
  const changeSize = Math.min(20, Math.floor(totalLines / 50));

  // Coverage (0-15)
  const hasTests = changes.testsAdded.length > 0;
  const coverage = hasTests ? 3 : 12;

  // Findings (0-10)
  const criticalCount = findings.filter(f => f.severity === 'critical').length;
  const errorCount = findings.filter(f => f.severity === 'error').length;
  const findingsScore = Math.min(10, criticalCount * 5 + errorCount * 2);

  const total = complexity + criticality + changeSize + coverage + findingsScore;

  return {
    total,
    level: total < 20 ? 'low' : total < 50 ? 'medium' : total < 70 ? 'high' : 'critical',
    breakdown: { complexity, criticality, changeSize, coverage, findings: findingsScore }
  };
}
```

3.8 Gate Decision Logic

```typescript
function determineGateDecision(
  riskScore: RiskScore,
  findings: Finding[]
): { decision: ReviewDecision; reasoning: string } {
  // Critical security findings always require human
  const criticalSecurity = findings.some(
    f => f.severity === 'critical' && f.category === 'security'
  );
  if (criticalSecurity) {
    return {
      decision: 'require_human',
      reasoning: 'Critical security finding requires human review'
    };
  }

  // High or critical risk requires human
  if (riskScore.level === 'high' || riskScore.level === 'critical') {
    return {
      decision: 'require_human',
      reasoning: `Risk level ${riskScore.level} (score: ${riskScore.total}) requires human oversight`
    };
  }

  // Any error-level findings require changes
  const hasErrors = findings.some(f => f.severity === 'error');
  if (hasErrors) {
    return {
      decision: 'request_changes',
      reasoning: 'Code has error-level findings that must be addressed'
    };
  }

  // Medium risk with warnings → request changes
  if (riskScore.level === 'medium' && findings.length > 0) {
    return {
      decision: 'request_changes',
      reasoning: 'Medium-risk change with findings should be refined'
    };
  }

  // Low risk, no critical issues → approve
  return {
    decision: 'approve',
    reasoning: `Low risk (score: ${riskScore.total}), no blocking issues`
  };
}
```

3.9 Class Implementation

```typescript
export class ReviewerAgent extends BaseAgent {
  type = 'reviewer' as const;
  tools = reviewerTools;
  systemPrompt = REVIEWER_SYSTEM_PROMPT;

  async execute(
    input: ReviewerInput,
    ctx: AgentContext
  ): Promise<ReviewResult> {
    // Layer 1: Static analysis
    const staticFindings = await this.runStaticAnalysis(input.codeChanges, ctx);

    // Layer 2: Security scan
    const securityFindings = await this.runSecurityScan(input.codeChanges, ctx);

    // Layer 3: AI review (skip if low risk)
    let aiFindings: Finding[] = [];
    if (!input.planRisk || input.planRisk.level !== 'low') {
      aiFindings = await this.runAIReview(input.codeChanges, ctx);
    } else {
      ctx.bus.emit({
        type: 'review.ai_skipped',
        payload: { reason: 'Low risk, static + security only' }
      });
    }

    // Deduplicate and synthesize
    const allFindings = deduplicateFindings(staticFindings, securityFindings, aiFindings);

    // Calculate risk
    const riskScore = calculateReviewRiskScore(input.codeChanges, allFindings);

    // Determine gate decision
    const { decision, reasoning } = determineGateDecision(riskScore, allFindings);

    return { findings: allFindings, riskScore, decision, reasoning };
  }

  private async runStaticAnalysis(
    changes: CodeChanges,
    ctx: AgentContext
  ): Promise<Finding[]> {
    const files = changes.files.map(f => f.path);
    const result = await this.tools.find(t => t.name === 'run_linter')!
      .execute({ files }, ctx as any);

    return (result as any).issues.map((issue: any) => ({
      id: ulid(),
      source: 'static',
      file: issue.file,
      line: issue.line,
      severity: issue.severity,
      category: 'style',
      message: issue.message,
      confidence: 1.0,
      fixable: true
    }));
  }

  private async runSecurityScan(
    changes: CodeChanges,
    ctx: AgentContext
  ): Promise<Finding[]> {
    const files = changes.files.map(f => f.path);
    const result = await this.tools.find(t => t.name === 'run_security_scan')!
      .execute({ files }, ctx as any);

    return (result as any).vulnerabilities.map((vuln: any) => ({
      id: ulid(),
      source: 'security',
      file: vuln.file,
      line: vuln.line,
      severity: vuln.severity,
      category: 'security',
      message: vuln.description,
      confidence: 0.95,
      fixable: false
    }));
  }

  private async runAIReview(
    changes: CodeChanges,
    ctx: AgentContext
  ): Promise<Finding[]> {
    const diff = changes.files.map(f => f.diff).join('\n');
    const result = await this.tools.find(t => t.name === 'llm_review')!
      .execute({ diff }, ctx as any);

    return (result as any).comments.map((comment: any) => ({
      id: ulid(),
      source: 'ai',
      file: comment.file,
      line: comment.line,
      severity: comment.severity,
      category: comment.category,
      message: comment.message,
      confidence: 0.7,
      fixable: !!comment.suggestedFix,
      suggestedFix: comment.suggestedFix
    }));
  }

  protected buildPrompt(input: ReviewerInput, memories: Memory[]): string {
    // AI review layer is invoked via the llm_review tool;
    // this method is not used in the current implementation
    return '';
  }
}
```

4. Tester Agent

4.1 Purpose

Selects and executes tests based on code changes, analyzes failures, and generates missing tests.

4.2 System Prompt

````markdown
# ROLE
You are a QA engineer responsible for test execution and analysis.

# TASK
Given code changes and existing test suite, you will:
1. Select which tests to run (risk-based selection)
2. Execute selected tests
3. Analyze any failures (root cause, classification)
4. Identify test coverage gaps
5. Generate missing tests if needed

# AVAILABLE TOOLS
- run_tests(pattern, options): Execute test suite
- read_file(path): Read file contents
- write_file(path, content): Write test files
- llm_analyze(text, question): Analyze failures with LLM
- get_coverage(files): Get coverage report

# TEST SELECTION STRATEGY
**Always run:**
- Tests covering changed files
- Tests importing changed modules

**Risk-based addition:**
- Low risk: Only above
- Medium risk: + integration tests for affected features
- High risk: + full test suite

# FAILURE ANALYSIS
For each failure:
1. Classify: real bug | flaky test | environment issue | test needs update
2. Root cause analysis (use LLM)
3. Suggest fix (if auto-fixable)
4. Confidence score (0-1)

Only bounce back to implementer if:
- Failure is a real bug
- Confidence > 0.7 in root cause
- Have suggested fix

Otherwise, escalate to human.

# TEST GENERATION
Generate tests if:
- Coverage delta < 70% (new code not tested)
- Critical paths not covered
- Edge cases identified but not tested

# OUTPUT FORMAT
```typescript
{
  summary: { total: number, passed: number, failed: number, skipped: number },
  coverage: {
    line: number,
    branch: number,
    function: number,
    diffCoverage: number // % of changed lines covered
  },
  failures: FailureAnalysis[],
  generatedTests: string[]
}
```

# FLAKINESS DETECTION
If a test fails, retry once. If passes on retry:
- Mark as flaky
- Report to monitoring
- Don't fail the build

# ITERATION
Use tools to:
- Execute tests
- Analyze failures
- Generate new tests
- Re-run tests

When all tests pass or failures analyzed, return `{ done: true, result: testResult }`.
````

4.3 Tools Available

```typescript
const testerTools: Tool[] = [
  {
    name: 'run_tests',
    description: 'Execute test suite',
    schema: {
      input: z.object({
        pattern: z.string().optional(),
        options: z.object({
          coverage: z.boolean().optional(),
          bail: z.boolean().optional(),
          timeout: z.number().optional()
        }).optional()
      }),
      output: z.object({
        exitCode: z.number(),
        results: z.array(z.object({
          file: z.string(),
          name: z.string(),
          status: z.enum(['passed', 'failed', 'skipped']),
          duration: z.number(),
          error: z.string().optional()
        })),
        coverage: z.object({
          line: z.number(),
          branch: z.number(),
          function: z.number()
        }).optional()
      })
    },
    execute: async ({ pattern, options }, ctx) => {
      const cmd = `bun test ${pattern || ''} ${options?.coverage ? '--coverage' : ''}`;
      const result = await ctx.shell.exec(cmd);
      return parseTestOutput(result);
    }
  },
  {
    name: 'read_file',
    description: 'Read file contents',
    schema: {
      input: z.object({ path: z.string() }),
      output: z.object({ content: z.string() })
    },
    execute: async ({ path }) => {
      return { content: await fs.readFile(path, 'utf-8') };
    }
  },
  {
    name: 'write_file',
    description: 'Write test file',
    schema: {
      input: z.object({ path: z.string(), content: z.string() }),
      output: z.object({ success: z.boolean() })
    },
    execute: async ({ path, content }) => {
      await fs.writeFile(path, content, 'utf-8');
      return { success: true };
    }
  },
  {
    name: 'llm_analyze',
    description: 'Analyze test failures',
    schema: {
      input: z.object({ failure: z.string(), context: z.string().optional() }),
      output: z.object({
        classification: z.enum(['bug', 'flaky', 'environment', 'test_outdated']),
        rootCause: z.string(),
        suggestedFix: z.string().optional(),
        confidence: z.number()
      })
    },
    execute: async ({ failure, context }, ctx) => {
      const response = await ctx.llm.chat({
        system: `You are analyzing test failures. Classify and diagnose.`,
        messages: [{
          role: 'user',
          content: `${context || ''}\n\nFailure:\n${failure}\n\nProvide analysis as JSON.`
        }],
        temperature: 0.1
      });
      return JSON.parse(response.content);
    }
  },
  {
    name: 'get_coverage',
    description: 'Get coverage for specific files',
    schema: {
      input: z.object({ files: z.array(z.string()) }),
      output: z.object({
        files: z.array(z.object({
          path: z.string(),
          line: z.number(),
          branch: z.number(),
          uncovered: z.array(z.number())
        }))
      })
    },
    execute: async ({ files }, ctx) => {
      // Parse coverage report for specific files
      return getCoverageForFiles(files);
    }
  }
];
```

4.4 Input/Output Types

```typescript
interface TesterInput {
  codeChanges: CodeChanges;
  planRisk?: RiskAssessment;
}

interface TestResult {
  summary: {
    total: number;
    passed: number;
    failed: number;
    skipped: number;
  };
  coverage: {
    line: number;
    branch: number;
    function: number;
    diffCoverage: number;
  };
  failures: FailureAnalysis[];
  generatedTests: string[];
}

interface FailureAnalysis {
  testName: string;
  testFile: string;
  classification: 'bug' | 'flaky' | 'environment' | 'test_outdated';
  rootCause: string;
  suggestedFix?: string;
  confidence: number;
  errorMessage: string;
  stackTrace: string;
}
```

4.5 Test Selection Strategy

┌─────────────────────────────────────────┐
│ 1. Identify Changed Files              │
│   - Files modified in codeChanges       │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 2. Find Direct Tests                    │
│   - Tests importing changed files       │
│   - Tests in same directory             │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 3. Risk-Based Expansion                 │
│   Low risk:    Direct tests only        │
│   Medium risk: + integration tests      │
│   High risk:   + full suite             │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 4. Execute Selected Tests               │
└─────────────────────────────────────────┘

```typescript
function selectTests(
  changes: CodeChanges,
  risk: RiskAssessment,
  allTests: string[]
): string[] {
  const changedFiles = new Set(changes.files.map(f => f.path));

  // Always run tests covering changed files
  const directTests = allTests.filter(testFile => {
    const imports = getImportsFromFile(testFile);
    return imports.some(imp => changedFiles.has(imp));
  });

  // Risk-based expansion
  if (risk.level === 'low') {
    return directTests;
  }

  if (risk.level === 'medium') {
    // Add integration tests for affected features
    const features = extractFeatures(changes);
    const integrationTests = allTests.filter(t =>
      t.includes('integration') && features.some(f => t.includes(f))
    );
    return [...directTests, ...integrationTests];
  }

  // High or critical: run everything
  return allTests;
}
```
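
selectTests relies on getImportsFromFile, which this plan does not specify. A rough sketch using a regex over import statements — a real implementation would resolve module specifiers through the TypeScript resolver rather than string matching:

```typescript
// Hypothetical sketch: extract relative import paths from a test file so
// selectTests can match them against changed files.
import fs from 'node:fs';
import path from 'node:path';

function getImportsFromFile(testFile: string): string[] {
  const source = fs.readFileSync(testFile, 'utf-8');
  const importRe = /from\s+['"](\.{1,2}\/[^'"]+)['"]/g;
  const imports: string[] = [];
  for (const match of source.matchAll(importRe)) {
    // Resolve relative to the test file and normalize to a .ts path
    const resolved = path.resolve(path.dirname(testFile), match[1]);
    imports.push(resolved.endsWith('.ts') ? resolved : `${resolved}.ts`);
  }
  return imports;
}
```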

4.6 Failure Analysis

```typescript
async function analyzeFailure(
  failure: TestFailure,
  ctx: AgentContext
): Promise<FailureAnalysis> {
  // Retry once to check for flakiness
  const retryResult = await ctx.tools.run_tests({ pattern: failure.testFile });
  const retryTest = retryResult.results.find(r => r.name === failure.testName);

  if (retryTest?.status === 'passed') {
    return {
      testName: failure.testName,
      testFile: failure.testFile,
      classification: 'flaky',
      rootCause: 'Test is flaky - passed on retry',
      confidence: 0.9,
      errorMessage: failure.error || '',
      stackTrace: failure.stackTrace || ''
    };
  }

  // Not flaky - analyze root cause
  const analysis = await ctx.tools.llm_analyze({
    failure: `${failure.error}\n${failure.stackTrace}`,
    context: `Test: ${failure.testName}\nFile: ${failure.testFile}`
  });

  return {
    testName: failure.testName,
    testFile: failure.testFile,
    classification: analysis.classification,
    rootCause: analysis.rootCause,
    suggestedFix: analysis.suggestedFix,
    confidence: analysis.confidence,
    errorMessage: failure.error || '',
    stackTrace: failure.stackTrace || ''
  };
}
```
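
The TestFailure and CoverageReport shapes consumed here and in section 4.7 are not defined in this plan; plausible minimal sketches, with field names inferred from usage:

```typescript
// Hypothetical shapes inferred from usage above; the real definitions may differ.
interface TestFailure {
  testName: string;
  testFile: string;
  error?: string;
  stackTrace?: string;
}

interface CoverageReport {
  line: number;
  branch: number;
  function: number;
  diffCoverage: number; // % of changed lines covered
  files: { path: string; line: number; branch: number; uncovered: number[] }[];
}
```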

4.7 Test Generation

```typescript
async function generateMissingTests(
  changes: CodeChanges,
  coverage: CoverageReport,
  ctx: AgentContext
): Promise<string[]> {
  const generatedFiles: string[] = [];

  // Find files with low coverage
  for (const file of changes.files) {
    const fileCoverage = coverage.files.find(f => f.path === file.path);

    if (!fileCoverage || fileCoverage.line < 70) {
      // Generate tests
      const sourceCode = await ctx.tools.read_file({ path: file.path });

      const testCode = await ctx.llm.chat({
        system: `Generate comprehensive tests for this code. Use the testing framework already in use.`,
        messages: [{
          role: 'user',
          content: `File: ${file.path}\n\n${sourceCode.content}\n\nGenerate tests covering all functions and edge cases.`
        }],
        temperature: 0.2
      });

      // Write test file
      const testPath = file.path.replace(/\.(ts|js)$/, '.test.$1');
      await ctx.tools.write_file({ path: testPath, content: testCode.content });
      generatedFiles.push(testPath);
    }
  }

  return generatedFiles;
}
```

4.8 Class Implementation

```typescript
export class TesterAgent extends BaseAgent {
  type = 'tester' as const;
  tools = testerTools;
  systemPrompt = TESTER_SYSTEM_PROMPT;

  async execute(
    input: TesterInput,
    ctx: AgentContext
  ): Promise<TestResult> {
    // Select tests
    const allTests = await this.getAllTests();
    const selectedTests = selectTests(
      input.codeChanges,
      input.planRisk || { level: 'medium' },
      allTests
    );

    ctx.bus.emit({
      type: 'test.selection',
      payload: { total: allTests.length, selected: selectedTests.length }
    });

    // Execute tests
    const testResults = await this.executeTests(selectedTests, ctx);

    // Analyze failures
    const failures = await this.analyzeFailures(testResults.failures, ctx);

    // Get coverage
    const coverage = await this.getCoverage(input.codeChanges, ctx);

    // Generate missing tests if needed
    let generatedTests: string[] = [];
    if (coverage.diffCoverage < 70) {
      generatedTests = await this.generateMissingTests(input.codeChanges, coverage, ctx);
    }

    return { summary: testResults.summary, coverage, failures, generatedTests };
  }

  private async getAllTests(): Promise<string[]> {
    return await glob('**/*.test.{ts,js}');
  }

  private async executeTests(
    tests: string[],
    ctx: AgentContext
  ): Promise<any> {
    const result = await this.tools.find(t => t.name === 'run_tests')!
      .execute({ pattern: tests.join(' '), options: { coverage: true } }, ctx as any);
    return result;
  }

  private async analyzeFailures(
    failures: TestFailure[],
    ctx: AgentContext
  ): Promise<FailureAnalysis[]> {
    return await Promise.all(failures.map(f => analyzeFailure(f, ctx)));
  }

  private async getCoverage(
    changes: CodeChanges,
    ctx: AgentContext
  ): Promise<CoverageReport> {
    const files = changes.files.map(f => f.path);
    const result = await this.tools.find(t => t.name === 'get_coverage')!
      .execute({ files }, ctx as any);
    return calculateCoverageMetrics(result);
  }

  private async generateMissingTests(
    changes: CodeChanges,
    coverage: CoverageReport,
    ctx: AgentContext
  ): Promise<string[]> {
    return await generateMissingTests(changes, coverage, ctx);
  }

  protected buildPrompt(input: TesterInput, memories: Memory[]): string {
    // Tester uses tools directly, prompt not heavily used
    return `Analyze test results for ${input.codeChanges.files.length} changed files.`;
  }
}
```

5. Deployer Agent

5.1 Purpose

Orchestrates deployment with build verification, human approval gates, canary rollout, and health monitoring.

5.2 System Prompt

````markdown
# ROLE
You are a deployment engineer responsible for safe production releases.

# TASK
Given validated code and test results, you will:
1. Verify build artifact
2. Request human approval (if production)
3. Execute canary deployment
4. Monitor health metrics during rollout
5. Auto-rollback if unhealthy
6. Complete full rollout if healthy

# AVAILABLE TOOLS
- run_command(cmd): Execute build/deploy commands
- github_api(endpoint, data): Interact with GitHub
- read_file(path): Read config files
- wait(ms): Wait for duration

# BUILD VERIFICATION
Before deployment:
1. Build artifact
2. Verify checksum
3. Check artifact size (flag if > 2x previous)
4. Validate manifest

# HUMAN APPROVAL GATE
Always required for:
- Production environment
- Breaking changes
- High/critical risk

Request approval with:
- Summary of changes
- Risk assessment
- Test results
- Findings summary

# CANARY DEPLOYMENT LOGIC
Based on risk level:
- Low risk: 5% → 25% → 100% (5 min between stages)
- Medium risk: 5% → 10% → 25% → 50% → 100% (10 min between stages)
- High risk: 5% → 10% → 25% → 50% → 100% (30 min between stages)

At each stage:
1. Shift traffic
2. Wait for stabilization
3. Check health metrics
4. If unhealthy → auto-rollback
5. If healthy → continue

# HEALTH CHECKS
Monitor:
- Error rate vs. baseline (threshold: +10%)
- Latency p95 vs. baseline (threshold: +50%)
- Throughput vs. baseline (threshold: -20%)

# ROLLBACK TRIGGERS
Auto-rollback if:
- Error rate > baseline + 1%
- Latency p95 > baseline * 1.5
- Any critical health check fails

# OUTPUT FORMAT
```typescript
{
  status: 'healthy' | 'degraded' | 'rolled_back',
  stages: DeploymentStage[],
  metrics: {
    errorRate: number,
    latency: { p50, p95, p99 },
    throughput: number
  },
  url: string
}
```

# ITERATION
Deployment is a linear process with checkpoints:
1. Build
2. Approval (if needed)
3. Deploy stage 1
4. Health check
5. Deploy stage 2
6. Health check
7. ... until complete or rollback

Return `{ done: true, result: deploymentResult }` when complete.
````

5.3 Tools Available

```typescript
const deployerTools: Tool[] = [
  {
    name: 'run_command',
    description: 'Execute shell command',
    schema: {
      input: z.object({ cmd: z.string() }),
      output: z.object({
        exitCode: z.number(),
        stdout: z.string(),
        stderr: z.string()
      })
    },
    execute: async ({ cmd }, ctx) => {
      return await ctx.shell.exec(cmd);
    }
  },
  {
    name: 'github_api',
    description: 'Interact with GitHub API',
    schema: {
      input: z.object({
        endpoint: z.string(),
        method: z.enum(['GET', 'POST', 'PUT', 'PATCH']),
        data: z.any().optional()
      }),
      output: z.object({ response: z.any() })
    },
    execute: async ({ endpoint, method, data }, ctx) => {
      const response = await fetch(`https://api.github.com${endpoint}`, {
        method,
        headers: {
          'Authorization': `token ${ctx.config.githubToken}`,
          'Content-Type': 'application/json'
        },
        body: data ? JSON.stringify(data) : undefined
      });
      return { response: await response.json() };
    }
  },
  {
    name: 'read_file',
    description: 'Read file',
    schema: {
      input: z.object({ path: z.string() }),
      output: z.object({ content: z.string() })
    },
    execute: async ({ path }) => {
      return { content: await fs.readFile(path, 'utf-8') };
    }
  },
  {
    name: 'wait',
    description: 'Wait for duration',
    schema: {
      input: z.object({ ms: z.number() }),
      output: z.object({ waited: z.number() })
    },
    execute: async ({ ms }) => {
      await new Promise(resolve => setTimeout(resolve, ms));
      return { waited: ms };
    }
  }
];
```

5.4 Input/Output Types

```typescript
interface DeployerInput {
  codeChanges: CodeChanges;
  testResults: TestResult;
  reviewResults: ReviewResult;
  environment: 'staging' | 'production';
}

interface DeploymentResult {
  status: 'healthy' | 'degraded' | 'rolled_back';
  stages: DeploymentStage[];
  metrics: {
    errorRate: number;
    latency: { p50: number; p95: number; p99: number };
    throughput: number;
  };
  url: string;
}

interface DeploymentStage {
  name: string;
  trafficPercent: number;
  startTime: Date;
  endTime?: Date;
  status: 'pending' | 'deploying' | 'healthy' | 'unhealthy' | 'rolled_back';
  healthChecks: HealthCheck[];
}

interface HealthCheck {
  timestamp: Date;
  metric: string;
  value: number;
  baseline: number;
  threshold: number;
  passed: boolean;
}
```

5.5 Deployment Flow

┌─────────────────────────────────────────┐
│ 1. Build Verification                   │
│   - Run build command                   │
│   - Verify artifact                     │
│   - Check size                          │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 2. Human Approval Gate                  │
│   (if production or high risk)          │
│   - Show summary                        │
│   - Wait for approval                   │
│   - If rejected → stop                  │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 3. Canary Stage 1 (5%)                  │
│   - Shift 5% traffic                    │
│   - Wait 5 minutes                      │
│   - Health check                        │
│   - If unhealthy → rollback             │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 4. Canary Stage 2 (25%)                 │
│   - Shift 25% traffic                   │
│   - Wait 5 minutes                      │
│   - Health check                        │
│   - If unhealthy → rollback             │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 5. Full Rollout (100%)                  │
│   - Shift 100% traffic                  │
│   - Final health check                  │
└──────────────┬──────────────────────────┘
               ▼
┌─────────────────────────────────────────┐
│ 6. Monitor                              │
│   - Continuous health monitoring        │
│   - Report metrics                      │
└─────────────────────────────────────────┘

5.6 Health Check Logic

```typescript
async function checkHealth(
  deployment: Deployment,
  baseline: Metrics
): Promise<{ healthy: boolean; issues: string[] }> {
  const current = await collectMetrics(deployment);
  const issues: string[] = [];

  // Error rate check
  const errorRateIncrease = current.errorRate - baseline.errorRate;
  if (errorRateIncrease > 0.01) { // 1% absolute increase
    issues.push(`Error rate increased by ${(errorRateIncrease * 100).toFixed(2)}%`);
  }

  // Latency check
  const latencyIncrease = current.latency.p95 / baseline.latency.p95;
  if (latencyIncrease > 1.5) { // 50% increase
    issues.push(`P95 latency increased by ${((latencyIncrease - 1) * 100).toFixed(0)}%`);
  }

  // Throughput check
  const throughputDecrease = current.throughput / baseline.throughput;
  if (throughputDecrease < 0.8) { // 20% decrease
    issues.push(`Throughput decreased by ${((1 - throughputDecrease) * 100).toFixed(0)}%`);
  }

  return { healthy: issues.length === 0, issues };
}
```
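
checkHealth depends on collectMetrics, which this plan does not define. A sketch assuming a JSON metrics endpoint — the URL and response shape are assumptions, not a real API (the Deployment shape is sketched under 5.7 below):

```typescript
// Hypothetical sketch: collectMetrics is referenced above but not specified.
// Assumes an internal metrics service exposing a JSON summary; the endpoint
// and field names are illustrative only.
interface Metrics {
  errorRate: number;
  latency: { p50: number; p95: number; p99: number };
  throughput: number;
}

async function collectMetrics(deployment: Deployment): Promise<Metrics> {
  const res = await fetch(
    `https://metrics.internal/api/services/${deployment.service}/summary?window=5m`
  );
  if (!res.ok) {
    throw new Error(`Metrics fetch failed: ${res.status}`);
  }
  return (await res.json()) as Metrics;
}
```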

5.7 Canary Stages by Risk

```typescript
function getCanaryStages(risk: RiskLevel): CanaryConfig {
  switch (risk) {
    case 'low':
      return {
        stages: [
          { percent: 5, waitMs: 5 * 60_000 },
          { percent: 25, waitMs: 5 * 60_000 },
          { percent: 100, waitMs: 0 }
        ]
      };
    case 'medium':
      return {
        stages: [
          { percent: 5, waitMs: 10 * 60_000 },
          { percent: 10, waitMs: 10 * 60_000 },
          { percent: 25, waitMs: 10 * 60_000 },
          { percent: 50, waitMs: 10 * 60_000 },
          { percent: 100, waitMs: 0 }
        ]
      };
    case 'high':
    case 'critical':
      return {
        stages: [
          { percent: 5, waitMs: 30 * 60_000 },
          { percent: 10, waitMs: 30 * 60_000 },
          { percent: 25, waitMs: 30 * 60_000 },
          { percent: 50, waitMs: 30 * 60_000 },
          { percent: 100, waitMs: 0 }
        ]
      };
  }
}
```
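
As a sanity check on these schedules, the minimum time to reach 100% traffic falls directly out of the wait times; a small helper (hypothetical, not part of the plan's API) makes that explicit:

```typescript
// Sum of per-stage waits, in minutes; ignores deploy and health-check time.
function minRolloutMinutes(config: CanaryConfig): number {
  return config.stages.reduce((total, stage) => total + stage.waitMs, 0) / 60_000;
}

// With the schedules above:
//   low           → 10 minutes
//   medium        → 40 minutes
//   high/critical → 120 minutes
```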

5.8 Rollback Logic

```typescript
async function rollback(
  deployment: Deployment,
  reason: string,
  ctx: AgentContext
): Promise<void> {
  ctx.bus.emit({
    type: 'deployment.rollback_started',
    payload: { deploymentId: deployment.id, reason }
  });

  // Shift all traffic back to the previous version
  await ctx.tools.run_command({
    cmd: `kubectl set image deployment/${deployment.service} app=${deployment.previousVersion}`
  });

  // Wait for rollback to complete
  await ctx.tools.wait({ ms: 30_000 });

  // Verify rollback succeeded
  const health = await checkHealth(deployment, deployment.baselineMetrics);
  if (!health.healthy) {
    // Rollback itself failed - escalate
    await ctx.gates.escalate({
      type: 'rollback_failed',
      deployment,
      reason: health.issues.join(', ')
    });
  }

  ctx.bus.emit({
    type: 'deployment.rollback_completed',
    payload: { deploymentId: deployment.id, healthy: health.healthy }
  });
}
```
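
A design note: the sketch pins the previous image explicitly, which is deterministic but requires tracking previousVersion. An alternative, assuming a standard Kubernetes Deployment that retains revision history, is to let Kubernetes restore the prior ReplicaSet and block until it settles:

```typescript
// Alternative: rely on Kubernetes revision history instead of a pinned tag.
await ctx.tools.run_command({
  cmd: `kubectl rollout undo deployment/${deployment.service}`
});
// Block until the rollout completes rather than sleeping a fixed 30s.
await ctx.tools.run_command({
  cmd: `kubectl rollout status deployment/${deployment.service} --timeout=120s`
});
```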

5.9 Class Implementation

```typescript
export class DeployerAgent extends BaseAgent {
  type = 'deployer' as const;
  tools = deployerTools;
  systemPrompt = DEPLOYER_SYSTEM_PROMPT;

  async execute(
    input: DeployerInput,
    ctx: AgentContext
  ): Promise<DeploymentResult> {
    // Build
    const artifact = await this.buildArtifact(ctx);

    // Human approval for production
    if (input.environment === 'production') {
      const approved = await this.requestApproval(input, ctx);
      if (!approved) {
        throw new Error('Deployment rejected by human');
      }
    }

    // Get baseline metrics
    const baseline = await this.getBaselineMetrics(ctx);

    // Determine canary stages
    const canaryConfig = getCanaryStages(input.reviewResults.riskScore.level);

    // Execute canary deployment
    const stages: DeploymentStage[] = [];
    for (const stage of canaryConfig.stages) {
      const stageResult = await this.deployStage(
        artifact,
        stage.percent,
        stage.waitMs,
        baseline,
        ctx
      );
      stages.push(stageResult);

      if (stageResult.status === 'unhealthy') {
        // Rollback
        await rollback(
          { id: 'current', service: 'app', previousVersion: 'v1' } as any,
          'Health check failed',
          ctx
        );
        return {
          status: 'rolled_back',
          stages,
          metrics: await this.getCurrentMetrics(ctx),
          url: ctx.config.appUrl
        };
      }
    }

    // Success
    return {
      status: 'healthy',
      stages,
      metrics: await this.getCurrentMetrics(ctx),
      url: ctx.config.appUrl
    };
  }

  private async buildArtifact(ctx: AgentContext): Promise<BuildArtifact> {
    const result = await this.tools.find(t => t.name === 'run_command')!
      .execute({ cmd: 'bun run build' }, ctx as any);
    if (result.exitCode !== 0) {
      throw new Error(`Build failed: ${result.stderr}`);
    }
    return {
      id: ulid(),
      version: 'v1',
      checksum: 'abc123',
      size: 1024,
      files: [],
      metadata: {}
    };
  }

  private async requestApproval(
    input: DeployerInput,
    ctx: AgentContext
  ): Promise<boolean> {
    const approval = await ctx.gates.requestHumanApproval({
      type: 'deploy_production',
      context: {
        files: input.codeChanges.files.length,
        risk: input.reviewResults.riskScore.level,
        tests: input.testResults.summary,
        findings: input.reviewResults.findings.length
      },
      riskAssessment: input.reviewResults.riskScore as any,
      automatedChecks: []
    });
    return approval.decision === 'approved';
  }

  private async getBaselineMetrics(ctx: AgentContext): Promise<Metrics> {
    // Fetch current production metrics
    return {
      errorRate: 0.001,
      latency: { p50: 100, p95: 250, p99: 500 },
      throughput: 1000
    };
  }

  private async deployStage(
    artifact: BuildArtifact,
    percent: number,
    waitMs: number,
    baseline: Metrics,
    ctx: AgentContext
  ): Promise<DeploymentStage> {
    const stage: DeploymentStage = {
      name: `${percent}%`,
      trafficPercent: percent,
      startTime: new Date(),
      status: 'deploying',
      healthChecks: []
    };

    // Shift traffic
    await this.tools.find(t => t.name === 'run_command')!
      .execute({ cmd: `kubectl set image deployment/app app=${artifact.version} --record` }, ctx as any);

    // Wait for stabilization
    await this.tools.find(t => t.name === 'wait')!
      .execute({ ms: waitMs }, ctx as any);

    // Health check
    const health = await checkHealth({ id: 'current' } as any, baseline);
    stage.status = health.healthy ? 'healthy' : 'unhealthy';
    stage.endTime = new Date();
    stage.healthChecks.push({
      timestamp: new Date(),
      metric: 'overall',
      value: health.healthy ? 1 : 0,
      baseline: 1,
      threshold: 1,
      passed: health.healthy
    });

    return stage;
  }

  private async getCurrentMetrics(ctx: AgentContext): Promise<any> {
    return {
      errorRate: 0.001,
      latency: { p50: 100, p95: 250, p99: 500 },
      throughput: 1000
    };
  }

  protected buildPrompt(input: DeployerInput, memories: Memory[]): string {
    return `Deploy to ${input.environment} with risk level ${input.reviewResults.riskScore.level}`;
  }
}
```
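
For orientation, a usage sketch; the wiring below (where ctx and the upstream results come from) is assumed, not specified by this plan:

```typescript
const deployer = new DeployerAgent();

const result = await deployer.execute(
  {
    codeChanges,               // from ImplementerAgent
    testResults,               // from TesterAgent
    reviewResults,             // from ReviewerAgent; carries riskScore
    environment: 'staging'     // 'production' would trigger the approval gate
  },
  ctx                          // AgentContext provided by the runtime
);

if (result.status === 'rolled_back') {
  // Surface the failing stage's health checks for diagnosis
  console.error('Rolled back:', result.stages.at(-1)?.healthChecks);
}
```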

6. Summary and Integration

6.1 Agent Dependencies

PlannerAgent
     │
     ▼
ImplementerAgent ◀───────────────────────────────────┐
     │                                               │
     ▼                                               │
ReviewerAgent ───(fixFindings if changes needed)─────┤
     │                                               │
     ▼ approved                                      │
TesterAgent ─────(fixFailures if tests fail)─────────┘
     │
     ▼ tests pass
DeployerAgent

6.2 Shared Context

All agents access shared context via AgentContext:

```typescript
interface AgentContext {
  traceId: string;
  bus: EventBus;
  memory: MemoryStore;
  llm: LLMProvider;
  tools: ToolRegistry;
  shell: ShellExecutor;
  config: RuntimeConfig;
  gates: HumanGateSystem;
  safety: SafetyControls;
  cost: CostTracker;
  elapsed: number;
}
```
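
As a sketch of how agents consume this context (the event names follow the convention used in the rollback example; the helper itself is hypothetical):

```typescript
// Wrap a unit of agent work with trace events on the shared bus.
async function withTrace<T>(
  ctx: AgentContext,
  step: string,
  fn: () => Promise<T>
): Promise<T> {
  ctx.bus.emit({ type: 'agent.step_started', payload: { traceId: ctx.traceId, step } });
  const result = await fn();
  ctx.bus.emit({ type: 'agent.step_completed', payload: { traceId: ctx.traceId, step } });
  return result;
}
```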

6.3 Build Order

  1. Week 1: BaseAgent + Planner (simple version, no memory queries)
  2. Week 2: Implementer (file-by-file, no bounce-back)
  3. Week 3: Reviewer (3-layer pipeline)
  4. Week 4: Tester (selection + execution)
  5. Week 5: Implementer bounce-back logic
  6. Week 6: Deployer
  7. Week 7: Memory integration for all agents
  8. Week 8: Polish and edge cases

6.4 Testing Strategy

Each agent should have:

  • Unit tests for decision logic (risk scoring, test selection, etc.; see the sketch after this list)
  • Integration tests with mock tools
  • End-to-end tests against a real codebase (small fixture)
  • Cost tracking tests (ensure within budget)
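
For instance, the canary-schedule logic from section 5.7 can be unit-tested directly; a sketch assuming Bun's built-in test runner (consistent with the bun run build step above):

```typescript
import { test, expect } from 'bun:test';

test('high-risk changes get the slow canary schedule', () => {
  const config = getCanaryStages('high');
  expect(config.stages.map(s => s.percent)).toEqual([5, 10, 25, 50, 100]);
  expect(config.stages[0].waitMs).toBe(30 * 60_000);
});
```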

7. Open Questions and Decisions Needed

  1. Planner parallelization: Should planning happen in parallel for large features? (Start sequential)
  2. Implementer swarm: When to introduce parallel sub-agents? (Post-MVP)
  3. Review confidence thresholds: What confidence scores should gate decisions? (Start conservative: 0.7)
  4. Test generation quality: How to validate generated tests are useful? (Coverage + manual review)
  5. Deployment rollback automation: Always auto-rollback or require human? (Auto-rollback with notification)

End of Agent Designs Implementation Plan