Observability and Telemetry System Design

Overview

This document specifies the observability infrastructure for agentic software development systems. It covers metrics, logging, tracing, dashboards, and alerting across all SDLC phases.

Date: 2026-02-06
Scope: End-to-end SDLC observability
Status: Specification


1. Metrics Architecture

1.1 Metric Categories

| Category | Description | Cardinality |
|---|---|---|
| Performance | Timing, latency, throughput | Medium |
| Reliability | Errors, failures, success rates | Low |
| Efficiency | Cost, resource utilization, token usage | Medium |
| Quality | Test results, coverage, defects | Medium |
| Agent | Loop iterations, decisions, learning | High |
| Business | Lead time, deployment frequency | Low |

1.2 Metric Definitions by Phase

Phase 1: Planning & Requirements

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.planning.duration | Histogram | seconds | Time to complete planning phase | project, complexity |
| sdlc.planning.iterations | Counter | count | Number of refinement iterations | project |
| sdlc.planning.human_interventions | Counter | count | Times human input was requested | reason |
| sdlc.planning.stories_generated | Counter | count | User stories created | project |
| sdlc.planning.architecture_options | Gauge | count | Architecture alternatives considered | project |
| agent.planner.token_usage | Counter | tokens | LLM tokens consumed | model, operation |
| agent.planner.cache_hit_ratio | Gauge | ratio | Cache effectiveness | model |
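
For illustration, the sketch below registers two of these instruments with the OpenTelemetry JS metrics API and records one planning run. It is a minimal sketch, assuming a MeterProvider has already been configured (see 1.3 and the checklist in section 7); the label values are hypothetical.

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('sdlc-planner');

// Instrument types follow the table: Histogram for durations, Counter for totals.
const planningDuration = meter.createHistogram('sdlc.planning.duration', {
  unit: 's',
  description: 'Time to complete planning phase',
});
const tokenUsage = meter.createCounter('agent.planner.token_usage', {
  unit: 'tokens',
  description: 'LLM tokens consumed',
});

// Record one planning run; label values here are illustrative.
planningDuration.record(142.7, { project: 'example-project', complexity: 'medium' });
tokenUsage.add(18432, { model: 'example-model', operation: 'story_generation' });
```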

Phase 2: Implementation

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.implementation.duration | Histogram | seconds | Time to implement | project, module_count |
| sdlc.implementation.lines_written | Counter | lines | Code lines generated | language, module |
| sdlc.implementation.lines_modified | Counter | lines | Code lines changed | language, operation |
| sdlc.implementation.modules_completed | Counter | count | Modules successfully implemented | project |
| sdlc.implementation.parallel_agents | Gauge | count | Concurrent worker agents | project |
| agent.implementer.loop_iterations | Counter | count | Agent loop iterations per task | task_type, outcome |
| agent.implementer.tool_calls | Counter | count | Tool invocations | tool_name, status |
| agent.implementer.tool_latency | Histogram | milliseconds | Tool execution time | tool_name |
| agent.implementer.validation_failures | Counter | count | Self-validation failures | failure_type |
| agent.implementer.cost_per_task | Histogram | USD | Cost per implementation task | complexity |
| code.complexity.cyclomatic | Gauge | score | Cyclomatic complexity | module, function |
| code.quality.maintainability_index | Gauge | score | Maintainability score | module |

Phase 3: Code Review

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.review.duration | Histogram | seconds | Time to complete review | review_type |
| sdlc.review.comments_generated | Counter | count | Total review comments | severity, analyzer |
| sdlc.review.comments_per_line | Gauge | ratio | Comment density | file |
| sdlc.review.false_positive_rate | Gauge | ratio | Incorrect AI suggestions | analyzer |
| sdlc.review.human_review_time | Histogram | seconds | Human reviewer time | risk_level |
| agent.reviewer.analysis_latency | Histogram | milliseconds | AI review response time | file_size |
| agent.reviewer.confidence_score | Gauge | score | AI confidence in findings | file |
| agent.reviewer.suggestions_applied | Counter | count | Auto-fixes applied | suggestion_type |
| quality.gate.passed | Counter | count | Gate check passes | gate_name |
| quality.gate.failed | Counter | count | Gate check failures | gate_name, reason |

Phase 4: Testing

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.testing.duration | Histogram | seconds | Total test execution time | test_suite |
| sdlc.testing.tests_executed | Counter | count | Total tests run | type, selector |
| sdlc.testing.tests_selected | Gauge | ratio | Percentage of test suite run | selection_strategy |
| sdlc.testing.pass_rate | Gauge | ratio | Test pass percentage | suite |
| sdlc.testing.coverage.line | Gauge | percent | Line coverage | module |
| sdlc.testing.coverage.branch | Gauge | percent | Branch coverage | module |
| sdlc.testing.coverage.function | Gauge | percent | Function coverage | module |
| sdlc.testing.flakiness_rate | Gauge | ratio | Flaky test percentage | test_name |
| agent.tester.generation_time | Histogram | seconds | Time to generate tests | target_type |
| agent.tester.tests_generated | Counter | count | Tests auto-generated | generation_type |
| test.execution.duration | Histogram | seconds | Individual test duration | test_name |
| test.failure.analysis_time | Histogram | seconds | RCA analysis duration | failure_type |

Phase 5: CI/CD & Deployment

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.cicd.lead_time | Histogram | seconds | Commit to production | project |
| sdlc.cicd.cycle_time | Histogram | seconds | Pipeline execution time | pipeline_id |
| sdlc.cicd.queue_time | Histogram | seconds | Time waiting for resources | runner_type |
| sdlc.cicd.success_rate | Gauge | ratio | Pipeline success rate | pipeline_id |
| sdlc.cicd.rollback_rate | Gauge | ratio | Percentage of rollbacks | project |
| sdlc.cicd.mean_time_to_recovery | Histogram | seconds | Recovery from failure | failure_type |
| sdlc.deploy.canary_duration | Histogram | seconds | Canary observation window | deployment_id |
| sdlc.deploy.traffic_shift_duration | Histogram | seconds | Time to shift traffic | strategy |
| pipeline.cache.hit_rate | Gauge | ratio | Build cache effectiveness | cache_type |
| pipeline.parallelization.efficiency | Gauge | ratio | Worker utilization | pipeline_id |
| deployment.validation.duration | Histogram | seconds | Post-deploy validation | check_type |
| deployment.error_rate.canary | Gauge | ratio | Error rate during canary | deployment_id |

Phase 6: Monitoring & Feedback

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.monitoring.alert_count | Counter | count | Alerts fired | severity, type |
| sdlc.monitoring.mean_time_to_detect | Histogram | seconds | Time to detect issues | detection_method |
| agent.monitor.anomalies_detected | Counter | count | Anomalies found | severity |
| agent.monitor.synthetic_pass_rate | Gauge | ratio | Synthetic test success | endpoint |
| sdlc.feedback.defect_escape_rate | Gauge | ratio | Defects found post-deploy | severity |
| sdlc.feedback.user_satisfaction | Gauge | score | User feedback scores | feature |

Cross-Cutting: Agent Behavior

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| agent.decision.count | Counter | count | Decisions made | agent, decision_type |
| agent.decision.confidence | Gauge | score | Confidence in decisions | agent |
| agent.decision.human_override | Counter | count | Human overrides | agent, reason |
| agent.learning.patterns_learned | Counter | count | New patterns stored | pattern_type |
| agent.learning.reflections | Counter | count | Reflection cycles | trigger_type |
| agent.memory.operations | Counter | count | Memory read/write | operation |
| agent.orchestrator.handoffs | Counter | count | Agent handoffs | from_agent, to_agent |
| agent.cost.total | Counter | USD | Total LLM cost | agent, model |
| agent.cost.per_request | Histogram | USD | Cost per request | agent |

1.3 Metric Collection Methods

```yaml
collection_methods:
  # Push-based: Agents send metrics directly
  push:
    - agent_internal_metrics    # Built-in agent telemetry
    - custom_business_metrics   # Application-specific

  # Pull-based: Scraped by collector
  pull:
    - prometheus_exporters      # Standard exporters
    - application_endpoints     # /metrics endpoints

  # Derived: Computed from other metrics
  derived:
    - rate_calculations         # rates_over_time()
    - aggregations              # sum by (label)
    - ratios                    # error_rate / total
```
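
For the pull path, one option is the Prometheus exporter from the OpenTelemetry JS SDK, which serves a scrapable /metrics endpoint. A minimal sketch, assuming the `@opentelemetry/exporter-prometheus` and `@opentelemetry/sdk-metrics` packages; the port is the exporter's documented default.

```typescript
import { metrics } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// The exporter is a MetricReader that serves /metrics on port 9464 by default.
const exporter = new PrometheusExporter({ port: 9464 }, () => {
  console.log('Scrape endpoint ready at http://localhost:9464/metrics');
});

// Register it globally so instruments created via metrics.getMeter(...)
// are exported on scrape rather than pushed.
const provider = new MeterProvider({ readers: [exporter] });
metrics.setGlobalMeterProvider(provider);
```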

2. Logging and Tracing Strategy

2.1 Structured Logging

Log Levels

| Level | Usage | Retention |
|---|---|---|
| DEBUG | Detailed agent reasoning, tool inputs/outputs | 7 days |
| INFO | Normal operations, phase transitions | 30 days |
| WARN | Degraded performance, retry attempts | 90 days |
| ERROR | Failures, exceptions, circuit breaker triggers | 1 year |
| CRITICAL | Safety stops, human escalation | Permanent |

Log Schema

```typescript
interface AgentLogEntry {
  // Identity
  timestamp: string;              // ISO 8601 with nanoseconds
  trace_id: string;               // W3C trace context
  span_id: string;
  parent_span_id?: string;

  // Source
  agent_id: string;
  agent_type: 'planner' | 'implementer' | 'reviewer' | 'tester' | 'deployer' | 'monitor' | 'orchestrator';
  phase: 'planning' | 'implementation' | 'review' | 'testing' | 'deployment' | 'monitoring';
  version: string;

  // Content
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'CRITICAL';
  message: string;
  event_type: string;

  // Context
  context: {
    project_id: string;
    task_id: string;
    iteration?: number;
    session_id: string;
  };

  // Payload
  payload?: {
    input?: unknown;
    output?: unknown;
    duration_ms?: number;
    token_usage?: {
      prompt: number;
      completion: number;
      total: number;
    };
    cost_usd?: number;
  };

  // Metadata
  labels: Record<string, string>;
}
```
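
To keep emitters honest about this schema, a thin helper can fill in the identity fields from the active OpenTelemetry span. A sketch under stated assumptions: `emit` is a hypothetical transport that writes NDJSON to stdout for collection by the log pipeline in section 3.

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical transport: one JSON object per line, picked up by Loki/Elasticsearch.
function emit(entry: AgentLogEntry): void {
  process.stdout.write(JSON.stringify(entry) + '\n');
}

function logEvent(
  base: Pick<AgentLogEntry, 'agent_id' | 'agent_type' | 'phase' | 'version' | 'context' | 'labels'>,
  level: AgentLogEntry['level'],
  event_type: string,
  message: string,
  payload?: AgentLogEntry['payload'],
): void {
  // Pull trace identity from the active span so logs and traces correlate.
  const spanContext = trace.getActiveSpan()?.spanContext();
  emit({
    ...base,
    timestamp: new Date().toISOString(),
    trace_id: spanContext?.traceId ?? '',
    span_id: spanContext?.spanId ?? '',
    level,
    event_type,
    message,
    payload,
  });
}
```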

2.2 Event Schemas

Event Types by Phase

```typescript
// Planning Phase Events
interface PlanningStarted {
  event_type: 'planning.started';
  payload: {
    requirements_summary: string;
    estimated_complexity: 'low' | 'medium' | 'high';
  };
}

interface PlanningCompleted {
  event_type: 'planning.completed';
  payload: {
    stories_count: number;
    architecture_selected: string;
    duration_ms: number;
  };
}

// Implementation Phase Events
interface ImplementationStarted {
  event_type: 'implementation.started';
  payload: {
    module_count: number;
    parallel_workers: number;
  };
}

interface ToolExecution {
  event_type: 'tool.executed';
  payload: {
    tool_name: string;
    tool_version: string;
    input_hash: string;
    output_hash: string;
    duration_ms: number;
    success: boolean;
    retry_count: number;
  };
}

interface AgentLoopIteration {
  event_type: 'agent.loop_iteration';
  payload: {
    iteration_number: number;
    perception_summary: string;
    reasoning_summary: string;
    action_taken: string;
    outcome: 'success' | 'failure' | 'retry';
  };
}

// Review Phase Events
interface ReviewStarted {
  event_type: 'review.started';
  payload: {
    files_changed: number;
    lines_changed: number;
    risk_score: number;
  };
}

interface FindingDetected {
  event_type: 'review.finding_detected';
  payload: {
    severity: 'low' | 'medium' | 'high' | 'critical';
    category: string;
    file: string;
    line: number;
    analyzer: string;
    confidence: number;
    message: string;
  };
}

interface ReviewCompleted {
  event_type: 'review.completed';
  payload: {
    findings_count: number;
    by_severity: Record<string, number>;
    approved: boolean;
    requires_human: boolean;
  };
}

// Testing Phase Events
interface TestExecutionStarted {
  event_type: 'test.execution_started';
  payload: {
    test_count: number;
    selection_strategy: string;
    estimated_duration_ms: number;
  };
}

interface TestCompleted {
  event_type: 'test.completed';
  payload: {
    test_id: string;
    test_name: string;
    suite: string;
    result: 'passed' | 'failed' | 'skipped' | 'flaky';
    duration_ms: number;
    assertions: number;
    error?: string;
  };
}

interface CoverageReported {
  event_type: 'test.coverage_reported';
  payload: {
    line_coverage: number;
    branch_coverage: number;
    function_coverage: number;
    uncovered_lines: number;
  };
}

// Deployment Phase Events
interface DeploymentStarted {
  event_type: 'deployment.started';
  payload: {
    deployment_id: string;
    strategy: 'canary' | 'blue_green' | 'rolling' | 'immediate';
    target_environment: string;
    artifact_version: string;
  };
}

interface TrafficShifted {
  event_type: 'deployment.traffic_shifted';
  payload: {
    deployment_id: string;
    previous_percent: number;
    new_percent: number;
    duration_ms: number;
  };
}

interface DeploymentValidated {
  event_type: 'deployment.validated';
  payload: {
    deployment_id: string;
    checks_passed: number;
    checks_failed: number;
    error_rate: number;
    latency_p95: number;
  };
}

interface RollbackInitiated {
  event_type: 'deployment.rollback_initiated';
  payload: {
    deployment_id: string;
    reason: string;
    trigger: 'automatic' | 'manual';
    metrics_at_trigger: Record<string, number>;
  };
}

// Monitoring Phase Events
interface AnomalyDetected {
  event_type: 'monitoring.anomaly_detected';
  payload: {
    metric: string;
    expected_value: number;
    actual_value: number;
    deviation_percent: number;
    severity: string;
  };
}

interface AlertFired {
  event_type: 'monitoring.alert_fired';
  payload: {
    alert_id: string;
    alert_name: string;
    severity: 'warning' | 'critical' | 'emergency';
    condition: string;
    value: number;
    threshold: number;
  };
}

// Orchestration Events
interface PhaseTransition {
  event_type: 'orchestration.phase_transition';
  payload: {
    from_phase: string;
    to_phase: string;
    checkpoint_id: string;
    duration_in_previous_ms: number;
  };
}

interface HumanInterventionRequested {
  event_type: 'orchestration.human_intervention_requested';
  payload: {
    reason: string;
    urgency: 'low' | 'medium' | 'high';
    estimated_response_time: number;
    context_summary: string;
  };
}

interface CheckpointCreated {
  event_type: 'orchestration.checkpoint_created';
  payload: {
    checkpoint_id: string;
    phase: string;
    state_size_bytes: number;
    validation_passed: boolean;
  };
}
```
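
Because each interface fixes a literal `event_type`, the schemas compose into a discriminated union, and TypeScript narrows the payload type per branch at handling sites. A brief sketch over a subset of the events above:

```typescript
// Subset shown for brevity; the full union would include every event interface above.
type SDLCEvent =
  | ToolExecution
  | AgentLoopIteration
  | TestCompleted
  | RollbackInitiated;

function handleEvent(event: SDLCEvent): void {
  switch (event.event_type) {
    case 'tool.executed':
      // Narrowed to ToolExecution: payload.tool_name is type-safe here.
      console.log(`tool ${event.payload.tool_name} took ${event.payload.duration_ms}ms`);
      break;
    case 'deployment.rollback_initiated':
      console.warn(`rollback of ${event.payload.deployment_id}: ${event.payload.reason}`);
      break;
    default:
      break;
  }
}
```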

2.3 Distributed Tracing

Trace Structure

Trace: sdlc_execution (root)
├── Span: planning_phase
│   ├── Span: requirements_analysis
│   ├── Span: architecture_design
│   └── Span: story_generation
├── Span: implementation_phase
│   ├── Span: module_implementation (parallel)
│   │   ├── Span: code_generation
│   │   ├── Span: self_validation
│   │   └── Span: iteration_loop (repeated)
│   └── Span: integration
├── Span: review_phase
│   ├── Span: static_analysis
│   ├── Span: security_scan
│   └── Span: ai_review
├── Span: testing_phase
│   ├── Span: test_selection
│   ├── Span: test_execution (parallel per worker)
│   └── Span: failure_analysis
├── Span: deployment_phase
│   ├── Span: build
│   ├── Span: canary_deploy
│   │   ├── Span: traffic_shift (repeated)
│   │   └── Span: health_check (repeated)
│   └── Span: full_rollout
└── Span: monitoring_phase
    ├── Span: synthetic_test
    └── Span: metric_collection

Span Attributes

```typescript
interface SDLCSpanAttributes {
  // Standard OpenTelemetry attributes
  'service.name': string;
  'service.version': string;
  'deployment.environment': string;

  // SDLC-specific attributes
  'sdlc.project_id': string;
  'sdlc.phase': string;
  'sdlc.agent_id': string;
  'sdlc.agent_type': string;

  // Cost attributes
  'cost.tokens_prompt': number;
  'cost.tokens_completion': number;
  'cost.usd': number;

  // Performance attributes
  'performance.iteration_count': number;
  'performance.tool_calls': number;
  'performance.cache_hits': number;
}
```
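
These attributes would typically be set when a phase span is opened. A minimal sketch with the OpenTelemetry JS tracing API, assuming a configured TracerProvider; the attribute keys mirror the interface above and the values are illustrative.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('sdlc-orchestrator');

async function runReviewPhase(projectId: string): Promise<void> {
  await tracer.startActiveSpan('review_phase', async (span) => {
    span.setAttribute('sdlc.project_id', projectId);
    span.setAttribute('sdlc.phase', 'review');
    span.setAttribute('sdlc.agent_type', 'reviewer');
    try {
      // ... static analysis, security scan, AI review run as child spans ...
      span.setAttribute('performance.tool_calls', 12); // illustrative value
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```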

2.4 Context Propagation

```typescript
// W3C Trace Context propagation
interface TraceContext {
  traceparent: string;  // 00-{trace_id}-{span_id}-{flags}
  tracestate: string;   // Vendor-specific context
}

// SDLC-specific baggage
interface SDLCBaggage {
  'sdlc.project_id': string;
  'sdlc.session_id': string;
  'sdlc.human_owner': string;
  'sdlc.criticality': 'low' | 'medium' | 'high';
}
```
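
In practice, propagation means injecting the active context into whatever carrier moves between agents (HTTP headers, queue message metadata) and extracting it on the other side. A sketch with the OpenTelemetry JS propagation API, assuming the W3C trace context and baggage propagators are registered; the baggage keys follow the interface above.

```typescript
import { context, propagation } from '@opentelemetry/api';

// Sender: attach SDLC baggage, then inject traceparent/tracestate/baggage
// into a plain header map carried on the inter-agent message.
const baggage = propagation.createBaggage({
  'sdlc.project_id': { value: 'example-project' },
  'sdlc.criticality': { value: 'high' },
});
const outgoing: Record<string, string> = {};
propagation.inject(propagation.setBaggage(context.active(), baggage), outgoing);

// Receiver: restore the context so new spans join the same trace.
function onMessage(headers: Record<string, string>): void {
  const restored = propagation.extract(context.active(), headers);
  context.with(restored, () => {
    // Spans started here become children of the sender's span.
  });
}
```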

3. Storage Recommendations

3.1 Storage Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Storage Layer                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Hot Storage │  │  Warm Storage│  │  Cold Storage│          │
│  │  (Real-time) │  │  (Analytics) │  │  (Archive)   │          │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤          │
│  │ Prometheus   │  │ ClickHouse   │  │ S3/GCS       │          │
│  │ Redis        │  │ BigQuery     │  │ Glacier      │          │
│  │ InfluxDB     │  │ Snowflake    │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                 │
│  ┌──────────────────────────────────────────────────┐          │
│  │              Event Store                         │          │
│  │  (Kafka / Pulsar / EventStoreDB)                │          │
│  └──────────────────────────────────────────────────┘          │
│                                                                 │
│  ┌──────────────────────────────────────────────────┐          │
│  │              Trace Store                         │          │
│  │  (Jaeger / Tempo / Honeycomb)                   │          │
│  └──────────────────────────────────────────────────┘          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.2 Storage by Data Type

| Data Type | Primary Store | Retention | Query Pattern |
|---|---|---|---|
| Metrics (real-time) | Prometheus | 15 days | Time-series aggregation |
| Metrics (long-term) | Thanos/Cortex | 2 years | Historical trends |
| Metrics (analytics) | ClickHouse | 5 years | OLAP queries |
| Logs (hot) | Loki/Elasticsearch | 7 days | Full-text search |
| Logs (warm) | S3 + Athena | 90 days | Infrequent queries |
| Logs (cold) | Glacier | 7 years | Compliance only |
| Events | Kafka + ClickHouse | 2 years | Event sourcing |
| Traces | Jaeger/Tempo | 7 days | Distributed tracing |
| Traces (sampled) | Long-term store | 30 days | Error analysis |
| Checkpoints | S3 + EFS | 30 days | Recovery |
| Learnings/Memory | Vector DB (Pinecone) | Permanent | Semantic search |

3.3 Schema Design

Time-Series Metrics Schema (ClickHouse)

```sql
CREATE TABLE sdlc_metrics (
    timestamp DateTime64(9),
    metric_name LowCardinality(String),
    metric_value Float64,

    -- Labels
    project_id LowCardinality(String),
    phase LowCardinality(String),
    agent_id LowCardinality(String),
    agent_type LowCardinality(String),
    environment LowCardinality(String),

    -- Dynamic labels as Map
    labels Map(LowCardinality(String), String),

    -- Aggregation
    INDEX idx_project project_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_phase phase TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (metric_name, project_id, timestamp);
```

Events Schema (ClickHouse)

```sql
CREATE TABLE sdlc_events (
    timestamp DateTime64(9),
    trace_id String,
    span_id String,
    parent_span_id String,
    event_type LowCardinality(String),
    level LowCardinality(String),
    agent_id LowCardinality(String),
    agent_type LowCardinality(String),
    phase LowCardinality(String),
    project_id LowCardinality(String),
    message String,
    payload JSON,

    -- Cost tracking
    cost_usd Float64,
    tokens_prompt UInt32,
    tokens_completion UInt32,

    -- Performance
    duration_ms UInt32,

    INDEX idx_trace trace_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_event_type event_type TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (event_type, project_id, timestamp);
```
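
For reference, writing an event row from the agent side could look like the sketch below. It assumes the `@clickhouse/client` Node package and a local HTTP endpoint; the column names follow the DDL above and all values are illustrative.

```typescript
import { createClient } from '@clickhouse/client';

// Assumed local ClickHouse HTTP endpoint; adjust for your deployment.
const clickhouse = createClient({ url: 'http://localhost:8123' });

// Insert one row shaped like the sdlc_events DDL above.
async function recordEvent(): Promise<void> {
  await clickhouse.insert({
    table: 'sdlc_events',
    values: [
      {
        timestamp: new Date().toISOString(),
        trace_id: '0af7651916cd43dd8448eb211c80319c',
        span_id: 'b7ad6b7169203331',
        parent_span_id: '',
        event_type: 'tool.executed',
        level: 'INFO',
        agent_id: 'implementer-07',
        agent_type: 'implementer',
        phase: 'implementation',
        project_id: 'example-project',
        message: 'tool run completed',
        payload: { tool_name: 'linter', success: true },
        cost_usd: 0,
        tokens_prompt: 0,
        tokens_completion: 0,
        duration_ms: 412,
      },
    ],
    format: 'JSONEachRow',
  });
}
```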

Trace Schema (Jaeger/Tempo compatible)

```yaml
trace:
  trace_id: string
  span_id: string
  parent_span_id: string
  service_name: string
  operation_name: string
  start_time: timestamp
  duration_ms: int
  tags:
    - key: string
      value: any
      type: string
  logs:
    - timestamp: timestamp
      fields: map<string, any>
  references:
    - ref_type: child_of | follows_from
      trace_id: string
      span_id: string
```

3.4 Sampling Strategies

```yaml
sampling:
  # Head-based sampling for normal operations
  head_sampling:
    rate: 0.1  # 10% of traces

  # Tail-based sampling for interesting traces
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_codes: [ERROR]
      # Sample slow operations
      - name: slow_requests
        type: latency
        threshold_ms: 5000
      # Sample high-cost operations
      - name: expensive
        type: attribute
        key: cost.usd
        threshold: 0.10
      # Sample specific phases
      - name: critical_phases
        type: attribute
        key: sdlc.phase
        values: [deployment, review]
```
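
The tail-sampling policies above reduce to a keep/drop decision over a completed trace. A simplified TypeScript sketch of that decision logic (the field names are assumptions for illustration, not a real collector API):

```typescript
// Simplified view of a completed trace for sampling decisions.
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
  costUsd: number;
  phase: string;
}

// Mirrors the tail_sampling policies above: any matching policy keeps the
// trace; otherwise fall back to the 10% head-sampling rate.
function shouldKeep(t: CompletedTrace): boolean {
  if (t.hasError) return true;                                  // errors policy
  if (t.durationMs > 5000) return true;                         // slow_requests policy
  if (t.costUsd > 0.10) return true;                            // expensive policy
  if (['deployment', 'review'].includes(t.phase)) return true;  // critical_phases policy
  return Math.random() < 0.1;                                   // head_sampling rate
}
```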

4. Query Patterns

4.1 Operational Queries

Real-Time Monitoring

```promql
# Error rate by phase
sum(rate(sdlc_events_total{level="ERROR"}[5m])) by (phase)
  / sum(rate(sdlc_events_total[5m])) by (phase)

# Current active agents
sum(sdlc_active_agents) by (agent_type)

# Deployment success rate (last hour)
sum(rate(sdlc_deployment_completed_total{status="success"}[1h]))
  / sum(rate(sdlc_deployment_completed_total[1h]))

# Cost per project (current day)
sum(increase(agent_cost_total_usd[1d])) by (project_id)
```

Health Checks

```sql
-- Stuck agents (no activity for >5 minutes)
SELECT
    agent_id,
    agent_type,
    max(timestamp) as last_seen,
    now() - max(timestamp) as idle_duration
FROM sdlc_events
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY agent_id, agent_type
HAVING idle_duration > 300;

-- Failed deployments requiring attention
SELECT
    deployment_id,
    project_id,
    timestamp,
    payload.reason
FROM sdlc_events
WHERE event_type = 'deployment.rollback_initiated'
  AND timestamp > now() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;
```

4.2 Debugging Queries

Trace Analysis

```sql
-- Find slow traces
SELECT
    trace_id,
    duration_ms,
    phase,
    agent_id
FROM sdlc_spans
WHERE duration_ms > 60000
  AND timestamp > now() - INTERVAL 1 HOUR
ORDER BY duration_ms DESC
LIMIT 100;

-- Error trace details
SELECT
    trace_id,
    span_id,
    event_type,
    message,
    payload.error
FROM sdlc_events
WHERE level = 'ERROR'
  AND trace_id IN (
    SELECT trace_id FROM sdlc_events
    WHERE event_type = 'orchestration.phase_transition'
      AND timestamp > now() - INTERVAL 1 HOUR
  )
ORDER BY timestamp;
```

Agent Behavior Analysis

```sql
-- Agent loop efficiency
SELECT
    agent_id,
    avg(iteration_count) as avg_iterations,
    avg(duration_ms) as avg_duration,
    countIf(outcome = 'success') / count() as success_rate
FROM sdlc_events
WHERE event_type = 'agent.loop_iteration'
  AND timestamp > now() - INTERVAL 24 HOURS
GROUP BY agent_id;

-- Tool usage patterns
SELECT
    payload.tool_name,
    count() as call_count,
    avg(payload.duration_ms) as avg_latency,
    countIf(payload.success = false) / count() as error_rate
FROM sdlc_events
WHERE event_type = 'tool.executed'
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY payload.tool_name
ORDER BY call_count DESC;
```

4.3 Analytics Queries

Performance Trends

```sql
-- Phase duration trends
SELECT
    toStartOfDay(timestamp) as day,
    phase,
    avg(duration_ms) / 1000 as avg_duration_seconds,
    quantile(0.95)(duration_ms) / 1000 as p95_duration_seconds
FROM sdlc_events
WHERE event_type LIKE '%.completed'
  AND timestamp > now() - INTERVAL 30 DAYS
GROUP BY day, phase
ORDER BY day, phase;

-- Cost trends by agent type
SELECT
    toStartOfWeek(timestamp) as week,
    agent_type,
    sum(cost_usd) as total_cost,
    avg(cost_usd) as avg_per_operation
FROM sdlc_events
WHERE cost_usd > 0
  AND timestamp > now() - INTERVAL 90 DAYS
GROUP BY week, agent_type
ORDER BY week, total_cost DESC;
```

Quality Metrics

```sql
-- Defect escape analysis
SELECT
    review_phase.findings_count,
    post_deploy.defects_found,
    review_phase.findings_count > 0 as review_caught
FROM (
    SELECT project_id, sum(payload.findings_count) as findings_count
    FROM sdlc_events
    WHERE event_type = 'review.completed'
    GROUP BY project_id
) review_phase
JOIN (
    SELECT project_id, count() as defects_found
    FROM sdlc_events
    WHERE event_type = 'monitoring.defect_detected'
    GROUP BY project_id
) post_deploy ON review_phase.project_id = post_deploy.project_id;

-- Test flakiness trends
SELECT
    test_name,
    count() as total_runs,
    countIf(result = 'flaky') as flaky_runs,
    flaky_runs / total_runs as flakiness_rate
FROM sdlc_events
WHERE event_type = 'test.completed'
  AND timestamp > now() - INTERVAL 30 DAYS
GROUP BY test_name
HAVING flakiness_rate > 0.05
ORDER BY flakiness_rate DESC;
```

4.4 Cost Optimization Queries

```sql
-- High-cost operations
SELECT
    event_type,
    agent_type,
    count() as operation_count,
    sum(cost_usd) as total_cost,
    avg(cost_usd) as avg_cost,
    max(cost_usd) as max_cost
FROM sdlc_events
WHERE cost_usd > 0.01
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY event_type, agent_type
ORDER BY total_cost DESC
LIMIT 50;

-- Cache effectiveness
SELECT
    agent_type,
    sum(payload.cache_hits) as hits,
    sum(payload.cache_misses) as misses,
    hits / (hits + misses) as hit_ratio
FROM sdlc_events
WHERE event_type = 'agent.loop_iteration'
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY agent_type;
```

5. Dashboard Requirements

5.1 Dashboard Hierarchy

Observability Dashboards
│
├── Executive Overview
│   ├── SDLC Velocity
│   ├── Cost Summary
│   └── Quality Scorecard
│
├── Operational
│   ├── Real-Time Pipeline Status
│   ├── Agent Health
│   └── System Performance
│
├── Phase-Specific
│   ├── Planning Dashboard
│   ├── Implementation Dashboard
│   ├── Review Dashboard
│   ├── Testing Dashboard
│   ├── Deployment Dashboard
│   └── Monitoring Dashboard
│
├── Debugging
│   ├── Trace Explorer
│   ├── Error Analysis
│   └── Cost Investigation
│
└── Analytics
    ├── Trends & Forecasting
    ├── Comparative Analysis
    └── Capacity Planning

5.2 Executive Overview Dashboard

| Panel | Visualization | Data Source | Refresh |
|---|---|---|---|
| Lead Time Trend | Line chart | sdlc.cicd.lead_time | 1h |
| Deployment Frequency | Bar chart | sdlc.cicd.deployment.count | 1h |
| Change Failure Rate | Gauge | sdlc.cicd.rollback_rate | 5m |
| MTTR | Stat panel | sdlc.cicd.mean_time_to_recovery | 5m |
| Cost Per Project | Pie chart | agent.cost.total | 1h |
| Quality Score | Scorecard | Composite metric | 1h |
| Active Projects | Table | sdlc.active_projects | 5m |

5.3 Real-Time Pipeline Status

```yaml
dashboard:
  title: "SDLC Pipeline Status"
  refresh: 5s
  panels:
    # Global health
    - title: "Active Executions"
      type: stat
      query: count(sdlc_active_executions)

    - title: "Queue Depth"
      type: gauge
      query: avg(sdlc_queue_depth)
      thresholds: [10, 50, 100]

    # Phase status
    - title: "Phase Distribution"
      type: pie
      query: |
        SELECT phase, count() FROM sdlc_events
        WHERE event_type = 'orchestration.phase_transition'
          AND timestamp > now() - INTERVAL 1 HOUR
        GROUP BY phase

    # Error heatmap
    - title: "Error Rate Heatmap"
      type: heatmap
      query: |
        SELECT toStartOfFiveMinute(timestamp) as time, phase, count() as errors
        FROM sdlc_events
        WHERE level = 'ERROR'
        GROUP BY time, phase

    # Cost rate
    - title: "Real-Time Cost"
      type: graph
      query: |
        rate(agent_cost_total_usd[5m])
```

5.4 Agent Health Dashboard

| Panel | Metric | Alert Threshold |
|---|---|---|
| Active Agents by Type | sdlc_active_agents | - |
| Agent Loop Duration | agent.loop_iteration.duration | p99 > 30s |
| Tool Error Rate | agent.tool_calls (error/total) | > 5% |
| Token Usage Rate | agent.token_usage | - |
| Stuck Agents | Custom query | > 0 |
| Agent Decisions/Min | agent.decision.count | - |
| Learning Rate | agent.learning.patterns_learned | - |

5.5 Phase-Specific Dashboards

Implementation Dashboard

```yaml
panels:
  - title: "Code Generation Rate"
    query: rate(sdlc.implementation.lines_written[5m])

  - title: "Validation Success Rate"
    query: |
      1 - (
        rate(agent.implementer.validation_failures[5m])
        / rate(sdlc.implementation.modules_completed[5m])
      )

  - title: "Tool Usage Distribution"
    type: pie
    query: |
      SELECT tool_name, count() FROM sdlc_events
      WHERE event_type = 'tool.executed'
      GROUP BY tool_name

  - title: "Iteration Count Distribution"
    type: histogram
    query: |
      SELECT iteration_count, count() FROM sdlc_events
      WHERE event_type = 'agent.loop_iteration'
      GROUP BY iteration_count
```

Deployment Dashboard

```yaml
panels:
  - title: "Deployment Success Rate"
    query: sdlc.cicd.success_rate

  - title: "Canary Health"
    query: |
      SELECT deployment_id, error_rate, latency_p95
      FROM sdlc_events
      WHERE event_type = 'deployment.validated'
      ORDER BY timestamp DESC
      LIMIT 10

  - title: "Rollback Timeline"
    type: timeline
    query: |
      SELECT timestamp, deployment_id, reason
      FROM sdlc_events
      WHERE event_type = 'deployment.rollback_initiated'

  - title: "Traffic Distribution"
    type: stacked_area
    query: |
      SELECT timestamp, deployment_id, payload.new_percent as traffic_percent
      FROM sdlc_events
      WHERE event_type = 'deployment.traffic_shifted'
```

6. Alerting Thresholds

6.1 Severity Levels

| Level | Response Time | Notification | Examples |
|---|---|---|---|
| P1 - Critical | Immediate | Page/SMS/Voice | System down, security breach, data loss |
| P2 - High | 15 minutes | Slack/Email | High error rate, deployment failure |
| P3 - Medium | 1 hour | Slack | Elevated latency, cost spike |
| P4 - Low | 4 hours | Email | Minor degradation, warning threshold |
| P5 - Info | Next business day | Dashboard | Trends, recommendations |

6.2 Alert Rules by Category

System Health Alerts

```yaml
alerts:
  - name: AgentStuck
    severity: P2
    condition: |
      max_over_time(
        (time() - sdlc_agent_last_activity_timestamp)[5m:]
      ) > 300
    for: 2m
    annotations:
      summary: "Agent {{ $labels.agent_id }} has been stuck for 5+ minutes"

  - name: HighErrorRate
    severity: P1
    condition: |
      (
        sum(rate(sdlc_events_total{level="ERROR"}[5m]))
        / sum(rate(sdlc_events_total[5m]))
      ) > 0.1
    for: 2m
    annotations:
      summary: "Error rate above 10%"

  - name: QueueBackup
    severity: P2
    condition: sdlc_queue_depth > 50
    for: 5m
    annotations:
      summary: "Pipeline queue backing up ({{ $value }} items)"
```

Performance Alerts

```yaml
alerts:
  - name: SlowAgentLoop
    severity: P3
    condition: |
      histogram_quantile(0.99,
        sum(rate(agent_loop_duration_bucket[5m])) by (le, agent_type)
      ) > 30
    for: 5m
    annotations:
      summary: "P99 agent loop duration > 30s for {{ $labels.agent_type }}"

  - name: HighLatency
    severity: P3
    condition: |
      histogram_quantile(0.95,
        sum(rate(sdlc_phase_duration_bucket[5m])) by (le, phase)
      ) > 300
    for: 10m
    annotations:
      summary: "P95 phase duration > 5 minutes for {{ $labels.phase }}"

  - name: CacheHitRateLow
    severity: P4
    condition: |
      (
        sum(rate(agent_cache_hits[5m]))
        / sum(rate(agent_cache_operations[5m]))
      ) < 0.5
    for: 15m
    annotations:
      summary: "Cache hit rate below 50%"
```

Cost Alerts

```yaml
alerts:
  - name: CostSpike
    severity: P3
    condition: |
      (
        sum(increase(agent_cost_total_usd[1h]))
        > 2 * sum(increase(agent_cost_total_usd[1h] offset 24h))
      )
    for: 15m
    annotations:
      summary: "Cost spike detected: 2x normal hourly rate"

  - name: HighCostProject
    severity: P4
    condition: |
      sum(increase(agent_cost_total_usd[24h])) by (project_id) > 100
    for: 1h
    annotations:
      summary: "Project {{ $labels.project_id }} exceeded $100/day"

  - name: ExpensiveOperation
    severity: P4
    condition: agent_cost_per_operation > 0.5
    for: 0m
    annotations:
      summary: "Single operation cost > $0.50"
```

Quality Alerts

```yaml
alerts:
  - name: DeploymentFailure
    severity: P1
    condition: |
      increase(sdlc_deployment_completed_total{status="failed"}[5m]) > 0
    for: 0m
    annotations:
      summary: "Deployment failure detected"

  - name: HighRollbackRate
    severity: P2
    condition: |
      (
        sum(rate(sdlc_cicd_rollback_total[1h]))
        / sum(rate(sdlc_cicd_deployment_total[1h]))
      ) > 0.05
    for: 10m
    annotations:
      summary: "Rollback rate above 5%"

  - name: TestPassRateLow
    severity: P2
    condition: sdlc_testing_pass_rate < 0.9
    for: 5m
    annotations:
      summary: "Test pass rate below 90%"

  - name: FlakyTestsDetected
    severity: P4
    condition: sdlc_testing_flakiness_rate > 0.1
    for: 1h
    annotations:
      summary: "Flaky test rate above 10%"
```

Security Alerts

```yaml
alerts:
  - name: SecurityFindingCritical
    severity: P1
    condition: |
      increase(sdlc_review_findings_total{severity="critical"}[5m]) > 0
    for: 0m
    annotations:
      summary: "Critical security finding detected"

  - name: SecretsExposed
    severity: P1
    condition: |
      increase(sdlc_cicd_security_scan_secrets_found[5m]) > 0
    for: 0m
    annotations:
      summary: "Potential secrets exposed in code"
```

6.3 Alert Routing

```yaml
routing:
  default: team-sdlc-oncall
  routes:
    - match:
        severity: P1
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: P2
      receiver: slack-alerts-high
    - match:
        severity: P3
      receiver: slack-alerts-medium
    - match:
        alertname: CostSpike
      receiver: finance-team
    - match:
        alertname: SecretsExposed
      receiver: security-team
    - match:
        agent_type: deployer
      receiver: platform-team

receivers:
  pagerduty-critical:
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        severity: critical
  slack-alerts-high:
    slack_configs:
      - channel: "#sdlc-alerts-high"
        send_resolved: true
  slack-alerts-medium:
    slack_configs:
      - channel: "#sdlc-alerts-medium"
        send_resolved: true
  finance-team:
    email_configs:
      - to: "finance@company.com"
  security-team:
    slack_configs:
      - channel: "#security-incidents"
        send_resolved: true
```

6.4 Alert Suppression

```yaml
inhibition_rules:
  # Suppress lower-severity alerts when critical is firing
  - source_match:
      severity: P1
    target_match:
      severity: P2
    equal: ['project_id', 'phase']

  # Suppress agent-specific alerts when orchestrator is down
  - source_match:
      alertname: OrchestratorDown
    target_match_re:
      alertname: Agent.*
    equal: ['environment']

silences:
  # Maintenance windows
  - matchers:
      - name: environment
        value: staging
    startsAt: "2026-02-07T02:00:00Z"
    endsAt: "2026-02-07T04:00:00Z"
    comment: "Scheduled maintenance"
```

7. Implementation Checklist

Phase 1: Foundation

  • Deploy Prometheus for metrics collection
  • Deploy Loki for log aggregation
  • Deploy Jaeger/Tempo for distributed tracing
  • Configure OpenTelemetry SDK in agents (see the bootstrap sketch after this list)
  • Set up Kafka for event streaming
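
A minimal bootstrap for the OpenTelemetry SDK item above, assuming the Node SDK packages (`@opentelemetry/sdk-node`, the OTLP HTTP exporters, `@opentelemetry/sdk-metrics`) and a collector listening on the default OTLP port; the endpoint URLs and export interval are assumptions.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

// One SDK instance per agent process: traces go to Jaeger/Tempo via the
// collector, metrics are pushed on a fixed interval.
const sdk = new NodeSDK({
  serviceName: 'sdlc-implementer', // maps to the service.name span attribute
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 15000,
  }),
});

sdk.start();
```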

Phase 2: Storage

  • Deploy ClickHouse for analytics
  • Configure S3 lifecycle policies
  • Set up Thanos for long-term metric storage
  • Deploy vector database for agent memory

Phase 3: Visualization

  • Deploy Grafana
  • Create executive dashboard
  • Create operational dashboards
  • Create phase-specific dashboards

Phase 4: Alerting

  • Deploy Alertmanager
  • Configure PagerDuty integration
  • Set up Slack notifications
  • Define alert routing rules
  • Create runbooks for each alert

Phase 5: Optimization

  • Implement sampling strategies
  • Tune retention policies
  • Optimize query performance
  • Set up cost monitoring
  • Create anomaly detection

8. References

External Standards

  • OpenTelemetry Specification
  • Prometheus Best Practices
  • W3C Distributed Tracing
  • CloudEvents Specification

Generated: 2026-02-06
Version: 1.0