Observability and Telemetry System Design

Overview

This document specifies the observability infrastructure for agentic software development systems. It covers metrics, logging, tracing, dashboards, and alerting across all SDLC phases.

Date: 2026-02-06
Scope: End-to-end SDLC observability
Status: Specification


1. Metrics Architecture

1.1 Metric Categories

| Category | Description | Cardinality |
|---|---|---|
| Performance | Timing, latency, throughput | Medium |
| Reliability | Errors, failures, success rates | Low |
| Efficiency | Cost, resource utilization, token usage | Medium |
| Quality | Test results, coverage, defects | Medium |
| Agent | Loop iterations, decisions, learning | High |
| Business | Lead time, deployment frequency | Low |

1.2 Metric Definitions by Phase

Phase 1: Planning & Requirements

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.planning.duration | Histogram | seconds | Time to complete planning phase | project, complexity |
| sdlc.planning.iterations | Counter | count | Number of refinement iterations | project |
| sdlc.planning.human_interventions | Counter | count | Times human input was requested | reason |
| sdlc.planning.stories_generated | Counter | count | User stories created | project |
| sdlc.planning.architecture_options | Gauge | count | Architecture alternatives considered | project |
| agent.planner.token_usage | Counter | tokens | LLM tokens consumed | model, operation |
| agent.planner.cache_hit_ratio | Gauge | ratio | Cache effectiveness | model |
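
For illustration, the sketch below registers two of these instruments with the OpenTelemetry JS metrics API and records one planning run. It is a minimal sketch, assuming a MeterProvider has already been configured (see 1.3 and the checklist in section 7); the label values are hypothetical.

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('sdlc-planner');

// Instrument types follow the table: Histogram for durations, Counter for totals.
const planningDuration = meter.createHistogram('sdlc.planning.duration', {
  unit: 's',
  description: 'Time to complete planning phase',
});
const tokenUsage = meter.createCounter('agent.planner.token_usage', {
  unit: 'tokens',
  description: 'LLM tokens consumed',
});

// Record one planning run; label values here are illustrative.
planningDuration.record(142.7, { project: 'example-project', complexity: 'medium' });
tokenUsage.add(18432, { model: 'example-model', operation: 'story_generation' });
```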

Phase 2: Implementation

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.implementation.duration | Histogram | seconds | Time to implement | project, module_count |
| sdlc.implementation.lines_written | Counter | lines | Code lines generated | language, module |
| sdlc.implementation.lines_modified | Counter | lines | Code lines changed | language, operation |
| sdlc.implementation.modules_completed | Counter | count | Modules successfully implemented | project |
| sdlc.implementation.parallel_agents | Gauge | count | Concurrent worker agents | project |
| agent.implementer.loop_iterations | Counter | count | Agent loop iterations per task | task_type, outcome |
| agent.implementer.tool_calls | Counter | count | Tool invocations | tool_name, status |
| agent.implementer.tool_latency | Histogram | milliseconds | Tool execution time | tool_name |
| agent.implementer.validation_failures | Counter | count | Self-validation failures | failure_type |
| agent.implementer.cost_per_task | Histogram | USD | Cost per implementation task | complexity |
| code.complexity.cyclomatic | Gauge | score | Cyclomatic complexity | module, function |
| code.quality.maintainability_index | Gauge | score | Maintainability score | module |

Phase 3: Code Review

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.review.duration | Histogram | seconds | Time to complete review | review_type |
| sdlc.review.comments_generated | Counter | count | Total review comments | severity, analyzer |
| sdlc.review.comments_per_line | Gauge | ratio | Comment density | file |
| sdlc.review.false_positive_rate | Gauge | ratio | Incorrect AI suggestions | analyzer |
| sdlc.review.human_review_time | Histogram | seconds | Human reviewer time | risk_level |
| agent.reviewer.analysis_latency | Histogram | milliseconds | AI review response time | file_size |
| agent.reviewer.confidence_score | Gauge | score | AI confidence in findings | file |
| agent.reviewer.suggestions_applied | Counter | count | Auto-fixes applied | suggestion_type |
| quality.gate.passed | Counter | count | Gate check passes | gate_name |
| quality.gate.failed | Counter | count | Gate check failures | gate_name, reason |

Phase 4: Testing

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.testing.duration | Histogram | seconds | Total test execution time | test_suite |
| sdlc.testing.tests_executed | Counter | count | Total tests run | type, selector |
| sdlc.testing.tests_selected | Gauge | ratio | Percentage of test suite run | selection_strategy |
| sdlc.testing.pass_rate | Gauge | ratio | Test pass percentage | suite |
| sdlc.testing.coverage.line | Gauge | percent | Line coverage | module |
| sdlc.testing.coverage.branch | Gauge | percent | Branch coverage | module |
| sdlc.testing.coverage.function | Gauge | percent | Function coverage | module |
| sdlc.testing.flakiness_rate | Gauge | ratio | Flaky test percentage | test_name |
| agent.tester.generation_time | Histogram | seconds | Time to generate tests | target_type |
| agent.tester.tests_generated | Counter | count | Tests auto-generated | generation_type |
| test.execution.duration | Histogram | seconds | Individual test duration | test_name |
| test.failure.analysis_time | Histogram | seconds | RCA analysis duration | failure_type |

Phase 5: CI/CD & Deployment

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.cicd.lead_time | Histogram | seconds | Commit to production | project |
| sdlc.cicd.cycle_time | Histogram | seconds | Pipeline execution time | pipeline_id |
| sdlc.cicd.queue_time | Histogram | seconds | Time waiting for resources | runner_type |
| sdlc.cicd.success_rate | Gauge | ratio | Pipeline success rate | pipeline_id |
| sdlc.cicd.rollback_rate | Gauge | ratio | Percentage of rollbacks | project |
| sdlc.cicd.mean_time_to_recovery | Histogram | seconds | Recovery from failure | failure_type |
| sdlc.deploy.canary_duration | Histogram | seconds | Canary observation window | deployment_id |
| sdlc.deploy.traffic_shift_duration | Histogram | seconds | Time to shift traffic | strategy |
| pipeline.cache.hit_rate | Gauge | ratio | Build cache effectiveness | cache_type |
| pipeline.parallelization.efficiency | Gauge | ratio | Worker utilization | pipeline_id |
| deployment.validation.duration | Histogram | seconds | Post-deploy validation | check_type |
| deployment.error_rate.canary | Gauge | ratio | Error rate during canary | deployment_id |

Phase 6: Monitoring & Feedback

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| sdlc.monitoring.alert_count | Counter | count | Alerts fired | severity, type |
| sdlc.monitoring.mean_time_to_detect | Histogram | seconds | Time to detect issues | detection_method |
| agent.monitor.anomalies_detected | Counter | count | Anomalies found | severity |
| agent.monitor.synthetic_pass_rate | Gauge | ratio | Synthetic test success | endpoint |
| sdlc.feedback.defect_escape_rate | Gauge | ratio | Defects found post-deploy | severity |
| sdlc.feedback.user_satisfaction | Gauge | score | User feedback scores | feature |

Cross-Cutting: Agent Behavior

| Metric Name | Type | Unit | Description | Labels |
|---|---|---|---|---|
| agent.decision.count | Counter | count | Decisions made | agent, decision_type |
| agent.decision.confidence | Gauge | score | Confidence in decisions | agent |
| agent.decision.human_override | Counter | count | Human overrides | agent, reason |
| agent.learning.patterns_learned | Counter | count | New patterns stored | pattern_type |
| agent.learning.reflections | Counter | count | Reflection cycles | trigger_type |
| agent.memory.operations | Counter | count | Memory read/write | operation |
| agent.orchestrator.handoffs | Counter | count | Agent handoffs | from_agent, to_agent |
| agent.cost.total | Counter | USD | Total LLM cost | agent, model |
| agent.cost.per_request | Histogram | USD | Cost per request | agent |

1.3 Metric Collection Methods

```yaml
collection_methods:
  # Push-based: Agents send metrics directly
  push:
    - agent_internal_metrics    # Built-in agent telemetry
    - custom_business_metrics   # Application-specific

  # Pull-based: Scraped by collector
  pull:
    - prometheus_exporters      # Standard exporters
    - application_endpoints     # /metrics endpoints

  # Derived: Computed from other metrics
  derived:
    - rate_calculations         # rates_over_time()
    - aggregations              # sum by (label)
    - ratios                    # error_rate / total
```
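
For the pull path, one option is the Prometheus exporter from the OpenTelemetry JS SDK, which serves a scrapable /metrics endpoint. A minimal sketch, assuming the `@opentelemetry/exporter-prometheus` and `@opentelemetry/sdk-metrics` packages; the port is the exporter's documented default.

```typescript
import { metrics } from '@opentelemetry/api';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// The exporter is a MetricReader that serves /metrics on port 9464 by default.
const exporter = new PrometheusExporter({ port: 9464 }, () => {
  console.log('Scrape endpoint ready at http://localhost:9464/metrics');
});

// Register it globally so instruments created via metrics.getMeter(...)
// are exported on scrape rather than pushed.
const provider = new MeterProvider({ readers: [exporter] });
metrics.setGlobalMeterProvider(provider);
```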

2. Logging and Tracing Strategy

2.1 Structured Logging

Log Levels

| Level | Usage | Retention |
|---|---|---|
| DEBUG | Detailed agent reasoning, tool inputs/outputs | 7 days |
| INFO | Normal operations, phase transitions | 30 days |
| WARN | Degraded performance, retry attempts | 90 days |
| ERROR | Failures, exceptions, circuit breaker triggers | 1 year |
| CRITICAL | Safety stops, human escalation | Permanent |

Log Schema

```typescript
interface AgentLogEntry {
  // Identity
  timestamp: string;              // ISO 8601 with nanoseconds
  trace_id: string;               // W3C trace context
  span_id: string;
  parent_span_id?: string;

  // Source
  agent_id: string;
  agent_type: 'planner' | 'implementer' | 'reviewer' | 'tester' | 'deployer' | 'monitor' | 'orchestrator';
  phase: 'planning' | 'implementation' | 'review' | 'testing' | 'deployment' | 'monitoring';
  version: string;

  // Content
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'CRITICAL';
  message: string;
  event_type: string;

  // Context
  context: {
    project_id: string;
    task_id: string;
    iteration?: number;
    session_id: string;
  };

  // Payload
  payload?: {
    input?: unknown;
    output?: unknown;
    duration_ms?: number;
    token_usage?: {
      prompt: number;
      completion: number;
      total: number;
    };
    cost_usd?: number;
  };

  // Metadata
  labels: Record<string, string>;
}
```
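
To keep emitters honest about this schema, a thin helper can fill in the identity fields from the active OpenTelemetry span. A sketch under stated assumptions: `emit` is a hypothetical transport that writes NDJSON to stdout for collection by the log pipeline in section 3.

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical transport: one JSON object per line, picked up by Loki/Elasticsearch.
function emit(entry: AgentLogEntry): void {
  process.stdout.write(JSON.stringify(entry) + '\n');
}

function logEvent(
  base: Pick<AgentLogEntry, 'agent_id' | 'agent_type' | 'phase' | 'version' | 'context' | 'labels'>,
  level: AgentLogEntry['level'],
  event_type: string,
  message: string,
  payload?: AgentLogEntry['payload'],
): void {
  // Pull trace identity from the active span so logs and traces correlate.
  const spanContext = trace.getActiveSpan()?.spanContext();
  emit({
    ...base,
    timestamp: new Date().toISOString(),
    trace_id: spanContext?.traceId ?? '',
    span_id: spanContext?.spanId ?? '',
    level,
    event_type,
    message,
    payload,
  });
}
```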

2.2 Event Schemas

Event Types by Phase

```typescript
// Planning Phase Events
interface PlanningStarted {
  event_type: 'planning.started';
  payload: {
    requirements_summary: string;
    estimated_complexity: 'low' | 'medium' | 'high';
  };
}

interface PlanningCompleted {
  event_type: 'planning.completed';
  payload: {
    stories_count: number;
    architecture_selected: string;
    duration_ms: number;
  };
}

// Implementation Phase Events
interface ImplementationStarted {
  event_type: 'implementation.started';
  payload: {
    module_count: number;
    parallel_workers: number;
  };
}

interface ToolExecution {
  event_type: 'tool.executed';
  payload: {
    tool_name: string;
    tool_version: string;
    input_hash: string;
    output_hash: string;
    duration_ms: number;
    success: boolean;
    retry_count: number;
  };
}

interface AgentLoopIteration {
  event_type: 'agent.loop_iteration';
  payload: {
    iteration_number: number;
    perception_summary: string;
    reasoning_summary: string;
    action_taken: string;
    outcome: 'success' | 'failure' | 'retry';
  };
}

// Review Phase Events
interface ReviewStarted {
  event_type: 'review.started';
  payload: {
    files_changed: number;
    lines_changed: number;
    risk_score: number;
  };
}

interface FindingDetected {
  event_type: 'review.finding_detected';
  payload: {
    severity: 'low' | 'medium' | 'high' | 'critical';
    category: string;
    file: string;
    line: number;
    analyzer: string;
    confidence: number;
    message: string;
  };
}

interface ReviewCompleted {
  event_type: 'review.completed';
  payload: {
    findings_count: number;
    by_severity: Record<string, number>;
    approved: boolean;
    requires_human: boolean;
  };
}

// Testing Phase Events
interface TestExecutionStarted {
  event_type: 'test.execution_started';
  payload: {
    test_count: number;
    selection_strategy: string;
    estimated_duration_ms: number;
  };
}

interface TestCompleted {
  event_type: 'test.completed';
  payload: {
    test_id: string;
    test_name: string;
    suite: string;
    result: 'passed' | 'failed' | 'skipped' | 'flaky';
    duration_ms: number;
    assertions: number;
    error?: string;
  };
}

interface CoverageReported {
  event_type: 'test.coverage_reported';
  payload: {
    line_coverage: number;
    branch_coverage: number;
    function_coverage: number;
    uncovered_lines: number;
  };
}

// Deployment Phase Events
interface DeploymentStarted {
  event_type: 'deployment.started';
  payload: {
    deployment_id: string;
    strategy: 'canary' | 'blue_green' | 'rolling' | 'immediate';
    target_environment: string;
    artifact_version: string;
  };
}

interface TrafficShifted {
  event_type: 'deployment.traffic_shifted';
  payload: {
    deployment_id: string;
    previous_percent: number;
    new_percent: number;
    duration_ms: number;
  };
}

interface DeploymentValidated {
  event_type: 'deployment.validated';
  payload: {
    deployment_id: string;
    checks_passed: number;
    checks_failed: number;
    error_rate: number;
    latency_p95: number;
  };
}

interface RollbackInitiated {
  event_type: 'deployment.rollback_initiated';
  payload: {
    deployment_id: string;
    reason: string;
    trigger: 'automatic' | 'manual';
    metrics_at_trigger: Record<string, number>;
  };
}

// Monitoring Phase Events
interface AnomalyDetected {
  event_type: 'monitoring.anomaly_detected';
  payload: {
    metric: string;
    expected_value: number;
    actual_value: number;
    deviation_percent: number;
    severity: string;
  };
}

interface AlertFired {
  event_type: 'monitoring.alert_fired';
  payload: {
    alert_id: string;
    alert_name: string;
    severity: 'warning' | 'critical' | 'emergency';
    condition: string;
    value: number;
    threshold: number;
  };
}

// Orchestration Events
interface PhaseTransition {
  event_type: 'orchestration.phase_transition';
  payload: {
    from_phase: string;
    to_phase: string;
    checkpoint_id: string;
    duration_in_previous_ms: number;
  };
}

interface HumanInterventionRequested {
  event_type: 'orchestration.human_intervention_requested';
  payload: {
    reason: string;
    urgency: 'low' | 'medium' | 'high';
    estimated_response_time: number;
    context_summary: string;
  };
}

interface CheckpointCreated {
  event_type: 'orchestration.checkpoint_created';
  payload: {
    checkpoint_id: string;
    phase: string;
    state_size_bytes: number;
    validation_passed: boolean;
  };
}
```
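
Because each interface fixes a literal `event_type`, the schemas compose into a discriminated union, and TypeScript narrows the payload type per branch at handling sites. A brief sketch over a subset of the events above:

```typescript
// Subset shown for brevity; the full union would include every event interface above.
type SDLCEvent =
  | ToolExecution
  | AgentLoopIteration
  | TestCompleted
  | RollbackInitiated;

function handleEvent(event: SDLCEvent): void {
  switch (event.event_type) {
    case 'tool.executed':
      // Narrowed to ToolExecution: payload.tool_name is type-safe here.
      console.log(`tool ${event.payload.tool_name} took ${event.payload.duration_ms}ms`);
      break;
    case 'deployment.rollback_initiated':
      console.warn(`rollback of ${event.payload.deployment_id}: ${event.payload.reason}`);
      break;
    default:
      break;
  }
}
```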

2.3 Distributed Tracing

Trace Structure

Trace: sdlc_execution (root)
├── Span: planning_phase
│   ├── Span: requirements_analysis
│   ├── Span: architecture_design
│   └── Span: story_generation
├── Span: implementation_phase
│   ├── Span: module_implementation (parallel)
│   │   ├── Span: code_generation
│   │   ├── Span: self_validation
│   │   └── Span: iteration_loop (repeated)
│   └── Span: integration
├── Span: review_phase
│   ├── Span: static_analysis
│   ├── Span: security_scan
│   └── Span: ai_review
├── Span: testing_phase
│   ├── Span: test_selection
│   ├── Span: test_execution (parallel per worker)
│   └── Span: failure_analysis
├── Span: deployment_phase
│   ├── Span: build
│   ├── Span: canary_deploy
│   │   ├── Span: traffic_shift (repeated)
│   │   └── Span: health_check (repeated)
│   └── Span: full_rollout
└── Span: monitoring_phase
    ├── Span: synthetic_test
    └── Span: metric_collection

Span Attributes

```typescript
interface SDLCSpanAttributes {
  // Standard OpenTelemetry attributes
  'service.name': string;
  'service.version': string;
  'deployment.environment': string;

  // SDLC-specific attributes
  'sdlc.project_id': string;
  'sdlc.phase': string;
  'sdlc.agent_id': string;
  'sdlc.agent_type': string;

  // Cost attributes
  'cost.tokens_prompt': number;
  'cost.tokens_completion': number;
  'cost.usd': number;

  // Performance attributes
  'performance.iteration_count': number;
  'performance.tool_calls': number;
  'performance.cache_hits': number;
}
```
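
These attributes would typically be set when a phase span is opened. A minimal sketch with the OpenTelemetry JS tracing API, assuming a configured TracerProvider; the attribute keys mirror the interface above and the values are illustrative.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('sdlc-orchestrator');

async function runReviewPhase(projectId: string): Promise<void> {
  await tracer.startActiveSpan('review_phase', async (span) => {
    span.setAttribute('sdlc.project_id', projectId);
    span.setAttribute('sdlc.phase', 'review');
    span.setAttribute('sdlc.agent_type', 'reviewer');
    try {
      // ... static analysis, security scan, AI review run as child spans ...
      span.setAttribute('performance.tool_calls', 12); // illustrative value
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```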

2.4 Context Propagation

```typescript
// W3C Trace Context propagation
interface TraceContext {
  traceparent: string;  // 00-{trace_id}-{span_id}-{flags}
  tracestate: string;   // Vendor-specific context
}

// SDLC-specific baggage
interface SDLCBaggage {
  'sdlc.project_id': string;
  'sdlc.session_id': string;
  'sdlc.human_owner': string;
  'sdlc.criticality': 'low' | 'medium' | 'high';
}
```
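
In practice, propagation means injecting the active context into whatever carrier moves between agents (HTTP headers, queue message metadata) and extracting it on the other side. A sketch with the OpenTelemetry JS propagation API, assuming the W3C trace context and baggage propagators are registered; the baggage keys follow the interface above.

```typescript
import { context, propagation } from '@opentelemetry/api';

// Sender: attach SDLC baggage, then inject traceparent/tracestate/baggage
// into a plain header map carried on the inter-agent message.
const baggage = propagation.createBaggage({
  'sdlc.project_id': { value: 'example-project' },
  'sdlc.criticality': { value: 'high' },
});
const outgoing: Record<string, string> = {};
propagation.inject(propagation.setBaggage(context.active(), baggage), outgoing);

// Receiver: restore the context so new spans join the same trace.
function onMessage(headers: Record<string, string>): void {
  const restored = propagation.extract(context.active(), headers);
  context.with(restored, () => {
    // Spans started here become children of the sender's span.
  });
}
```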

3. Storage Recommendations

3.1 Storage Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Storage Layer                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Hot Storage │  │  Warm Storage│  │  Cold Storage│          │
│  │  (Real-time) │  │  (Analytics) │  │  (Archive)   │          │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤          │
│  │ Prometheus   │  │ ClickHouse   │  │ S3/GCS       │          │
│  │ Redis        │  │ BigQuery     │  │ Glacier      │          │
│  │ InfluxDB     │  │ Snowflake    │  │              │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                 │
│  ┌──────────────────────────────────────────────────┐          │
│  │              Event Store                         │          │
│  │  (Kafka / Pulsar / EventStoreDB)                │          │
│  └──────────────────────────────────────────────────┘          │
│                                                                 │
│  ┌──────────────────────────────────────────────────┐          │
│  │              Trace Store                         │          │
│  │  (Jaeger / Tempo / Honeycomb)                   │          │
│  └──────────────────────────────────────────────────┘          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.2 Storage by Data Type

| Data Type | Primary Store | Retention | Query Pattern |
|---|---|---|---|
| Metrics (real-time) | Prometheus | 15 days | Time-series aggregation |
| Metrics (long-term) | Thanos/Cortex | 2 years | Historical trends |
| Metrics (analytics) | ClickHouse | 5 years | OLAP queries |
| Logs (hot) | Loki/Elasticsearch | 7 days | Full-text search |
| Logs (warm) | S3 + Athena | 90 days | Infrequent queries |
| Logs (cold) | Glacier | 7 years | Compliance only |
| Events | Kafka + ClickHouse | 2 years | Event sourcing |
| Traces | Jaeger/Tempo | 7 days | Distributed tracing |
| Traces (sampled) | Long-term store | 30 days | Error analysis |
| Checkpoints | S3 + EFS | 30 days | Recovery |
| Learnings/Memory | Vector DB (Pinecone) | Permanent | Semantic search |

3.3 Schema Design

Time-Series Metrics Schema (ClickHouse)

```sql
CREATE TABLE sdlc_metrics (
    timestamp DateTime64(9),
    metric_name LowCardinality(String),
    metric_value Float64,

    -- Labels
    project_id LowCardinality(String),
    phase LowCardinality(String),
    agent_id LowCardinality(String),
    agent_type LowCardinality(String),
    environment LowCardinality(String),

    -- Dynamic labels as Map
    labels Map(LowCardinality(String), String),

    -- Aggregation
    INDEX idx_project project_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_phase phase TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (metric_name, project_id, timestamp);
```

Events Schema (ClickHouse)

```sql
CREATE TABLE sdlc_events (
    timestamp DateTime64(9),
    trace_id String,
    span_id String,
    parent_span_id String,
    event_type LowCardinality(String),
    level LowCardinality(String),
    agent_id LowCardinality(String),
    agent_type LowCardinality(String),
    phase LowCardinality(String),
    project_id LowCardinality(String),
    message String,
    payload JSON,

    -- Cost tracking
    cost_usd Float64,
    tokens_prompt UInt32,
    tokens_completion UInt32,

    -- Performance
    duration_ms UInt32,

    INDEX idx_trace trace_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_event_type event_type TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (event_type, project_id, timestamp);
```
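
For reference, writing an event row from the agent side could look like the sketch below. It assumes the `@clickhouse/client` Node package and a local HTTP endpoint; the column names follow the DDL above and all values are illustrative.

```typescript
import { createClient } from '@clickhouse/client';

// Assumed local ClickHouse HTTP endpoint; adjust for your deployment.
const clickhouse = createClient({ url: 'http://localhost:8123' });

// Insert one row shaped like the sdlc_events DDL above.
async function recordEvent(): Promise<void> {
  await clickhouse.insert({
    table: 'sdlc_events',
    values: [
      {
        timestamp: new Date().toISOString(),
        trace_id: '0af7651916cd43dd8448eb211c80319c',
        span_id: 'b7ad6b7169203331',
        parent_span_id: '',
        event_type: 'tool.executed',
        level: 'INFO',
        agent_id: 'implementer-07',
        agent_type: 'implementer',
        phase: 'implementation',
        project_id: 'example-project',
        message: 'tool run completed',
        payload: { tool_name: 'linter', success: true },
        cost_usd: 0,
        tokens_prompt: 0,
        tokens_completion: 0,
        duration_ms: 412,
      },
    ],
    format: 'JSONEachRow',
  });
}
```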

Trace Schema (Jaeger/Tempo compatible)

```yaml
trace:
  trace_id: string
  span_id: string
  parent_span_id: string
  service_name: string
  operation_name: string
  start_time: timestamp
  duration_ms: int
  tags:
    - key: string
      value: any
      type: string
  logs:
    - timestamp: timestamp
      fields: map<string, any>
  references:
    - ref_type: child_of | follows_from
      trace_id: string
      span_id: string
```

3.4 Sampling Strategies

```yaml
sampling:
  # Head-based sampling for normal operations
  head_sampling:
    rate: 0.1  # 10% of traces

  # Tail-based sampling for interesting traces
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_codes: [ERROR]
      # Sample slow operations
      - name: slow_requests
        type: latency
        threshold_ms: 5000
      # Sample high-cost operations
      - name: expensive
        type: attribute
        key: cost.usd
        threshold: 0.10
      # Sample specific phases
      - name: critical_phases
        type: attribute
        key: sdlc.phase
        values: [deployment, review]
```
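
The tail-sampling policies above reduce to a keep/drop decision over a completed trace. A simplified TypeScript sketch of that decision logic (the field names are assumptions for illustration, not a real collector API):

```typescript
// Simplified view of a completed trace for sampling decisions.
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
  costUsd: number;
  phase: string;
}

// Mirrors the tail_sampling policies above: any matching policy keeps the
// trace; otherwise fall back to the 10% head-sampling rate.
function shouldKeep(t: CompletedTrace): boolean {
  if (t.hasError) return true;                                  // errors policy
  if (t.durationMs > 5000) return true;                         // slow_requests policy
  if (t.costUsd > 0.10) return true;                            // expensive policy
  if (['deployment', 'review'].includes(t.phase)) return true;  // critical_phases policy
  return Math.random() < 0.1;                                   // head_sampling rate
}
```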

4. Query Patterns

4.1 Operational Queries

Real-Time Monitoring

```promql
# Error rate by phase
sum(rate(sdlc_events_total{level="ERROR"}[5m])) by (phase)
  / sum(rate(sdlc_events_total[5m])) by (phase)

# Current active agents
sum(sdlc_active_agents) by (agent_type)

# Deployment success rate (last hour)
sum(rate(sdlc_deployment_completed_total{status="success"}[1h]))
  / sum(rate(sdlc_deployment_completed_total[1h]))

# Cost per project (current day)
sum(increase(agent_cost_total_usd[1d])) by (project_id)
```

Health Checks

```sql
-- Stuck agents (no activity for >5 minutes)
SELECT
    agent_id,
    agent_type,
    max(timestamp) as last_seen,
    now() - max(timestamp) as idle_duration
FROM sdlc_events
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY agent_id, agent_type
HAVING idle_duration > 300;

-- Failed deployments requiring attention
SELECT
    deployment_id,
    project_id,
    timestamp,
    payload.reason
FROM sdlc_events
WHERE event_type = 'deployment.rollback_initiated'
  AND timestamp > now() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;
```

4.2 Debugging Queries

Trace Analysis

```sql
-- Find slow traces
SELECT
    trace_id,
    duration_ms,
    phase,
    agent_id
FROM sdlc_spans
WHERE duration_ms > 60000
  AND timestamp > now() - INTERVAL 1 HOUR
ORDER BY duration_ms DESC
LIMIT 100;

-- Error trace details
SELECT
    trace_id,
    span_id,
    event_type,
    message,
    payload.error
FROM sdlc_events
WHERE level = 'ERROR'
  AND trace_id IN (
    SELECT trace_id FROM sdlc_events
    WHERE event_type = 'orchestration.phase_transition'
      AND timestamp > now() - INTERVAL 1 HOUR
  )
ORDER BY timestamp;
```

Agent Behavior Analysis

```sql
-- Agent loop efficiency
SELECT
    agent_id,
    avg(iteration_count) as avg_iterations,
    avg(duration_ms) as avg_duration,
    countIf(outcome = 'success') / count() as success_rate
FROM sdlc_events
WHERE event_type = 'agent.loop_iteration'
  AND timestamp > now() - INTERVAL 24 HOURS
GROUP BY agent_id;

-- Tool usage patterns
SELECT
    payload.tool_name,
    count() as call_count,
    avg(payload.duration_ms) as avg_latency,
    countIf(payload.success = false) / count() as error_rate
FROM sdlc_events
WHERE event_type = 'tool.executed'
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY payload.tool_name
ORDER BY call_count DESC;
```

4.3 Analytics Queries

Performance Trends

```sql
-- Phase duration trends
SELECT
    toStartOfDay(timestamp) as day,
    phase,
    avg(duration_ms) / 1000 as avg_duration_seconds,
    quantile(0.95)(duration_ms) / 1000 as p95_duration_seconds
FROM sdlc_events
WHERE event_type LIKE '%.completed'
  AND timestamp > now() - INTERVAL 30 DAYS
GROUP BY day, phase
ORDER BY day, phase;

-- Cost trends by agent type
SELECT
    toStartOfWeek(timestamp) as week,
    agent_type,
    sum(cost_usd) as total_cost,
    avg(cost_usd) as avg_per_operation
FROM sdlc_events
WHERE cost_usd > 0
  AND timestamp > now() - INTERVAL 90 DAYS
GROUP BY week, agent_type
ORDER BY week, total_cost DESC;
```

Quality Metrics

```sql
-- Defect escape analysis
SELECT
    review_phase.findings_count,
    post_deploy.defects_found,
    review_phase.findings_count > 0 as review_caught
FROM (
    SELECT project_id, sum(payload.findings_count) as findings_count
    FROM sdlc_events
    WHERE event_type = 'review.completed'
    GROUP BY project_id
) review_phase
JOIN (
    SELECT project_id, count() as defects_found
    FROM sdlc_events
    WHERE event_type = 'monitoring.defect_detected'
    GROUP BY project_id
) post_deploy ON review_phase.project_id = post_deploy.project_id;

-- Test flakiness trends
SELECT
    test_name,
    count() as total_runs,
    countIf(result = 'flaky') as flaky_runs,
    flaky_runs / total_runs as flakiness_rate
FROM sdlc_events
WHERE event_type = 'test.completed'
  AND timestamp > now() - INTERVAL 30 DAYS
GROUP BY test_name
HAVING flakiness_rate > 0.05
ORDER BY flakiness_rate DESC;
```

4.4 Cost Optimization Queries

```sql
-- High-cost operations
SELECT
    event_type,
    agent_type,
    count() as operation_count,
    sum(cost_usd) as total_cost,
    avg(cost_usd) as avg_cost,
    max(cost_usd) as max_cost
FROM sdlc_events
WHERE cost_usd > 0.01
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY event_type, agent_type
ORDER BY total_cost DESC
LIMIT 50;

-- Cache effectiveness
SELECT
    agent_type,
    sum(payload.cache_hits) as hits,
    sum(payload.cache_misses) as misses,
    hits / (hits + misses) as hit_ratio
FROM sdlc_events
WHERE event_type = 'agent.loop_iteration'
  AND timestamp > now() - INTERVAL 7 DAYS
GROUP BY agent_type;
```

5. Dashboard Requirements

5.1 Dashboard Hierarchy

Observability Dashboards
│
├── Executive Overview
│   ├── SDLC Velocity
│   ├── Cost Summary
│   └── Quality Scorecard
│
├── Operational
│   ├── Real-Time Pipeline Status
│   ├── Agent Health
│   └── System Performance
│
├── Phase-Specific
│   ├── Planning Dashboard
│   ├── Implementation Dashboard
│   ├── Review Dashboard
│   ├── Testing Dashboard
│   ├── Deployment Dashboard
│   └── Monitoring Dashboard
│
├── Debugging
│   ├── Trace Explorer
│   ├── Error Analysis
│   └── Cost Investigation
│
└── Analytics
    ├── Trends & Forecasting
    ├── Comparative Analysis
    └── Capacity Planning

5.2 Executive Overview Dashboard

| Panel | Visualization | Data Source | Refresh |
|---|---|---|---|
| Lead Time Trend | Line chart | sdlc.cicd.lead_time | 1h |
| Deployment Frequency | Bar chart | sdlc.cicd.deployment.count | 1h |
| Change Failure Rate | Gauge | sdlc.cicd.rollback_rate | 5m |
| MTTR | Stat panel | sdlc.cicd.mean_time_to_recovery | 5m |
| Cost Per Project | Pie chart | agent.cost.total | 1h |
| Quality Score | Scorecard | Composite metric | 1h |
| Active Projects | Table | sdlc.active_projects | 5m |

5.3 Real-Time Pipeline Status

```yaml
dashboard:
  title: "SDLC Pipeline Status"
  refresh: 5s
  panels:
    # Global health
    - title: "Active Executions"
      type: stat
      query: count(sdlc_active_executions)

    - title: "Queue Depth"
      type: gauge
      query: avg(sdlc_queue_depth)
      thresholds: [10, 50, 100]

    # Phase status
    - title: "Phase Distribution"
      type: pie
      query: |
        SELECT phase, count() FROM sdlc_events
        WHERE event_type = 'orchestration.phase_transition'
          AND timestamp > now() - INTERVAL 1 HOUR
        GROUP BY phase

    # Error heatmap
    - title: "Error Rate Heatmap"
      type: heatmap
      query: |
        SELECT toStartOfFiveMinute(timestamp) as time, phase, count() as errors
        FROM sdlc_events
        WHERE level = 'ERROR'
        GROUP BY time, phase

    # Cost rate
    - title: "Real-Time Cost"
      type: graph
      query: |
        rate(agent_cost_total_usd[5m])
```

5.4 Agent Health Dashboard

| Panel | Metric | Alert Threshold |
|---|---|---|
| Active Agents by Type | sdlc_active_agents | - |
| Agent Loop Duration | agent.loop_iteration.duration | p99 > 30s |
| Tool Error Rate | agent.tool_calls (error/total) | > 5% |
| Token Usage Rate | agent.token_usage | - |
| Stuck Agents | Custom query | > 0 |
| Agent Decisions/Min | agent.decision.count | - |
| Learning Rate | agent.learning.patterns_learned | - |

5.5 Phase-Specific Dashboards

Implementation Dashboard

```yaml
panels:
  - title: "Code Generation Rate"
    query: rate(sdlc.implementation.lines_written[5m])

  - title: "Validation Success Rate"
    query: |
      1 - (
        rate(agent.implementer.validation_failures[5m])
        / rate(sdlc.implementation.modules_completed[5m])
      )

  - title: "Tool Usage Distribution"
    type: pie
    query: |
      SELECT tool_name, count() FROM sdlc_events
      WHERE event_type = 'tool.executed'
      GROUP BY tool_name

  - title: "Iteration Count Distribution"
    type: histogram
    query: |
      SELECT iteration_count, count() FROM sdlc_events
      WHERE event_type = 'agent.loop_iteration'
      GROUP BY iteration_count
```

Deployment Dashboard

```yaml
panels:
  - title: "Deployment Success Rate"
    query: sdlc.cicd.success_rate

  - title: "Canary Health"
    query: |
      SELECT deployment_id, error_rate, latency_p95
      FROM sdlc_events
      WHERE event_type = 'deployment.validated'
      ORDER BY timestamp DESC
      LIMIT 10

  - title: "Rollback Timeline"
    type: timeline
    query: |
      SELECT timestamp, deployment_id, reason
      FROM sdlc_events
      WHERE event_type = 'deployment.rollback_initiated'

  - title: "Traffic Distribution"
    type: stacked_area
    query: |
      SELECT timestamp, deployment_id, payload.new_percent as traffic_percent
      FROM sdlc_events
      WHERE event_type = 'deployment.traffic_shifted'
```

6. Alerting Thresholds

6.1 Severity Levels

| Level | Response Time | Notification | Examples |
|---|---|---|---|
| P1 - Critical | Immediate | Page/SMS/Voice | System down, security breach, data loss |
| P2 - High | 15 minutes | Slack/Email | High error rate, deployment failure |
| P3 - Medium | 1 hour | Slack | Elevated latency, cost spike |
| P4 - Low | 4 hours | Email | Minor degradation, warning threshold |
| P5 - Info | Next business day | Dashboard | Trends, recommendations |

6.2 Alert Rules by Category

System Health Alerts

```yaml
alerts:
  - name: AgentStuck
    severity: P2
    condition: |
      max_over_time(
        (time() - sdlc_agent_last_activity_timestamp)[5m:]
      ) > 300
    for: 2m
    annotations:
      summary: "Agent {{ $labels.agent_id }} has been stuck for 5+ minutes"

  - name: HighErrorRate
    severity: P1
    condition: |
      (
        sum(rate(sdlc_events_total{level="ERROR"}[5m]))
        / sum(rate(sdlc_events_total[5m]))
      ) > 0.1
    for: 2m
    annotations:
      summary: "Error rate above 10%"

  - name: QueueBackup
    severity: P2
    condition: sdlc_queue_depth > 50
    for: 5m
    annotations:
      summary: "Pipeline queue backing up ({{ $value }} items)"
```

Performance Alerts

```yaml
alerts:
  - name: SlowAgentLoop
    severity: P3
    condition: |
      histogram_quantile(0.99,
        sum(rate(agent_loop_duration_bucket[5m])) by (le, agent_type)
      ) > 30
    for: 5m
    annotations:
      summary: "P99 agent loop duration > 30s for {{ $labels.agent_type }}"

  - name: HighLatency
    severity: P3
    condition: |
      histogram_quantile(0.95,
        sum(rate(sdlc_phase_duration_bucket[5m])) by (le, phase)
      ) > 300
    for: 10m
    annotations:
      summary: "P95 phase duration > 5 minutes for {{ $labels.phase }}"

  - name: CacheHitRateLow
    severity: P4
    condition: |
      (
        sum(rate(agent_cache_hits[5m]))
        / sum(rate(agent_cache_operations[5m]))
      ) < 0.5
    for: 15m
    annotations:
      summary: "Cache hit rate below 50%"
```

Cost Alerts

```yaml
alerts:
  - name: CostSpike
    severity: P3
    condition: |
      (
        sum(increase(agent_cost_total_usd[1h]))
        > 2 * sum(increase(agent_cost_total_usd[1h] offset 24h))
      )
    for: 15m
    annotations:
      summary: "Cost spike detected: 2x normal hourly rate"

  - name: HighCostProject
    severity: P4
    condition: |
      sum(increase(agent_cost_total_usd[24h])) by (project_id) > 100
    for: 1h
    annotations:
      summary: "Project {{ $labels.project_id }} exceeded $100/day"

  - name: ExpensiveOperation
    severity: P4
    condition: agent_cost_per_operation > 0.5
    for: 0m
    annotations:
      summary: "Single operation cost > $0.50"
```

Quality Alerts

```yaml
alerts:
  - name: DeploymentFailure
    severity: P1
    condition: |
      increase(sdlc_deployment_completed_total{status="failed"}[5m]) > 0
    for: 0m
    annotations:
      summary: "Deployment failure detected"

  - name: HighRollbackRate
    severity: P2
    condition: |
      (
        sum(rate(sdlc_cicd_rollback_total[1h]))
        / sum(rate(sdlc_cicd_deployment_total[1h]))
      ) > 0.05
    for: 10m
    annotations:
      summary: "Rollback rate above 5%"

  - name: TestPassRateLow
    severity: P2
    condition: sdlc_testing_pass_rate < 0.9
    for: 5m
    annotations:
      summary: "Test pass rate below 90%"

  - name: FlakyTestsDetected
    severity: P4
    condition: sdlc_testing_flakiness_rate > 0.1
    for: 1h
    annotations:
      summary: "Flaky test rate above 10%"
```

Security Alerts

```yaml
alerts:
  - name: SecurityFindingCritical
    severity: P1
    condition: |
      increase(sdlc_review_findings_total{severity="critical"}[5m]) > 0
    for: 0m
    annotations:
      summary: "Critical security finding detected"

  - name: SecretsExposed
    severity: P1
    condition: |
      increase(sdlc_cicd_security_scan_secrets_found[5m]) > 0
    for: 0m
    annotations:
      summary: "Potential secrets exposed in code"
```

6.3 Alert Routing

```yaml
routing:
  default: team-sdlc-oncall
  routes:
    - match:
        severity: P1
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: P2
      receiver: slack-alerts-high
    - match:
        severity: P3
      receiver: slack-alerts-medium
    - match:
        alertname: CostSpike
      receiver: finance-team
    - match:
        alertname: SecretsExposed
      receiver: security-team
    - match:
        agent_type: deployer
      receiver: platform-team

receivers:
  pagerduty-critical:
    pagerduty_configs:
      - service_key: "${PAGERDUTY_KEY}"
        severity: critical
  slack-alerts-high:
    slack_configs:
      - channel: "#sdlc-alerts-high"
        send_resolved: true
  slack-alerts-medium:
    slack_configs:
      - channel: "#sdlc-alerts-medium"
        send_resolved: true
  finance-team:
    email_configs:
      - to: "finance@company.com"
  security-team:
    slack_configs:
      - channel: "#security-incidents"
        send_resolved: true
```

6.4 Alert Suppression

```yaml
inhibition_rules:
  # Suppress lower-severity alerts when critical is firing
  - source_match:
      severity: P1
    target_match:
      severity: P2
    equal: ['project_id', 'phase']

  # Suppress agent-specific alerts when orchestrator is down
  - source_match:
      alertname: OrchestratorDown
    target_match_re:
      alertname: Agent.*
    equal: ['environment']

silences:
  # Maintenance windows
  - matchers:
      - name: environment
        value: staging
    startsAt: "2026-02-07T02:00:00Z"
    endsAt: "2026-02-07T04:00:00Z"
    comment: "Scheduled maintenance"
```

7. Implementation Checklist

Phase 1: Foundation

  • Deploy Prometheus for metrics collection
  • Deploy Loki for log aggregation
  • Deploy Jaeger/Tempo for distributed tracing
  • Configure OpenTelemetry SDK in agents (see the bootstrap sketch after this list)
  • Set up Kafka for event streaming
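
A minimal bootstrap for the OpenTelemetry SDK item above, assuming the Node SDK packages (`@opentelemetry/sdk-node`, the OTLP HTTP exporters, `@opentelemetry/sdk-metrics`) and a collector listening on the default OTLP port; the endpoint URLs and export interval are assumptions.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

// One SDK instance per agent process: traces go to Jaeger/Tempo via the
// collector, metrics are pushed on a fixed interval.
const sdk = new NodeSDK({
  serviceName: 'sdlc-implementer', // maps to the service.name span attribute
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 15000,
  }),
});

sdk.start();
```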

Phase 2: Storage

  • Deploy ClickHouse for analytics
  • Configure S3 lifecycle policies
  • Set up Thanos for long-term metric storage
  • Deploy vector database for agent memory

Phase 3: Visualization

  • Deploy Grafana
  • Create executive dashboard
  • Create operational dashboards
  • Create phase-specific dashboards

Phase 4: Alerting

  • Deploy Alertmanager
  • Configure PagerDuty integration
  • Set up Slack notifications
  • Define alert routing rules
  • Create runbooks for each alert

Phase 5: Optimization

  • Implement sampling strategies
  • Tune retention policies
  • Optimize query performance
  • Set up cost monitoring
  • Create anomaly detection

8. References

External Standards

  • OpenTelemetry Specification
  • Prometheus Best Practices
  • W3C Distributed Tracing
  • CloudEvents Specification

Generated: 2026-02-06
Version: 1.0