Hephaestus Monitoring Architecture - Implementation Deep Dive

📚 Looking for the user guide? This is a technical deep dive for contributors and advanced users. To understand how Guardian works from a user's perspective, see the Guardian Monitoring Guide.

System Architecture Overview

Component Details

Monitoring Loop

Location: run_monitor.py, src/monitoring/monitor.py
Responsibility: Orchestrates the monitoring cycle, which runs every 60 seconds by default (monitoring.interval_seconds)
Key Methods:
- _monitoring_cycle(): Main cycle execution
- _guardian_analysis_for_agent(): Individual agent analysis
- _save_conductor_analysis(): System analysis persistence
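
As a rough illustration of the cycle described above, the loop might be structured like the sketch below. It is not the code in monitor.py: run_cycles, list_agents, analyze_agent, and fake_analysis are hypothetical names, and max_cycles exists only so the example terminates. The interval and concurrency cap mirror the interval_seconds and max_concurrent settings.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def run_cycles(
    list_agents: Callable[[], List[str]],
    analyze_agent: Callable[[str], Awaitable[Dict]],
    interval_seconds: float = 60.0,
    max_concurrent: int = 10,
    max_cycles: Optional[int] = None,
) -> List[List[Dict]]:
    """Analyze every agent once per interval and collect the results per cycle."""
    semaphore = asyncio.Semaphore(max_concurrent)  # cap parallel analyses

    async def bounded(agent_id: str) -> Dict:
        async with semaphore:
            return await analyze_agent(agent_id)

    history: List[List[Dict]] = []
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        # gather() returns results in the same order as the agent list.
        results = await asyncio.gather(*(bounded(a) for a in list_agents()))
        history.append(list(results))
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            await asyncio.sleep(interval_seconds)
    return history


async def fake_analysis(agent_id: str) -> Dict:
    # Stand-in for the real per-agent analysis call (assumption).
    return {"agent_id": agent_id, "needs_steering": False}
```

Calling asyncio.run(run_cycles(lambda: ["a1", "a2"], fake_analysis, interval_seconds=0, max_cycles=2)) returns two cycles, each with one result per agent.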

Guardian System

Location: src/monitoring/guardian.py
Responsibility: Individual agent trajectory monitoring
Key Features:
- Grace period for new agents (default: 60 seconds, set by guardian_min_agent_age_seconds) before monitoring begins
- Builds accumulated context from the entire conversation
- Retrieves and uses past summaries for continuity
- Calls GPT-5 for intelligent trajectory analysis
- Provides targeted steering interventions
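
The grace period can be pictured as a simple age check before an agent is handed to Guardian. This is a minimal sketch under assumptions: is_past_grace_period is a hypothetical helper, the agent's creation time is assumed to be available as a timezone-aware datetime, and whether the real comparison is inclusive is not stated in the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_past_grace_period(
    created_at: datetime,
    min_age_seconds: int = 60,
    now: Optional[datetime] = None,
) -> bool:
    """Return True once the agent is old enough to be monitored."""
    now = now or datetime.now(timezone.utc)
    # Inclusive comparison is an assumption; the source only gives the default.
    return (now - created_at) >= timedelta(seconds=min_age_seconds)
```

Passing now explicitly keeps the check deterministic in tests; in a live loop it would default to the current UTC time.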

Conductor System

Location: src/monitoring/conductor.py
Responsibility: System-wide coherence and orchestration
Key Features:
- Analyzes all Guardian summaries collectively
- Detects duplicate work across agents
- Makes termination and coordination decisions
- Maintains system coherence score

Trajectory Context Builder

Location: src/monitoring/trajectory_context.py
Responsibility: Extracts meaningful context from agent history
Key Extractions:
- Overall goals and their evolution
- Persistent and lifted constraints
- Standing instructions
- Current focus and blockers

Prompt Loader

Location: src/monitoring/prompt_loader.py, src/prompts/
Responsibility: Manages GPT-5 prompt templates
Templates:
- guardian_trajectory_analysis.md - Guardian agent analysis prompts
- conductor_system_analysis.md - Conductor system analysis prompts
Key Features:
- Dynamic template loading from markdown files
- Context injection using Python's str.format()
- Structured JSON response formatting
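
One practical consequence of combining str.format() with JSON response instructions is that literal braces in a template must be doubled. The sketch below shows the idea; load_prompt and its directory argument are hypothetical and do not describe the actual prompt_loader.py API.

```python
from pathlib import Path


def load_prompt(template_dir: Path, name: str, **context: str) -> str:
    """Read <name>.md from template_dir and inject context with str.format()."""
    template = (template_dir / f"{name}.md").read_text(encoding="utf-8")
    # Literal JSON braces in the template must be written as {{ and }},
    # otherwise str.format() treats them as placeholders.
    return template.format(**context)
```

For example, a template containing `Goal: {goal}` followed by `{{"alignment_score": 0.0}}` renders with the goal substituted and single braces around the JSON. A missing context key raises KeyError, which surfaces template and caller mismatches early.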

Data Flow

Database Schema

Monitoring Cycle Phases

Key Algorithms

Accumulated Context Building

from typing import Any, Dict


def build_accumulated_context(agent_id: str) -> Dict[str, Any]:
    """
    Builds complete context from agent's entire session.

    1. Get all agent logs ordered by time
    2. Extract overall goal from task and conversation
    3. Track goal evolution through conversation
    4. Extract persistent constraints ("must", "cannot")
    5. Identify lifted constraints
    6. Extract standing instructions ("always", "remember")
    7. Identify current focus (most recent activity)
    8. Discover blockers from errors and stuck patterns
    9. Calculate session duration and conversation length

    Returns structured context for GPT-5 analysis.
    """
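
A few of the steps listed in the docstring (4, 6, 7, and 9) can be approximated with plain keyword matching, as in the sketch below. This is an illustration only: extract_context is a hypothetical function, it works on a list of log lines rather than an agent_id, and the real builder may use different heuristics.

```python
import re
from typing import Any, Dict, List

# Keyword cues taken from the docstring above; word boundaries avoid partial matches.
CONSTRAINT_RE = re.compile(r"\b(must|cannot)\b", re.IGNORECASE)
INSTRUCTION_RE = re.compile(r"\b(always|remember)\b", re.IGNORECASE)


def extract_context(log_lines: List[str]) -> Dict[str, Any]:
    """Pull constraints, standing instructions, and current focus from ordered logs."""
    return {
        "constraints": [line for line in log_lines if CONSTRAINT_RE.search(line)],
        "standing_instructions": [
            line for line in log_lines if INSTRUCTION_RE.search(line)
        ],
        # The most recent entry is treated as the current focus.
        "current_focus": log_lines[-1] if log_lines else None,
        "conversation_length": len(log_lines),
    }
```

Goal evolution, lifted constraints, and blocker discovery are left out here because they need more than keyword matching.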

Trajectory Alignment Scoring

from typing import Any, Dict, List


def calculate_alignment_score(
    trajectory: Dict[str, Any],
    goal: str,
    constraints: List[str]
) -> float:
    """
    GPT-5 calculates alignment score (0.0-1.0) based on:

    - Goal progress (0-40%):
      * How much of the goal is completed
      * Whether work contributes to goal

    - Constraint adherence (0-30%):
      * Following active constraints
      * Not violating any constraints

    - Efficiency (0-20%):
      * Not stuck or repeating
      * Making forward progress

    - Focus (0-10%):
      * Staying on current task
      * Not context switching

    The score is determined by GPT-5's analysis of the agent's
    trajectory_summary against the accumulated_goal and constraints.
    """
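
The four bands in the docstring sum to 100%, so the score can be read as a weighted sum of per-criterion ratings. The sketch below makes that arithmetic explicit. In the system described here GPT-5 produces the score directly, so combine_alignment and the component names are hypothetical and serve only to illustrate the weighting.

```python
from typing import Dict

# Weights taken from the percentage bands in the docstring above.
WEIGHTS: Dict[str, float] = {
    "goal_progress": 0.4,
    "constraint_adherence": 0.3,
    "efficiency": 0.2,
    "focus": 0.1,
}


def combine_alignment(components: Dict[str, float]) -> float:
    """Combine per-criterion ratings (each 0.0-1.0) into one 0.0-1.0 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing criteria count as 0.0; out-of-range ratings are clamped.
        rating = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * rating
    return round(total, 4)
```

An agent rated 0.5 on goal progress and efficiency but 1.0 on constraint adherence and focus would score 0.7 under this weighting.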

Duplicate Detection

from typing import Any, Dict, List


def detect_duplicates(
    summaries: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """
    GPT-5 detects duplicates by analyzing:

    1. Work descriptions similarity
    2. File/component overlap
    3. Goal similarity
    4. Approach similarity

    Returns list of duplicate pairs with similarity scores.
    """
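
To show the shape of the output (pairs with similarity scores, filtered by a threshold), the sketch below substitutes a plain string-similarity measure, difflib.SequenceMatcher, for the GPT-5 analysis. It is a stand-in technique, not the Conductor's method, and find_duplicate_pairs and the result keys are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Any, Dict, List


def find_duplicate_pairs(
    summaries: List[Dict[str, Any]], threshold: float = 0.8
) -> List[Dict[str, Any]]:
    """Compare every pair of summaries and keep those at or above the threshold."""
    pairs: List[Dict[str, Any]] = []
    for a, b in combinations(summaries, 2):
        # Character-level similarity of the two work descriptions (stand-in metric).
        score = SequenceMatcher(
            None, a["trajectory_summary"].lower(), b["trajectory_summary"].lower()
        ).ratio()
        if score >= threshold:
            pairs.append(
                {"agent_ids": (a["agent_id"], b["agent_id"]), "similarity": round(score, 3)}
            )
    return pairs
```

The default threshold matches duplicate_threshold in the configuration. Pairwise comparison is quadratic in the number of agents, which is acceptable for the small agent counts a single cycle handles.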

Performance Metrics

Configuration Parameters

# Monitoring Settings
monitoring:
  interval_seconds: 60                 # How often to run monitoring cycle
  parallel_analysis: true              # Analyze agents in parallel
  max_concurrent: 10                   # Max concurrent Guardian analyses
  guardian_min_agent_age_seconds: 60   # Grace period before Guardian monitors new agents

# Guardian Settings
guardian:
  past_summaries_limit: 10    # Number of past summaries to include
  context_history_lines: 200  # Lines of conversation to include
  tmux_output_lines: 100      # Lines from tmux to analyze
  cache_duration_minutes: 5   # How long to cache summaries

# Conductor Settings
conductor:
  min_agents_for_analysis: 2  # Minimum agents for system analysis
  duplicate_threshold: 0.8    # Similarity score for duplicates
  coherence_thresholds:
    critical: 0.3             # Below this triggers escalation
    warning: 0.5              # Below this increases monitoring
    healthy: 0.7              # Above this is normal

# Intervention Settings
interventions:
  max_nudges_before_restart: 3
  restart_cooldown_minutes: 10
  steering_types:
    - general
    - stuck
    - confused
    - wrong_direction
    - violating_constraints
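
The coherence_thresholds above can be read as cut-offs on the coherence score. The sketch below is one possible interpretation: classify_coherence is a hypothetical helper, and the "borderline" label for scores between the warning and healthy thresholds is an assumption, since the configuration comments do not name that band.

```python
def classify_coherence(
    score: float,
    critical: float = 0.3,
    warning: float = 0.5,
    healthy: float = 0.7,
) -> str:
    """Map a coherence score (0.0-1.0) to a status band."""
    if score < critical:
        return "critical"    # below this triggers escalation
    if score < warning:
        return "warning"     # below this increases monitoring
    if score > healthy:
        return "healthy"     # above this is normal
    return "borderline"      # assumed label for the unnamed middle band
```

Keeping the thresholds as parameters lets the function follow whatever values the conductor section of the configuration supplies.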

Integration Points

MCP Server Endpoints

# Get agent trajectories
GET /api/agent_trajectories
Response: {
    "agents": [
        {
            "agent_id": "uuid",
            "current_phase": "implementation",
            "alignment_score": 0.85,
            "trajectory_summary": "Building auth..."
        }
    ]
}

# Get system coherence
GET /api/system_coherence
Response: {
    "coherence_score": 0.75,
    "active_agents": 5,
    "duplicates": 1,
    "system_status": "Healthy with minor duplicates"
}

# Manual steering
POST /api/steer_agent
Body: {
    "agent_id": "uuid",
    "steering_type": "stuck",
    "message": "Try using X approach"
}
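
A client for the manual steering endpoint could be sketched with the standard library as follows. The base URL and the build_steer_request helper are assumptions for illustration; only the path, method, body fields, and steering types come from this document.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: adjust to the MCP server's address

# Steering types listed under interventions.steering_types in the configuration.
STEERING_TYPES = {"general", "stuck", "confused", "wrong_direction", "violating_constraints"}


def build_steer_request(
    agent_id: str, steering_type: str, message: str, base_url: str = BASE_URL
) -> Request:
    """Build (but do not send) a POST /api/steer_agent request."""
    if steering_type not in STEERING_TYPES:
        raise ValueError(f"unknown steering_type: {steering_type!r}")
    body = json.dumps(
        {"agent_id": agent_id, "steering_type": steering_type, "message": message}
    ).encode("utf-8")
    return Request(
        f"{base_url}/api/steer_agent",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen() to send it. Validating steering_type locally avoids a round trip for an obviously invalid value.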

Frontend Dashboard Integration

// Real-time monitoring data
interface MonitoringData {
    agents: AgentTrajectory[];
    systemCoherence: number;
    duplicates: Duplicate[];
    interventions: Intervention[];
}

// WebSocket updates
ws.on('monitoring_update', (data: MonitoringData) => {
    updateAgentGrid(data.agents);
    updateCoherenceChart(data.systemCoherence);
    highlightDuplicates(data.duplicates);
});

Troubleshooting Guide

Common Issues

Debug Commands

# Check monitoring logs
tail -f logs/monitor.log | grep -E "Guardian|Conductor"

# Check database state
sqlite3 hephaestus.db "
    SELECT COUNT(*) as analyses,
           AVG(alignment_score) as avg_score
    FROM guardian_analyses
    WHERE timestamp > datetime('now', '-1 hour');
"

# Check system coherence
sqlite3 hephaestus.db "
    SELECT timestamp, coherence_score, system_status
    FROM conductor_analyses
    ORDER BY timestamp DESC LIMIT 5;
"

# Check for stuck agents
sqlite3 hephaestus.db "
    SELECT agent_id, COUNT(*) as stuck_count
    FROM guardian_analyses
    WHERE needs_steering = 1
    GROUP BY agent_id
    HAVING stuck_count > 3;
"
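
The stuck-agent query can also be run from Python, which is convenient in scripts or tests. The sketch below uses an in-memory database whose schema is an assumption containing only the two columns the query needs; make_demo_db and stuck_agents are hypothetical helpers.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with only the columns this query uses (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE guardian_analyses (agent_id TEXT, needs_steering INTEGER)")
    rows = [("a1", 1)] * 4 + [("a2", 1)] * 2 + [("a2", 0)]
    conn.executemany("INSERT INTO guardian_analyses VALUES (?, ?)", rows)
    return conn


def stuck_agents(conn: sqlite3.Connection, min_count: int = 3) -> List[Tuple[str, int]]:
    """Return (agent_id, count) for agents flagged for steering more than min_count times."""
    query = """
        SELECT agent_id, COUNT(*) AS stuck_count
        FROM guardian_analyses
        WHERE needs_steering = 1
        GROUP BY agent_id
        HAVING COUNT(*) > ?
        ORDER BY agent_id
    """
    return conn.execute(query, (min_count,)).fetchall()
```

With the demo data, stuck_agents(make_demo_db()) returns [("a1", 4)]. Against a real deployment, pass sqlite3.connect("hephaestus.db") instead.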

Future Enhancements

Planned Features

1. Machine Learning Integration
   - Learn from successful interventions
   - Predict trajectory deviations
   - Optimize steering messages
2. Advanced Duplicate Detection
   - Semantic code analysis
   - Git diff integration
   - Automatic work redistribution
3. Multi-Model Support
   - Claude for trajectory analysis
   - GPT-4 for code understanding
   - Specialized models for different phases
4. Enhanced Visualizations
   - Real-time trajectory graphs
   - System coherence heatmaps
   - Intervention effectiveness charts
