Diagnostic Agent System
Overview
The Diagnostic Agent System is a self-healing mechanism that prevents workflows from getting permanently stuck. When all tasks are complete but the workflow goal hasn't been achieved, a specialized diagnostic agent analyzes the situation and creates new tasks to push the workflow forward.
Purpose
In complex workflows, agents sometimes:
- Complete their individual tasks successfully but miss the bigger picture
- Fail to submit final results even though the work is done
- Get stuck in a particular phase when they should move to another
- Need to revisit earlier phases based on failures in later phases
The diagnostic agent serves as a "workflow doctor" that:
- Detects when the workflow is stuck
- Analyzes what's been accomplished
- Diagnoses what's missing
- Creates targeted tasks to achieve the workflow goal
When It Activates
The diagnostic agent triggers automatically when ALL of the following conditions are met:
- Active workflow exists: A workflow with phases is currently running
- Tasks exist: At least one task has been created in the workflow
- All tasks finished: No tasks have status pending, assigned, in_progress, under_review, or validation_in_progress
- No validated result: No WorkflowResult with status validated has been submitted
- Cooldown passed: At least diagnostic_cooldown_seconds (default: 60s) have passed since the last diagnostic agent was created
- Stuck long enough: At least diagnostic_min_stuck_time_seconds (default: 60s) have passed since the last task was created or completed
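The six conditions above can be read as a single predicate. The sketch below is one possible way to express it, not the actual Hephaestus implementation: WorkflowSnapshot and should_trigger_diagnostic are hypothetical names, and treating "no previous diagnostic" as "cooldown passed" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Statuses the list above treats as "not finished".
ACTIVE_STATUSES = {
    "pending", "assigned", "in_progress", "under_review", "validation_in_progress",
}


@dataclass
class WorkflowSnapshot:
    """Hypothetical view of the state the monitor inspects (not a Hephaestus type)."""
    has_active_workflow: bool
    task_statuses: List[str]
    has_validated_result: bool
    seconds_since_last_diagnostic: Optional[float]  # None if no diagnostic has run yet
    seconds_since_last_task_activity: float


def should_trigger_diagnostic(
    snap: WorkflowSnapshot,
    cooldown_seconds: float = 60,
    min_stuck_time_seconds: float = 60,
) -> bool:
    """Return True only when every trigger condition holds."""
    if not snap.has_active_workflow:
        return False
    if not snap.task_statuses:  # at least one task must exist
        return False
    if any(status in ACTIVE_STATUSES for status in snap.task_statuses):
        return False
    if snap.has_validated_result:
        return False
    # Assumption: with no earlier diagnostic, the cooldown counts as passed.
    last = snap.seconds_since_last_diagnostic
    if last is not None and last < cooldown_seconds:
        return False
    return snap.seconds_since_last_task_activity >= min_stuck_time_seconds
```

Because the checks short-circuit in order, a workflow with any active task is rejected before the timing conditions are evaluated.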
How It Works
1. Detection (MonitoringLoop)
Every monitoring cycle (default: 60 seconds), the MonitoringLoop._check_workflow_stuck_state() method runs a check equivalent to the following:
# Pseudo-code
if workflow_exists and has_tasks:
    if all_tasks_finished and no_validated_result:
        if cooldown_passed and stuck_long_enough:
            create_diagnostic_agent()
2. Context Gathering
When triggered, the system gathers comprehensive context:
Workflow Information:
- Workflow goal (from result_criteria)
- All phase definitions with their objectives
- Current phase statuses
Recent History:
- Last 15 completed/failed agents (configurable)
- Their task descriptions, statuses, and outcomes
- Completion notes and failure reasons
System Observations:
- Last 5 Conductor system analyses
- Duplicate work detections
- System coherence scores
Submitted Results:
- Any result submissions (even if rejected)
- Validation feedback explaining rejections
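The gathered context can be pictured as one structure with bounded history. The sketch below is illustrative only: DiagnosticContext and build_context are hypothetical names, the dictionary-shaped records are placeholders, and the inputs are assumed to be ordered oldest-first. The two limits mirror max_agents_to_analyze and max_conductor_analyses.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Record = Dict[str, Any]  # placeholder shape; the real records are not specified here


@dataclass
class DiagnosticContext:
    """Hypothetical container for what the diagnostic prompt receives."""
    workflow_goal: str
    phases: List[Record]
    recent_agents: List[Record]
    conductor_analyses: List[Record]
    submitted_results: List[Record]


def _tail(items: List[Record], limit: int) -> List[Record]:
    """Keep the last `limit` items; a non-positive limit keeps nothing."""
    return list(items[-limit:]) if limit > 0 else []


def build_context(
    goal: str,
    phases: List[Record],
    finished_agents: List[Record],
    analyses: List[Record],
    results: List[Record],
    max_agents_to_analyze: int = 15,
    max_conductor_analyses: int = 5,
) -> DiagnosticContext:
    return DiagnosticContext(
        workflow_goal=goal,
        phases=list(phases),
        recent_agents=_tail(finished_agents, max_agents_to_analyze),
        conductor_analyses=_tail(analyses, max_conductor_analyses),
        submitted_results=list(results),  # rejected submissions are kept too
    )
```

Capping the history keeps the diagnostic prompt bounded regardless of how long the workflow has been running.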
3. Agent Creation
A diagnostic task and agent are created:
Task(
    description="DIAGNOSTIC: Analyze why workflow has stalled and create tasks to progress toward goal",
    done_definition="Created 1-5 new tasks with clear phase assignments",
    agent_type="diagnostic",
    phase_id=None,  # Diagnostic tasks span all phases
)
The diagnostic agent:
- Works in the main repository (no worktree isolation)
- Gets a specialized prompt with all gathered context
- Has access to all Hephaestus MCP tools
- Can create tasks in any phase
4. Diagnostic Process
The diagnostic agent follows a structured 4-step process:
Step 1: Understand the Goal
- Reads the workflow's result_criteria
- Identifies what "success" looks like
Step 2: Analyze Current State
- Reviews what agents have accomplished
- Examines which phases have progressed
- Checks what outputs have been created
- Analyzes any result submission failures
Step 3: Identify the Gap
- Diagnoses why the goal hasn't been achieved
- Identifies common stuck scenarios:
  - Missing evidence/documentation
  - Incomplete implementation
  - Wrong direction
  - Premature task completion
  - Phase misalignment
  - Validation failures
Step 4: Create Tasks
- Uses the create_task MCP tool to create 1-5 tasks
- Assigns tasks to appropriate phases
- Defines concrete completion criteria
- Marks diagnostic task as done
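The constraints in Step 4 (1-5 tasks, each with a phase and concrete completion criteria) can be checked before any task is created. The sketch below is a hypothetical pre-flight check, not part of the Hephaestus tooling: ProposedTask and validate_proposals are illustrative names, and the string-typed phase_id is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProposedTask:
    """Hypothetical shape of a task the diagnostic agent intends to create."""
    description: str
    done_definition: str
    phase_id: Optional[str]


def validate_proposals(tasks: List[ProposedTask], max_tasks_per_run: int = 5) -> List[str]:
    """Return a list of human-readable problems; an empty list means the batch is acceptable."""
    problems: List[str] = []
    if not 1 <= len(tasks) <= max_tasks_per_run:
        problems.append(f"expected 1-{max_tasks_per_run} tasks, got {len(tasks)}")
    for index, task in enumerate(tasks):
        if not task.done_definition.strip():
            problems.append(f"task {index}: missing completion criteria")
        if not task.phase_id:
            problems.append(f"task {index}: missing phase assignment")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the caller report the whole batch in a single message.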
5. Workflow Progression
Once the diagnostic agent creates new tasks:
- Tasks are picked up by regular agents
- Workflow progresses toward the goal
- System continues monitoring
- Another diagnostic may trigger if needed (after cooldown)
Configuration
YAML Configuration (hephaestus_config.yaml)
diagnostic_agent:
  enabled: true  # Enable/disable diagnostic agents
  cooldown_seconds: 60  # Min time between diagnostics
  min_stuck_time_seconds: 60  # How long "stuck" before triggering
  max_agents_to_analyze: 15  # Number of recent agents in context
  max_conductor_analyses: 5  # Number of Conductor analyses in context
  max_tasks_per_run: 5  # Max tasks diagnostic can create
Environment Variables
DIAGNOSTIC_AGENT_ENABLED=true
DIAGNOSTIC_COOLDOWN_SECONDS=60
DIAGNOSTIC_MIN_STUCK_TIME=60
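For illustration, the variables above could be read as follows. This is a sketch under stated assumptions, not how Hephaestus itself parses its environment: load_diagnostic_settings is a hypothetical helper, the set of accepted truthy strings is a guess, and the enabled-by-default fallback simply mirrors the sample configuration.

```python
import os
from typing import Mapping, Optional

# Assumption: which strings count as "true" is not specified by the source.
TRUTHY = {"1", "true", "yes", "on"}


def load_diagnostic_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three diagnostic variables, falling back to the documented 60s defaults."""
    env = os.environ if environ is None else environ
    enabled_raw = env.get("DIAGNOSTIC_AGENT_ENABLED")
    return {
        # Assumption: enabled when the variable is unset.
        "enabled": True if enabled_raw is None else enabled_raw.strip().lower() in TRUTHY,
        "cooldown_seconds": int(env.get("DIAGNOSTIC_COOLDOWN_SECONDS", "60")),
        "min_stuck_time_seconds": int(env.get("DIAGNOSTIC_MIN_STUCK_TIME", "60")),
    }
```

Passing a plain dictionary instead of relying on os.environ makes the helper easy to exercise in tests.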
SDK Configuration
from hephaestus_sdk import HephaestusConfig
config = HephaestusConfig(
    diagnostic_agent_enabled=True,
    diagnostic_cooldown_seconds=60,
    diagnostic_min_stuck_time_seconds=60,
)
Database Schema
DiagnosticRun Table
Tracks each diagnostic agent execution:
CREATE TABLE diagnostic_runs (
    id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    diagnostic_agent_id TEXT,
    diagnostic_task_id TEXT,
    -- Trigger conditions
    triggered_at DATETIME NOT NULL,
    total_tasks_at_trigger INTEGER NOT NULL,
    done_tasks_at_trigger INTEGER NOT NULL,
    failed_tasks_at_trigger INTEGER NOT NULL,
    time_since_last_task_seconds INTEGER NOT NULL,
    -- Results
    tasks_created_count INTEGER DEFAULT 0,
    tasks_created_ids JSON,
    completed_at DATETIME,
    status TEXT CHECK(status IN ('created', 'running', 'completed', 'failed')),
    -- Analysis context
    workflow_goal TEXT,
    phases_analyzed JSON,
    agents_reviewed JSON,
    diagnosis TEXT,
    FOREIGN KEY (workflow_id) REFERENCES workflows(id),
    FOREIGN KEY (diagnostic_agent_id) REFERENCES agents(id),
    FOREIGN KEY (diagnostic_task_id) REFERENCES tasks(id)
);
Agent Type Update
The agents.agent_type constraint now includes 'diagnostic':
agent_type TEXT CHECK(agent_type IN ('phase', 'validator', 'result_validator', 'monitor', 'diagnostic'))
Monitoring & Observability
Logs
Diagnostic agents produce distinctive log messages:
🚨 WORKFLOW STUCK DETECTED - 120s with no progress
🔍 Creating diagnostic agent for workflow abc12345
✅ Diagnostic agent def67890 created for workflow abc12345
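If these messages need to be scanned programmatically, a small parser is enough. The sketch below assumes the "stuck detected" line looks exactly like the sample above; the real log format may include timestamps or other prefixes, and stuck_seconds and longest_stall are hypothetical helper names.

```python
import re
from typing import Iterable, Optional

# Pattern copied from the sample message above; the real log format may differ.
STUCK_RE = re.compile(r"WORKFLOW STUCK DETECTED - (\d+)s with no progress")


def stuck_seconds(line: str) -> Optional[int]:
    """Return the reported stall duration in seconds, or None for unrelated lines."""
    match = STUCK_RE.search(line)
    return int(match.group(1)) if match else None


def longest_stall(lines: Iterable[str]) -> Optional[int]:
    """Return the largest stall duration seen, or None if no line matched."""
    durations = [s for s in map(stuck_seconds, lines) if s is not None]
    return max(durations) if durations else None
```

Using search rather than match means a leading timestamp or logger name does not prevent the line from being recognized.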
Database Queries
View all diagnostic runs:
SELECT * FROM diagnostic_runs ORDER BY triggered_at DESC;
Check diagnostic effectiveness:
SELECT
    dr.id,
    dr.triggered_at,
    dr.tasks_created_count,
    dr.status,
    COUNT(t.id) as tasks_completed
FROM diagnostic_runs dr
LEFT JOIN tasks t ON t.created_by_agent_id = dr.diagnostic_agent_id
    AND t.status = 'done'
GROUP BY dr.id;
See which phases diagnostics create tasks in:
SELECT
    p.name as phase_name,
    COUNT(t.id) as tasks_created
FROM tasks t
JOIN agents a ON t.created_by_agent_id = a.id
JOIN phases p ON t.phase_id = p.id
WHERE a.agent_type = 'diagnostic'
GROUP BY p.name;
Troubleshooting
Diagnostic Not Triggering
Symptoms: Workflow seems stuck but no diagnostic agent is created
Check:
- Is diagnostic_agent_enabled set to true?
- Are there any active tasks? (Check the tasks table)
- Has the cooldown period passed? (Check diagnostic_runs for the last run)
- Has workflow been stuck long enough? (Check diagnostic_min_stuck_time_seconds)
Debug:
-- Check task statuses for the workflow
SELECT workflow_id, status FROM tasks WHERE workflow_id = '<workflow_id>';
-- Check the most recent diagnostic run
SELECT * FROM diagnostic_runs
WHERE workflow_id = '<workflow_id>'
ORDER BY triggered_at DESC LIMIT 1;
Diagnostic Creating Wrong Tasks
Symptoms: Diagnostic creates tasks but they don't help
Possible causes:
- Insufficient context (increase max_agents_to_analyze)
- Poor workflow goal definition (review result_criteria)
- Diagnostic agent misunderstood situation
Solutions:
- Review the diagnostic agent's output in its tmux session
- Check the diagnosis field in the diagnostic_runs table
- Improve workflow phase done_definitions for clarity
Too Many Diagnostics
Symptoms: Diagnostics keep triggering in a loop
Causes:
- Cooldown too short
- Diagnostic creates tasks that immediately complete
Solutions:
diagnostic_agent:
  cooldown_seconds: 120  # Increase cooldown
  min_stuck_time_seconds: 120  # Require longer stuck time
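Before raising the cooldown, it can help to confirm that diagnostics really are firing in a loop. The sketch below counts runs inside a recent window using triggered_at timestamps; the helper names, the 10-minute window, and the threshold of 3 are all illustrative assumptions rather than Hephaestus settings.

```python
from datetime import datetime, timedelta
from typing import Iterable


def runs_in_window(triggered_at: Iterable[datetime], now: datetime, window: timedelta) -> int:
    """Count diagnostic runs triggered within `window` before `now`."""
    return sum(1 for ts in triggered_at if timedelta(0) <= now - ts <= window)


def looks_like_loop(
    triggered_at: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=10),  # hypothetical window
    threshold: int = 3,  # hypothetical threshold
) -> bool:
    """Flag a probable loop when enough diagnostics fired recently."""
    return runs_in_window(triggered_at, now, window) >= threshold
```

The timestamps can come from the triggered_at column of diagnostic_runs for the workflow in question.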
Diagnostic Agent Fails
Symptoms: Diagnostic task shows status failed
Check:
- Diagnostic agent logs in tmux
- failure_reason in the tasks table
- MCP tool availability
Recovery:
- System will retry after cooldown period
- Investigate and fix underlying issue
- Manually create tasks if needed
Best Practices
1. Clear Workflow Goals
Define concrete, measurable result_criteria:
# ❌ Vague
result_criteria: "Complete the project"
# ✅ Specific
result_criteria: |
  Submit a result.md file containing:
  - The cracked password
  - Full methodology used
  - Execution outputs as proof
  - Use submit_result() tool to submit
2. Detailed Done Definitions
Help diagnostic agents understand what "done" means:
# ❌ Vague
Done_Definitions:
  - "Tests pass"
# ✅ Specific
Done_Definitions:
  - "All unit tests in tests/ directory pass with 0 failures"
  - "Integration tests in tests/integration/ execute successfully"
  - "Test results saved to test_results.txt with timestamps"
3. Completion Notes
Agents should provide detailed completion notes:
# Help diagnostic understand what was actually done
update_task_status(
    task_id="...",
    status="done",
    summary="Created test_password.go with 15 test cases. All tests pass. Output saved to test_output.txt"
)
4. Monitor Diagnostic Effectiveness
Regularly check:
-- Diagnostic success rate
SELECT
    COUNT(CASE WHEN tasks_created_count > 0 THEN 1 END) as successful,
    COUNT(*) as total,
    ROUND(100.0 * COUNT(CASE WHEN tasks_created_count > 0 THEN 1 END) / COUNT(*), 2) as success_rate
FROM diagnostic_runs;
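The same check can be run from Python with the standard sqlite3 module. The sketch below is for illustration only: it uses an in-memory database and a reduced diagnostic_runs table with just the columns the query needs, and demo_connection and diagnostic_success_rate are hypothetical helpers. With no rows, SQLite's division by zero yields NULL, so success_rate comes back as None.

```python
import sqlite3
from typing import Iterable

SUCCESS_RATE_QUERY = """
SELECT
    COUNT(CASE WHEN tasks_created_count > 0 THEN 1 END) AS successful,
    COUNT(*) AS total,
    ROUND(100.0 * COUNT(CASE WHEN tasks_created_count > 0 THEN 1 END) / COUNT(*), 2) AS success_rate
FROM diagnostic_runs
"""


def demo_connection(task_counts: Iterable[int]) -> sqlite3.Connection:
    """Build an in-memory stand-in for diagnostic_runs (reduced schema, demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE diagnostic_runs (id TEXT PRIMARY KEY, tasks_created_count INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO diagnostic_runs VALUES (?, ?)",
        [(f"run-{i}", count) for i, count in enumerate(task_counts)],
    )
    return conn


def diagnostic_success_rate(conn: sqlite3.Connection) -> dict:
    """Run the success-rate query and return its single row as a dictionary."""
    successful, total, rate = conn.execute(SUCCESS_RATE_QUERY).fetchone()
    return {"successful": successful, "total": total, "success_rate": rate}
```

Against a real deployment, the connection would point at the Hephaestus database file instead of the in-memory demo.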
Integration with Existing Systems
Guardian & Conductor
Diagnostic agents work alongside:
- Guardian: Monitors individual agent health
- Conductor: Detects system-wide issues (duplicates, coherence)
- Diagnostic: Handles workflow-level stuckness
They complement each other:
- Guardian/Conductor run every monitoring cycle
- Diagnostic only triggers when workflow is stuck
- All three share the same monitoring infrastructure
Validation System
Diagnostic agents respect the validation system:
- Won't trigger if workflow has validated result
- Considers validation feedback when analyzing
- May create validation tasks if results were rejected
Phase System
Diagnostic agents are phase-aware:
- Can create tasks in any phase (not just current)
- May recommend going back to earlier phases
- Understands phase dependencies and progression
Examples
Example 1: Missing Result Submission
Situation:
- All tests passed
- No result submitted
Diagnostic finds:
- Tasks show "tests pass" but no evidence file
- No submit_result calls in logs
Tasks created:
Phase 3: "Create evidence.md documenting all test outputs and execution steps"
Phase 3: "Submit result using submit_result() tool with evidence.md as proof"
Example 2: Implementation Incomplete
Situation:
- "Implementation" phase tasks all done
- "Testing" phase tasks failing
Diagnostic finds:
- Tests can't run - missing dependencies
- Implementation didn't include setup steps
Tasks created:
Phase 2: "Add dependency installation to setup.sh script"
Phase 2: "Document build prerequisites in BUILD.md"
Phase 3: "Re-run tests after dependencies are installed"
Example 3: Wrong Architectural Approach
Situation:
- Multiple implementation attempts failed
- All with similar errors
Diagnostic finds:
- Approach doesn't match codebase architecture
- Need to revisit planning phase
Tasks created:
Phase 1: "Analyze existing codebase architecture in detail"
Phase 1: "Design integration approach matching current patterns"
Phase 2: "Implement using new architectural approach from Phase 1"
Future Enhancements
Potential improvements to the diagnostic system:
- Learning from Past Diagnostics
  - Store successful diagnostic patterns
  - Use RAG to suggest similar solutions
- Multi-Agent Diagnostics
  - Create diagnostic teams for complex analysis
  - Parallel investigation of different hypotheses
- Proactive Diagnostics
  - Trigger before complete stuck state
  - Based on trajectory analysis
- User Notifications
  - Alert users when diagnostic triggers
  - Request human input for ambiguous situations
Support
For issues with the diagnostic agent system:
- Check logs in logs/monitor.log
- Query the diagnostic_runs table for history
- Review diagnostic agent tmux sessions
- Open issue on GitHub with diagnostic run ID