Queue and Task Management System
Overview
The Queue and Task Management system provides intelligent concurrency control and task lifecycle management for Hephaestus. It allows administrators to:
- Control resource usage by limiting concurrent agents
- Queue tasks when at capacity instead of overwhelming the system
- Prioritize urgent work with "Bump & Start" functionality
- Manage running agents by terminating them when needed
- Restart tasks that failed or need to be re-run
- Cancel queued tasks that are no longer needed
This system prevents machine freezing from too many concurrent agents and reduces token usage by queuing work intelligently.
Architecture Overview
Quick Reference
| Action | Endpoint | Effect | 
|---|---|---|
| Check queue status | GET /api/queue_status | View active agents, queued tasks | 
| Start queued task now | POST /api/bump_task_priority | Bypass limit, start immediately | 
| Cancel queued task | POST /api/cancel_queued_task | Remove from queue, mark as failed | 
| Restart task | POST /api/restart_task | Reset and re-queue/start task | 
| Terminate agent | POST /api/terminate_agent | Kill agent, process queue | 
Key Concepts:
- At capacity: Active agents = max_concurrent_agents
- Queued: Task waiting for an agent slot to open
- Bump & Start: Bypass limit to start urgent task immediately
- Auto-processing: Queue advances automatically when agents complete
Configuration
Max Concurrent Agents
The system is configured via hephaestus_config.yaml:
mcp:
  max_concurrent_agents: 2  # Maximum number of agents that can run simultaneously
Configuration Details:
- Config file: hephaestus_config.yaml(root directory)
- SDK support: Can also be set via Python SDK
- Environment variables: Not supported (by design, for consistency)
- Default value: 10 agents
- Recommended range: 2-5 for development, 5-10 for production
Queue Behavior
When a new task is created:
- 
Under capacity (e.g., 1/2 active agents): - Task is enriched immediately
- Agent is created and started right away
- Task status: pending→assigned→in_progress
 
- 
At capacity (e.g., 2/2 active agents): - Task is enriched immediately (ready to start when slot opens)
- Task is added to queue
- Task status: pending→queued
- When an agent completes/terminates, queue automatically processes next task
 
Task State Machine
Understanding how tasks flow through the queue system:
Queue Priority System
Tasks in the queue are ordered by:
- Priority Boosted (manually bumped tasks) - highest priority
- Task Priority (high > medium > low)
- Queue Time (FIFO - first in, first out)
Queue Position
Each queued task has:
- queue_position: Integer position in queue (1, 2, 3...)
- priority_boosted: Boolean flag for manually prioritized tasks
- queued_at: Timestamp when task was queued
Position #1 is "NEXT UP" - will start as soon as a slot opens.
Task Actions
1. Bump & Start (Start Now)
Purpose: Force a queued task to start immediately, bypassing the agent limit.
Behavior:
- Moves task to queue position #1 (if not already)
- Bypasses max concurrent agents limit
- Creates agent immediately
- System goes over capacity (e.g., 2/2 → 3/2)
- When agents complete, system naturally returns to configured limit
Use Case: Urgent work that can't wait for a slot to open.
API Endpoint: POST /api/bump_task_priority
Request:
{
  "task_id": "task-uuid"
}
Response:
{
  "success": true,
  "message": "Task abc123 started immediately (bypassing agent limit)",
  "agent_id": "agent-uuid"
}
UI Locations:
- Task Details modal: Orange "Start Now" button (for queued tasks)
- Queue section: ⚡ icon button (hover to reveal)
2. Cancel Queued Task
Purpose: Remove a task from the queue and mark it as failed.
Behavior:
- Task is removed from queue
- Task status set to failed
- Failure reason: "Cancelled by user from queue"
- No agent is created
Use Case: Task no longer needed or was created by mistake.
API Endpoint: POST /api/cancel_queued_task
Request:
{
  "task_id": "task-uuid"
}
Response:
{
  "success": true,
  "message": "Task abc123 cancelled and removed from queue"
}
UI Location:
- Queue section: 🗑️ trash icon button (hover to reveal)
3. Restart Task
Purpose: Restart a completed or failed task with a clean slate.
Behavior:
- Clears all completion data:
- failure_reason
- completion_notes
- completed_at
- started_at
- assigned_agent_id
 
- Clears all trajectory data:
- Guardian analyses for old agent
- Steering interventions for old agent
 
- Resets task to pendingstatus
- If under capacity → creates agent immediately
- If at capacity → queues task
Use Case: Retry failed tasks or re-run completed tasks.
API Endpoint: POST /api/restart_task
Request:
{
  "task_id": "task-uuid"
}
Response (started immediately):
{
  "success": true,
  "message": "Task abc123 restarted with new agent",
  "agent_id": "agent-uuid",
  "status": "assigned"
}
Response (queued):
{
  "success": true,
  "message": "Task abc123 restarted and added to queue",
  "status": "queued"
}
UI Location:
- Task Details modal: Purple "Restart Task" button (for done/failed tasks only)
4. Terminate Agent
Purpose: Forcefully stop a running agent and fail its task.
Behavior:
- Terminates tmux session
- Marks agent as terminated
- Marks task as failed
- Frees up agent slot
- Queue automatically processes next task
Use Case: Agent is stuck, doing wrong work, or need to free capacity.
API Endpoint: POST /api/terminate_agent
Request:
{
  "agent_id": "agent-uuid",
  "reason": "Manual termination from UI"
}
Response:
{
  "success": true,
  "message": "Agent abc123 terminated successfully"
}
UI Locations:
- Task Details modal: Red "Terminate Agent" button (for active agents)
- Agents page: Red "Terminate Agent" button on agent cards
- Queue section (running agents): ❌ icon button (hover to reveal)
Queue Section UI Component
The Queue Section appears at the top of the Tasks page when the system is at capacity or has queued tasks.
Collapsed State (~48px height)
Shows at-a-glance info:
- Agent utilization bar
- Slot count (e.g., 2/2)
- Preview of next task
- Number of waiting tasks
Expanded State (max ~300px, scrollable)
Shows two sections:
Running Agents:
- List of active agents
- Current task for each agent
- Runtime duration
- Quick actions: View (👁️), Terminate (❌)
Queued Tasks:
- List of queued tasks in priority order
- Position badges (1, 2, 3...)
- Task #1 highlighted as "NEXT UP" (orange background)
- Priority badges (HIGH/MED/LOW)
- Time queued
- Quick actions: View (👁️), Start Now (⚡), Cancel (🗑️)
- Max height: 200px with scroll
Real-time Updates
The Queue Section subscribes to WebSocket events for instant updates. See WebSocket Events section below for complete event reference.
WebSocket Events
The queue system emits real-time events for UI updates:
| Event Type | When Emitted | Payload Fields | 
|---|---|---|
| task_queued | Task added to queue | task_id,description,queue_position,slots_available | 
| task_created | New task created | task_id,description,agent_id(if started immediately) | 
| task_completed | Task finished | task_id,agent_id,status,summary | 
| task_priority_bumped | Task bumped and started | task_id,agent_id | 
| agent_created | New agent spawned | agent_id,task_id | 
| agent_status_changed | Agent status updated | agent_id,status | 
UI Components Subscribed:
- QueueSection: All events
- TasksPage: task_*events
- AgentList: agent_*events
Implementation Details
Queue Service (src/services/queue_service.py)
Core methods:
should_queue_task() -> bool
- Checks if active agent count >= max concurrent agents
- Returns True if task should be queued instead of started
enqueue_task(task_id: str) -> None
- Adds task to queue
- Sets status = "queued"
- Sets queued_at = now()
- Calculates queue_position
- Automatically recalculates all queue positions
get_next_queued_task() -> Task | None
- Returns highest priority queued task
- Ordering: priority_boosted > priority > queued_at
- Returns None if queue is empty
dequeue_task(task_id: str) -> None
- Removes task from queue
- Sets task status to "assigned"
- Clears queue metadata (queue_position)
- Recalculates positions for remaining tasks
boost_task_priority(task_id: str) -> bool
- Sets priority_boosted = True
- Moves task to position #1
- Recalculates all queue positions
- Returns success status
get_queue_status() -> dict
- Returns complete queue state:
- Active agent count
- Max concurrent agents
- Queued tasks (with details)
- Available slots
- At capacity status
 
Automatic Queue Processing
When does the queue process?
The process_queue() function is called automatically when:
- A task completes (status → done)
- A task fails (status → failed)
- An agent is terminated
What happens?
async def process_queue():
    # Check capacity first
    if should_queue_task():
        return  # At capacity, don't process
    # Get next queued task
    next_task = get_next_queued_task()
    if not next_task:
        return  # Queue is empty
    # Dequeue and start
    dequeue_task(next_task.id)
    create_agent_for_task(next_task)
Key Point: Queue processing respects the max concurrent agents limit. Only "Bump & Start" bypasses it.
Database Schema
Queue management fields added to the Task model:
Location: src/core/database.py
class Task(Base):
    __tablename__ = "tasks"
    # ... existing fields (id, description, status, etc.) ...
    # Queue management fields
    queued_at = Column(DateTime)           # Timestamp when task was queued
    queue_position = Column(Integer)       # Position in queue (1, 2, 3...)
    priority_boosted = Column(Boolean, default=False)  # Manual priority boost flag
Field Usage:
- queued_at: Set when task enters queue, used for FIFO ordering within same priority
- queue_position: Calculated field showing task's place in line, displayed in UI
- priority_boosted: Set to True when "Bump & Start" is used, gives highest priority
Note: These fields are NULL for tasks that never enter the queue (started immediately).
API Endpoints Summary
| Endpoint | Method | Purpose | Bypasses Limit? | 
|---|---|---|---|
| /api/bump_task_priority | POST | Start queued task immediately | ✅ Yes | 
| /api/cancel_queued_task | POST | Remove task from queue | N/A | 
| /api/restart_task | POST | Restart done/failed task | ❌ No | 
| /api/terminate_agent | POST | Kill running agent | N/A | 
| /api/queue_status | GET | Get queue state | N/A | 
User Workflows
Workflow 1: System at Capacity
Scenario: You have 2/2 agents running, create a new task.
Steps:
- New task is created
- System checks capacity: should_queue_task()returns True
- Task is enriched (ready to start)
- Task is queued at position #3
- Queue section appears on Tasks page
- User sees: "2/2 active, 3 waiting"
What happens next:
- When an agent completes → queue automatically processes → task #1 starts
- User can "Start Now" to bypass limit → goes to 3/2 temporarily
Workflow 2: Urgent Task in Queue
Scenario: Task at position #5 is suddenly urgent.
- Open Tasks page
- Expand queue section
- Find task at position #5
- Click ⚡ "Start Now" button
- Confirm: "This will bypass the agent limit"
- Task starts immediately (2/2 → 3/2)
- When any agent completes → back to 2/2
- Queue continues processing normally
Workflow 3: Restart Failed Task
Scenario: Task failed, want to try again.
- Open task in Task Details modal
- See status: "failed"
- Click purple "Restart Task" button
- Confirm: "This will clear completion data and trajectory"
- If under capacity → task starts immediately
- If at capacity → task is queued
- Old trajectory data is deleted
- Fresh agent starts from scratch
Workflow 4: Cancel Unwanted Task
Scenario: Created wrong task, it's in queue.
- Open Tasks page
- Expand queue section
- Find task in queue
- Click 🗑️ "Cancel" button
- Confirm: "Task will be marked as failed"
- Task removed from queue
- Queue positions recalculated
- Task marked as failed
Frontend Implementation
QueueSection Component
Location: frontend/src/components/QueueSection.tsx
Purpose: Real-time queue visualization and management interface that appears at the top of the Tasks page.
Props:
interface QueueSectionProps {
  onViewTaskDetails: (taskId: string) => void;  // Callback to open task modal
  onRefreshTasks: () => void;                   // Callback to refresh parent task list
}
Key Features:
- Collapsible UI (starts collapsed by default)
- Only renders when relevant (queue exists OR system at capacity)
- Real-time WebSocket updates for instant UI refresh
- Scrollable queue list (max 200px height)
- Hover to reveal action buttons (prevents accidental clicks)
- Responsive design with dark mode support
State Management:
- Uses React Query for data fetching (queryKey: 'queue-status')
- Auto-refetch interval: 3 seconds (fallback to polling)
- WebSocket subscriptions for instant updates
- Optimistic UI updates on user actions
Task Detail Modal Updates
New Buttons:
- 
Start Now (orange, queued tasks only) - Icon: ⚡ Zap
- Calls bumpTaskPriority()
 
- 
Restart Task (purple, done/failed tasks only) - Icon: 🔄 RotateCcw
- Calls restartTask()
 
- 
Terminate Agent (red, active agents only) - Icon: ❌ XCircle
- Calls terminateAgent()
 
Configuration Best Practices
Choosing max_concurrent_agents
Consider:
- Machine resources: CPU, RAM available
- Token budget: More agents = more API usage
- Task complexity: Heavy tasks need fewer concurrent slots
- Response time: Higher limit = faster throughput but more resource usage
Recommendations:
| Use Case | Recommended Limit | Reasoning | 
|---|---|---|
| Development/Testing | 2-3 | Easy to observe queue behavior | 
| Production (light tasks) | 5-8 | Good balance of throughput and control | 
| Production (heavy tasks) | 2-4 | Prevent resource exhaustion | 
| High-end machine | 10+ | Maximum throughput | 
Monitoring Queue Health
Key Metrics:
- Average queue wait time
- Queue depth over time
- Frequency of "Bump & Start" usage
- Agent completion rate
Healthy Queue:
- Tasks don't wait more than 5-10 minutes
- Queue rarely exceeds 5-10 tasks
- Few manual interventions needed
Unhealthy Queue:
- Tasks waiting 30+ minutes
- Queue consistently at 20+ tasks
- Frequent manual bumping needed
- Action: Increase max_concurrent_agentsor optimize task complexity
Troubleshooting
Issue: Tasks Stuck in Queue
Symptoms: Tasks queued but never start, even when capacity available.
Diagnosis Steps:
- 
Check current agent count: curl http://localhost:8000/api/queue_status | jqVerify active_agents<max_concurrent_agents
- 
Check for stuck agents: - Open UI → Agents page
- Look for agents with very long runtime (>30min)
- Check Guardian analyses for stuck indicators
 
- 
Review server logs: grep "process_queue" hephaestus.log
 grep "should_queue_task" hephaestus.log
Common Causes:
- Agents in "working" status that are actually stuck
- Database lock preventing queue updates
- process_queue() not being triggered after agent completion
Solutions:
- Quick fix: Manually terminate stuck agents via UI
- Investigate: Check Guardian logs for why agent is stuck
- Reset: Restart Hephaestus server if queue state is corrupted
- Prevention: Lower max_concurrent_agents if agents frequently get stuck
Issue: Queue Not Processing After Agent Completion
Symptoms: Agent completes, slot opens, but queue doesn't advance.
Diagnosis Steps:
- 
Verify task status in database: sqlite3 hephaestus.db "SELECT id, status, queue_position FROM tasks WHERE status='queued' ORDER BY queue_position;"
- 
Check server logs for process_queue calls: grep "Processing queued task" hephaestus.log | tail -10
- 
Check if WebSocket events are firing: - Open browser DevTools → Network → WS tab
- Complete an agent
- Look for task_completedevent
 
Common Causes:
- Task status not set to "queued" properly
- process_queue() not called after agent termination
- WebSocket connection dropped (UI won't update but queue should process)
Solutions:
- Database fix: Manually set task status: UPDATE tasks SET status='queued' WHERE id='task-id';
- Trigger manually: Call /api/queue_statusto refresh state
- Restart: Server restart will recalculate queue positions
- Verify code: Ensure all termination paths call process_queue()
Issue: Too Many Agents Running (Over Capacity)
Symptoms: More agents running than configured limit (e.g., 3/2, 5/2).
Expected Behavior: This is intentional when "Bump & Start" is used to handle urgent tasks.
Diagnosis: Check logs for bump actions:
grep "bump.*priority" hephaestus.log | tail -5
grep "bypassing limit" hephaestus.log | tail -5
If Unexpected (no bump actions):
- Check for race conditions: Multiple agents created simultaneously
- Verify should_queue_task()is called before every agent creation
- Check for code paths that bypass queue service
Resolution:
- Normal: System returns to limit naturally as agents complete
- Urgent: Manually terminate excess agents if necessary
- Prevention: Add locks around agent creation if race conditions persist
Future Enhancements
Potential improvements to the queue system:
- Queue persistence: Survive server restarts
- Queue analytics: Dashboard showing queue metrics
- Auto-scaling: Dynamically adjust max_concurrent_agents based on load
- Advanced scheduling: Time-based task execution
- Resource-based queueing: Consider CPU/memory, not just agent count
- Queue policies: Configurable priority strategies
- Task dependencies: Queue tasks waiting for parent completion
Related Documentation
- Monitoring Architecture - Guardian and Conductor agents for system health
- Validation System - Task validation flow and quality control
- Worktree Isolation - Agent workspace isolation system
- Agent Communication - Inter-agent messaging and coordination
- Task Deduplication - Preventing duplicate work across agents