Running the SWEBench-Verified Benchmark

Help us evaluate Hephaestus on the SWEBench-Verified benchmark! This guide explains how community contributors can run benchmark instances and contribute results.

🎯 Overview

We're running a large-scale evaluation of Hephaestus on the SWEBench-Verified benchmark, which contains real-world GitHub issues from popular Python repositories. We need your help to complete this evaluation!

Your contributions will be credited in our final report.

📋 Prerequisites

Required

  • Claude Code Subscription - You'll need an active subscription with access to Claude Sonnet-4.5
    • This is the AI agent that will solve the issues
    • IMPORTANT: You must use Claude Sonnet-4.5 (not older models)
  • Docker - Used to run the Qdrant vector database

Provided by Us

  • OpenRouter API Key - We'll provide you with an API key for gpt-oss access
    • This is used for the orchestration system
    • Keys will be distributed via Discord

🚀 Getting Started

1. Join Discord for Coordination

Join our Discord channel to coordinate instance assignments and avoid duplicates:

Discord: https://discord.gg/FyrC4fpS

  • Announce which instances you're planning to run
  • Get your OpenRouter API key
  • Ask questions and get support
  • Share your progress

2. Setup Hephaestus

Follow the Getting Started Guide to:

  • Clone the repository
  • Install dependencies
  • Start required services (Qdrant, MCP server, monitoring)

Additional requirement: Install the datasets library for instance generation:

pip install datasets
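
A quick way to confirm the install is to print the library version:

python -c "import datasets; print(datasets.__version__)"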

🎲 Generating Your Instance Batch

1. Generate Instances to Run

Use the provided script to generate a list of uncompleted instances:

# Generate 10 instances to run
python generate_instances.py --count 10

# Or generate more instances
python generate_instances.py --count 20

This script will:

  • Fetch the SWEBench-Verified dataset (cached after first run)
  • Check which instances are already completed in swebench_results/
  • Generate an instances.yaml file with N uncompleted instances
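
For reference, the generated file is essentially a flat list of instance IDs. The layout below is illustrative only - the exact keys may differ, so trust whatever generate_instances.py actually writes:

# instances.yaml (illustrative layout)
instances:
  - django__django-13109
  - psf__requests-1921
  - astropy__astropy-7166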

2. Review Your Assigned Instances

The script will output the instances it selected. Announce these in Discord to avoid duplicates!

Example output:

[Success] Created instances.yaml with 10 instances:

[Info] Selected instances:
1. django__django-13109
2. psf__requests-1921
3. astropy__astropy-7166
...

🏃 Running the Benchmark

Start the Batch Run

Run your instances with the batch runner:

python run_swebench_batch.py \
    --instances instances.yaml \
    --timeout 3000 \
    --poll-interval 30

Parameters:

  • --instances: Your generated YAML file
  • --timeout: Max time per instance in seconds (3000s = 50 minutes)
  • --poll-interval: How often to check for results (30s recommended)

What Happens During the Run

The batch runner will:

  1. Process each instance sequentially
  2. Create a unique workspace for each instance
  3. Run the Hephaestus workflow to solve the issue
  4. Monitor for validated results via API polling (see the sketch after this list)
  5. Collect solution patches and logs
  6. Clean up between instances
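
Conceptually, the polling in step 4 reduces to a loop like the following. This is only a sketch - the endpoint URL and JSON field are invented for illustration and are not the actual Hephaestus API:

# Hypothetical polling loop (endpoint and field name are assumptions)
while true; do
  validated=$(curl -s "http://localhost:8000/api/results/<instance_id>" | jq -r '.validated // empty')
  [ -n "$validated" ] && break  # a validated result appeared; collect it
  sleep 30                      # matches --poll-interval
done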

Monitor Progress

You'll see output like:

======================================================================
INSTANCE 1/10: django__django-13109
======================================================================

[Workflow] Starting: django__django-13109
[Workflow] Workspace: /tmp/swebench_django__django-13109_abc123

[Monitor] Starting to monitor for validated result...
[Monitor] Polling interval: 30s
[Monitor] Timeout: 3000s

[Poll] No validated result yet. Waiting 30s... (timeout in 2970s)
...

[Result] ✅ Found validated result!
[Collect] ✅ Results collected successfully

Expected Runtime

  • The system has a timeout of 50 minutes per instance (configurable)
  • Some instances may complete faster, others may take the full timeout
  • Runtime varies significantly based on issue complexity

Tip: Run batches overnight or during times when you won't need your machine.
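
One way to do that is to detach the batch from your terminal session, for example with nohup (tmux or screen work just as well):

# Keep the batch running after you log out; capture output in batch.log
nohup python run_swebench_batch.py \
    --instances instances.yaml \
    --timeout 3000 \
    --poll-interval 30 > batch.log 2>&1 &

# Follow progress from another shell
tail -f batch.log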

Watch Hephaestus in Action

Want to see what's happening under the hood? You can spin up the frontend to watch Hephaestus solve SWEBench problems in real-time!

# In a new terminal, start the frontend
cd frontend
npm install
npm run dev

Then open your browser to http://localhost:5173 to access the Hephaestus dashboard. You'll be able to:

  • Monitor task progress and agent activity
  • View submitted solutions and validation results
  • Track which instances are being solved
  • See detailed execution logs

(Screenshot: the Hephaestus dashboard results page)

📦 Results Collection

What Gets Saved

Results are saved to swebench_results/<instance_id>/:

  • solution.patch - The generated fix
  • hephaestus.db - Complete execution logs
  • metadata.json - Timing and status information
  • result_metadata.json - Detailed result data
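
To sanity-check a finished instance, inspect the collected files directly (the instance ID below is just an example):

# View the generated fix
cat swebench_results/django__django-13109/solution.patch

# Pretty-print timing and status info (requires jq)
jq . swebench_results/django__django-13109/metadata.json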

Submitting Your Results

Once your batch completes, submit your results via a Pull Request so your name gets credited properly!

  1. Create a new branch:

    git checkout -b swebench-results-yourname
  2. Commit your results:

    git add swebench_results/
    git commit -m "Add SWEBench results - [Your Name] - [Instance Count] instances"
  3. Push your branch:

    git push origin swebench-results-yourname
  4. Create a Pull Request:

    • Go to the GitHub repository
    • Create a PR from your branch to main
    • Title: "SWEBench Results - [Your Name] - [Instance Count] instances"
    • In the description, include:
      • List of instances you ran
      • Any issues or interesting findings
      • Your preferred name/username for credit
  5. Announce in Discord:

    • Let everyone know you've submitted a PR
    • Share the PR link
    • Report any instances that failed or had issues

This ensures your contribution is properly tracked and you get credited in the final report! 🎖️

📊 Tracking Progress

Check Completed Instances

# List completed instances
ls swebench_results/

# Count completed instances
ls swebench_results/ | wc -l
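
Since each successful instance should contain a solution.patch, you can also count only the instances that actually produced a fix:

# Count instances that produced a patch
ls swebench_results/*/solution.patch 2>/dev/null | wc -l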

View Batch Statistics

The batch runner shows final statistics:

======================================================================
BATCH RUN COMPLETE
======================================================================

Total instances: 10
Completed: 8 (80.0%)
Failed: 2 (20.0%)

Results saved to: ./swebench_results

⚠️ Troubleshooting

Common Issues

Issue: "datasets library not found"

pip install datasets

Issue: "Cannot connect to Qdrant"

# Make sure Docker is running
docker ps | grep qdrant

# Restart Qdrant if needed
docker run -d -p 6333:6333 qdrant/qdrant
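
Once the container is up, Qdrant's REST endpoint on its default port should answer with a small JSON blob containing the version:

curl http://localhost:6333/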

Issue: "API key not found"

  • Check that your .env file exists and contains the correct keys
  • Make sure you sourced the environment variables
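
If the key is present in .env but the runner still can't see it, export the file's contents into your current shell before launching. The variable name below is an assumption - use whatever name your .env actually defines:

# Export everything in .env into the current shell
set -a; source .env; set +a

# Verify the key is visible (variable name is illustrative)
echo "${OPENROUTER_API_KEY:+key is set}"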

Issue: Instance taking too long

  • The system has a timeout (50 minutes default)
  • Some instances are genuinely harder - this is normal
  • Failed instances will be marked in the final stats

Getting Help

  • Discord: https://discord.gg/FyrC4fpS - Ask questions, get support
  • GitHub Issues: Report bugs or problems
  • Documentation: See other guides in website/docs/guides/

🎖️ Recognition

All contributors will be credited in our final SWEBench evaluation report!

We'll acknowledge:

  • Your name/username
  • Number of instances you ran
  • Any special contributions or findings

Thank you for helping advance autonomous AI agent research! 🚀

🔄 Running Multiple Batches

Want to contribute more? After completing a batch:

  1. Generate new instances (script auto-excludes completed ones):

    python generate_instances.py --count 10 --output batch2.yaml
  2. Run the new batch:

    python run_swebench_batch.py --instances batch2.yaml --timeout 3000 --poll-interval 30
  3. Submit results again!


Questions? Join us on Discord! 🎮