Execution-Graded Benchmark Suite

44 benchmarks across 6 difficulty tiers and 35 professional domains. Evaluation is driven by bun:test execution, not keyword matching alone.

v2.1 · Last updated: 2026-03-21 · 8gent-code v0.8.0

Abstract

This suite evaluates LLM code generation using SWE-Bench-style execution grading. Each benchmark ships a test harness that the generated code must pass: LLM output is extracted, written to a temp directory, and executed via bun test. The combined grade is 70% execution score plus 30% keyword coverage. A temperature sweep (0.3, 0.5, 0.7) selects the best result per benchmark. Iterative prompt mutation accumulates learnings across runs, with deduplication (exact match plus 70% word overlap) to prevent prompt bloat.

Overview

44 benchmarks · 566 execution tests · 35 domains · 6 tiers · 70/30 exec / keyword split

Benchmark Catalog

Methodology

Execution Grading

Each benchmark includes a bun:test harness. LLM output is extracted, written to a temp directory, and executed. Tests verify behavior, not surface patterns. Multi-file benchmarks use // filename.ts delimiters.
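The multi-file split described above can be sketched roughly as follows. This is an illustrative sketch, not the repo's actual implementation; `splitFiles` and the default filename are hypothetical.

```typescript
// Hypothetical sketch of multi-file extraction: split one generated code blob
// into separate files on "// filename.ts"-style delimiter lines.

function splitFiles(blob: string): Map<string, string> {
  const files = new Map<string, string>();
  let current = "main.ts"; // default target when no delimiter precedes the code
  let lines: string[] = [];
  const flush = () => {
    if (lines.length) files.set(current, lines.join("\n"));
  };
  for (const line of blob.split("\n")) {
    // A delimiter is a comment containing only a filename, e.g. "// utils.ts"
    const m = line.match(/^\/\/\s*([\w./-]+\.\w+)\s*$/);
    if (m) {
      flush();
      current = m[1];
      lines = [];
    } else {
      lines.push(line);
    }
  }
  flush();
  return files;
}
```

Ordinary `//` comments with prose after them do not match the delimiter pattern, so only bare filename comments start a new file.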

# Combined grade formula
score = 0.7 * execution + 0.3 * keyword
# Temperature sweep
best = max(score @ t=0.3, t=0.5, t=0.7)
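The formula and sweep above can be sketched in TypeScript. Function and type names here (`combinedScore`, `bestOfSweep`, `Result`) are illustrative, not the repo's API.

```typescript
// Sketch of the 70/30 combined grade and best-of-3 temperature sweep.

type Result = { temperature: number; execution: number; keyword: number };

// Combined grade: 70% execution score + 30% keyword coverage (both in [0, 1]).
function combinedScore(execution: number, keyword: number): number {
  return 0.7 * execution + 0.3 * keyword;
}

// Temperature sweep: keep the best-scoring result across 0.3, 0.5, 0.7.
function bestOfSweep(results: Result[]): Result {
  return results.reduce((best, r) =>
    combinedScore(r.execution, r.keyword) >
    combinedScore(best.execution, best.keyword)
      ? r
      : best
  );
}
```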

Autoresearch Loop

Iterative improvement: benchmarks are executed, failures are analyzed, and the system prompt is mutated with learnings. Mutations accumulate across iterations with deduplication (exact match plus 70% word overlap). The loop converges when all benchmarks pass or no new mutations are produced.

# Autoresearch pipeline
1. Run all benchmarks (temp sweep)
2. Analyze failures (exec errors + keywords)
3. Derive mutations (pattern-specific)
4. Append to system prompt (deduplicated)
5. Repeat until convergence or max_iter
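The deduplication step in the pipeline above can be sketched as follows. This is a sketch under assumptions: the names `wordOverlap` and `isDuplicate` are illustrative, and the exact overlap denominator used by the repo is not stated in the source.

```typescript
// Sketch of mutation deduplication: drop a new learning if it exactly matches
// an existing one, or shares >= 70% of its words with one.

function wordOverlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wa.size === 0 || wb.size === 0) return 0;
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;
  // Overlap relative to the smaller set, so near-subsets count as duplicates.
  return shared / Math.min(wa.size, wb.size);
}

function isDuplicate(existing: string[], candidate: string): boolean {
  return existing.some(
    (e) => e === candidate || wordOverlap(e, candidate) >= 0.7
  );
}
```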

Results

All results are from local Ollama models (qwen3.5:latest, devstral:latest, qwen3:14b) on an Apple M2 Max, with an OpenRouter free-tier fallback (gemini-2.5-flash:free). Zero API cost. Experience-based routing selects the best-performing model per domain.

Scores are best-of-3 temperatures (0.3, 0.5, 0.7). Passing threshold: 80/100.
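Experience-based routing can be sketched as picking, per domain, the model with the best historical average score. This is a minimal sketch assuming a flat run history; `routeModel` and the `Run` shape are hypothetical, not the repo's API.

```typescript
// Sketch: route each domain to the model with the best historical average.

type Run = { model: string; domain: string; score: number };

function routeModel(history: Run[], domain: string, fallback: string): string {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of history) {
    if (r.domain !== domain) continue;
    const t = totals.get(r.model) ?? { sum: 0, n: 0 };
    t.sum += r.score;
    t.n += 1;
    totals.set(r.model, t);
  }
  // No experience for this domain yet: use the fallback model.
  let best = fallback;
  let bestAvg = -Infinity;
  for (const [model, { sum, n }] of totals) {
    const avg = sum / n;
    if (avg > bestAvg) {
      bestAvg = avg;
      best = model;
    }
  }
  return best;
}
```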


Tier 1 (Fundamentals)

Iteration 1: 100 (5/5 passed)

Tier 2 (Fullstack)

Iteration 1: 48 (1/4 passed)
Iteration 3: 68 (3/4 passed)
Iteration 6: 71 (3/4 passed)
Iteration 8: 52 (1/4 passed)

Tier 3 (Agentic)

Iteration 1: 60 (3/7 passed)
Iteration 2: 55 (3/7 passed)
Iteration 3: 55 (3/7 passed)
Iteration 8: 33 (1/7 passed)

Tier 4 (UI/CSS)

Iteration 1: 98 (8/8 passed)
Iteration 8: 95 (8/8 passed)

Tier 5 (Long-Horizon)

Iteration 1: 59 (1/5 passed)
Iteration 2: 68 (2/5 passed)

Tier 6 (Battle Test)

Iteration 1: 69 (8/15 passed)
Iteration 2: 74 (9/15 passed)
Iteration 8: 51 (4/15 passed)

Domain Coverage

Concurrency (1), Resource Management (1), Error Handling (1), Validation (1), Data Structures (1), Backend (1), Distributed Systems (1), Frontend Architecture (1), Full Stack (1), Agent Orchestration (1), ETL (1), Logic (1), Data Modeling (1), Information Retrieval (1), Transpilation (1), ML Infrastructure (1), CSS (4), CSS Transforms (1), Animation (1), Accessibility (1), Layout (1), Code Quality (1), Data Engineering (2), Infrastructure (3), Developer Tools (2), Security (2), Computer Science (1), Finance (1), Marketing (2), DevOps (1), Design Systems (1), Media (1), Music (1), Analytics (1), Consulting (1)

System Architecture

Model Layer (Local + Free Fallback)
qwen3.5:latest · devstral:latest · qwen3:14b · gemini-2.5-flash:free · Experience-Based Routing
Harness v2
Temperature Sweep (0.3, 0.5, 0.7) · Few-Shot per Category · Multi-File Extraction
Execution Grader · SWE-Bench Style
Extract code blocks
Multi-file: // filename.ts
Write to /tmp
HTML: index.html
bun test (15s timeout)
Parse pass/fail
Combined Score
0.7 × exec + 0.3 × keyword · Pass threshold: 80
Autoresearch (Iterative Mutation)
Analyze failures → derive learnings → append to system prompt · Dedup: exact + 70% word overlap

Reproduction

# Clone and install
git clone https://github.com/PodJamz/8gent-code.git && cd 8gent-code
bun install
# Run single pass (all categories)
bun run benchmark:v2
# Run autoresearch loop (specific category)
CATEGORY=battle-test MAX_ITERATIONS=5 bun benchmarks/autoresearch/autoresearch-loop.ts
# Overnight runner (rotates categories, stops at 7AM)
bash benchmarks/autoresearch/overnight-abilities.sh