Execution-Graded Benchmark Suite
44 benchmarks across 6 difficulty tiers and 35 professional domains. Grading is execution-first via bun:test; keyword matching never decides a score on its own.
Abstract
This suite evaluates LLM code generation using SWE-Bench-style execution grading. Each benchmark includes a test harness that the generated code must pass. LLM output is extracted, written to a temp directory, and executed via bun test. The combined grade weighs execution at 70% and keyword coverage at 30%. A temperature sweep (0.3, 0.5, 0.7) selects the best result per benchmark. Iterative prompt mutation accumulates learnings across runs, with deduplication (exact match plus 70% word overlap) to prevent prompt bloat.
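The 70/30 combined grade can be sketched as a single weighted sum; the function and parameter names below are illustrative, not the suite's actual API.

```typescript
// Hypothetical sketch of the combined grade: 70% execution, 30% keyword coverage.
// Both inputs are assumed to be on a 0-100 scale.
function combinedGrade(executionScore: number, keywordCoverage: number): number {
  return 0.7 * executionScore + 0.3 * keywordCoverage;
}
```

A perfect execution run with zero keyword coverage therefore tops out at 70, below the 80-point passing threshold, which keeps execution dominant without making keywords decisive.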
Methodology
Execution Grading
Each benchmark includes a bun:test harness. LLM output is extracted, written to a temp directory, and executed. Tests verify behavior, not surface patterns. Multi-file benchmarks use // filename.ts delimiters.
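The extraction flow above can be sketched as follows; the `// filename.ts` delimiter convention comes from the text, while the helper names and the omission of markdown-fence stripping are simplifications for illustration.

```typescript
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { spawnSync } from "node:child_process";

// Split LLM output into { filename: contents } on "// filename.ts" delimiter lines.
// Lines before the first delimiter (prose, code fences) are ignored in this sketch.
function splitFiles(output: string): Record<string, string> {
  const files: Record<string, string> = {};
  let current: string | null = null;
  for (const line of output.split("\n")) {
    const m = line.match(/^\/\/\s*([\w./-]+\.ts)\s*$/);
    if (m) {
      current = m[1];
      files[current] = "";
      continue;
    }
    if (current !== null) files[current] += line + "\n";
  }
  return files;
}

// Write the extracted files to a temp directory and run the harness there.
// A zero exit code from `bun test` means every test in the harness passed.
function runBenchmark(llmOutput: string): boolean {
  const dir = mkdtempSync(join(tmpdir(), "bench-"));
  for (const [name, body] of Object.entries(splitFiles(llmOutput))) {
    writeFileSync(join(dir, name), body);
  }
  return spawnSync("bun", ["test"], { cwd: dir }).status === 0;
}
```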
Autoresearch Loop
Iterative improvement: benchmarks are executed, failures are analyzed, and the system prompt is mutated with learnings. Mutations accumulate across iterations with deduplication (exact match plus 70% word overlap). The loop converges when every benchmark passes or an iteration produces no new mutations.
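The deduplication rule can be sketched as below; the 70% threshold follows the text, while the overlap metric (shared words relative to the smaller learning) and all names are assumptions for illustration.

```typescript
// Fraction of words shared between two learnings, relative to the smaller one,
// so that a strict subset of an existing learning counts as a duplicate.
function wordOverlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wa.size === 0 || wb.size === 0) return 0;
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;
  return shared / Math.min(wa.size, wb.size);
}

// Append a candidate learning unless it exactly matches, or overlaps >= 70%
// with, an existing one -- the guard against prompt bloat described above.
function addLearning(learnings: string[], candidate: string): string[] {
  const isDup = learnings.some(
    (l) => l === candidate || wordOverlap(l, candidate) >= 0.7,
  );
  return isDup ? learnings : [...learnings, candidate];
}
```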
Results
All results from local Ollama models (qwen3.5:latest, devstral:latest, qwen3:14b) on Apple M2 Max with OpenRouter free-tier fallback (gemini-2.5-flash:free). Zero API cost. Experience-based routing selects the best-performing model per domain.
Scores are the best of three temperature runs (0.3, 0.5, 0.7). Passing threshold: 80/100.
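The best-of-3 selection can be sketched as a simple sweep; `generate` and `grade` are hypothetical callbacks standing in for the model call and the suite's combined grader.

```typescript
const TEMPERATURES = [0.3, 0.5, 0.7];
const PASS_THRESHOLD = 80;

// Run each temperature once and keep the highest-scoring result.
// (Generation is synchronous here for brevity; the real call would be async.)
function bestOfTemperatures(
  generate: (temperature: number) => string,
  grade: (output: string) => number,
): { temperature: number; score: number; passed: boolean } {
  let best = { temperature: TEMPERATURES[0], score: -Infinity };
  for (const temperature of TEMPERATURES) {
    const score = grade(generate(temperature));
    if (score > best.score) best = { temperature, score };
  }
  return { ...best, passed: best.score >= PASS_THRESHOLD };
}
```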