Execution-Graded Benchmark Suite

44 benchmarks across 6 difficulty tiers and 35 professional domains. Evaluation is driven by bun:test execution, not keyword matching alone.

v2.1 · Last updated: 2026-03-21 · 8gent-code v0.8.0

Abstract

This suite evaluates LLM code generation using SWE-Bench-style execution grading. Each benchmark ships a test harness that the generated code must pass: LLM output is extracted, written to a temp directory, and executed via bun test. The combined grade is 70% execution score plus 30% keyword coverage. A temperature sweep (0.3, 0.5, 0.7) selects the best result per benchmark. Iterative prompt mutation accumulates learnings across runs, with deduplication (exact match plus 70% word overlap) to prevent prompt bloat.

Overview

44 benchmarks · 566 execution tests · 35 domains · 6 tiers · 70/30 exec / keyword split

Benchmark Catalog

Methodology

Execution Grading

Each benchmark includes a bun:test harness. LLM output is extracted, written to a temp directory, and executed. Tests verify behavior, not surface patterns. Multi-file benchmarks use // filename.ts delimiters.
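The multi-file split described above can be sketched roughly as follows. This is an illustrative sketch, not the repo's actual implementation; `splitFiles` and the default filename are hypothetical.

```typescript
// Hypothetical sketch of multi-file extraction: split one generated code blob
// into separate files on "// filename.ts"-style delimiter lines.

function splitFiles(blob: string): Map<string, string> {
  const files = new Map<string, string>();
  let current = "main.ts"; // default target when no delimiter precedes the code
  let lines: string[] = [];
  const flush = () => {
    if (lines.length) files.set(current, lines.join("\n"));
  };
  for (const line of blob.split("\n")) {
    // A delimiter is a comment containing only a filename, e.g. "// utils.ts"
    const m = line.match(/^\/\/\s*([\w./-]+\.\w+)\s*$/);
    if (m) {
      flush();
      current = m[1];
      lines = [];
    } else {
      lines.push(line);
    }
  }
  flush();
  return files;
}
```

Ordinary `//` comments with prose after them do not match the delimiter pattern, so only bare filename comments start a new file.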

# Combined grade formula
score = 0.7 * execution + 0.3 * keyword
# Temperature sweep
best = max(score @ t=0.3, t=0.5, t=0.7)
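The formula and sweep above can be sketched in TypeScript. Function and type names here (`combinedScore`, `bestOfSweep`, `Result`) are illustrative, not the repo's API.

```typescript
// Sketch of the 70/30 combined grade and best-of-3 temperature sweep.

type Result = { temperature: number; execution: number; keyword: number };

// Combined grade: 70% execution score + 30% keyword coverage (both in [0, 1]).
function combinedScore(execution: number, keyword: number): number {
  return 0.7 * execution + 0.3 * keyword;
}

// Temperature sweep: keep the best-scoring result across 0.3, 0.5, 0.7.
function bestOfSweep(results: Result[]): Result {
  return results.reduce((best, r) =>
    combinedScore(r.execution, r.keyword) >
    combinedScore(best.execution, best.keyword)
      ? r
      : best
  );
}
```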

Autoresearch Loop

Iterative improvement: benchmarks are executed, failures are analyzed, and the system prompt is mutated with learnings. Mutations accumulate across iterations with deduplication (exact match plus 70% word overlap). The loop converges when all benchmarks pass or no new mutations are produced.

# Autoresearch pipeline
1. Run all benchmarks (temp sweep)
2. Analyze failures (exec errors + keywords)
3. Derive mutations (pattern-specific)
4. Append to system prompt (deduplicated)
5. Repeat until convergence or max_iter
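The deduplication step in the pipeline above can be sketched as follows. This is a sketch under assumptions: the names `wordOverlap` and `isDuplicate` are illustrative, and the exact overlap denominator used by the repo is not stated in the source.

```typescript
// Sketch of mutation deduplication: drop a new learning if it exactly matches
// an existing one, or shares >= 70% of its words with one.

function wordOverlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wa.size === 0 || wb.size === 0) return 0;
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;
  // Overlap relative to the smaller set, so near-subsets count as duplicates.
  return shared / Math.min(wa.size, wb.size);
}

function isDuplicate(existing: string[], candidate: string): boolean {
  return existing.some(
    (e) => e === candidate || wordOverlap(e, candidate) >= 0.7
  );
}
```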

Results

All results are from local Ollama models (qwen3.5:latest, devstral:latest, qwen3:14b) on an Apple M2 Max, with an OpenRouter free-tier fallback (gemini-2.5-flash:free). Zero API cost. Experience-based routing selects the best-performing model per domain.

Scores are best-of-3 temperatures (0.3, 0.5, 0.7). Passing threshold: 80/100.
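Experience-based routing can be sketched as picking, per domain, the model with the best historical average score. This is a minimal sketch assuming a flat run history; `routeModel` and the `Run` shape are hypothetical, not the repo's API.

```typescript
// Sketch: route each domain to the model with the best historical average.

type Run = { model: string; domain: string; score: number };

function routeModel(history: Run[], domain: string, fallback: string): string {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const r of history) {
    if (r.domain !== domain) continue;
    const t = totals.get(r.model) ?? { sum: 0, n: 0 };
    t.sum += r.score;
    t.n += 1;
    totals.set(r.model, t);
  }
  // No experience for this domain yet: use the fallback model.
  let best = fallback;
  let bestAvg = -Infinity;
  for (const [model, { sum, n }] of totals) {
    const avg = sum / n;
    if (avg > bestAvg) {
      bestAvg = avg;
      best = model;
    }
  }
  return best;
}
```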


Tier 1 (Fundamentals)

Iteration 1: 100 (5/5 passed)

Tier 2 (Fullstack)

Iteration 1: 48 (1/4 passed)
Iteration 3: 68 (3/4 passed)
Iteration 6: 71 (3/4 passed)
Iteration 8: 52 (1/4 passed)

Tier 3 (Agentic)

Iteration 1: 60 (3/7 passed)
Iteration 2: 55 (3/7 passed)
Iteration 3: 55 (3/7 passed)
Iteration 8: 33 (1/7 passed)

Tier 4 (UI/CSS)

Iteration 1: 98 (8/8 passed)
Iteration 8: 95 (8/8 passed)

Tier 5 (Long-Horizon)

Iteration 1: 59 (1/5 passed)
Iteration 2: 68 (2/5 passed)

Tier 6 (Battle Test)

Iteration 1: 69 (8/15 passed)
Iteration 2: 74 (9/15 passed)
Iteration 8: 51 (4/15 passed)

Domain Coverage

Concurrency (1), Resource Management (1), Error Handling (1), Validation (1), Data Structures (1), Backend (1), Distributed Systems (1), Frontend Architecture (1), Full Stack (1), Agent Orchestration (1), ETL (1), Logic (1), Data Modeling (1), Information Retrieval (1), Transpilation (1), ML Infrastructure (1), CSS (4), CSS Transforms (1), Animation (1), Accessibility (1), Layout (1), Code Quality (1), Data Engineering (2), Infrastructure (3), Developer Tools (2), Security (2), Computer Science (1), Finance (1), Marketing (2), DevOps (1), Design Systems (1), Media (1), Music (1), Analytics (1), Consulting (1)

System Architecture

Model Layer (Local + Free Fallback)
qwen3.5:latest · devstral:latest · qwen3:14b · gemini-2.5-flash:free · Experience-Based Routing
Harness v2
Temperature Sweep (0.3, 0.5, 0.7) · Few-Shot per Category · Multi-File Extraction
Execution Grader · SWE-Bench Style
Extract code blocks
Multi-file: // filename.ts
Write to /tmp
HTML: index.html
bun test (15s timeout)
Parse pass/fail
Combined Score
0.7 × exec + 0.3 × keyword · Pass threshold: 80
Autoresearch (Iterative Mutation)
Analyze failures → derive learnings → append to system prompt · Dedup: exact + 70% word overlap

Reproduction

# Clone and install
git clone https://github.com/PodJamz/8gent-code.git && cd 8gent-code
bun install
# Run single pass (all categories)
bun run benchmark:v2
# Run autoresearch loop (specific category)
CATEGORY=battle-test MAX_ITERATIONS=5 bun benchmarks/autoresearch/autoresearch-loop.ts
# Overnight runner (rotates categories, stops at 7AM)
bash benchmarks/autoresearch/overnight-abilities.sh