Execution-graded benchmarks: methodology and results

Why execution matters

Static analysis can tell you if code has syntax errors. LLM-as-judge can tell you if code looks reasonable. Neither tells you if the code actually works.

Execution grading means we take the generated code, run it in a sandboxed environment, and check the output against expected results. The score is the percentage of test assertions that pass.
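As a minimal sketch of that loop, assuming the generated code is a Python script that reads stdin and writes stdout: the helper names `run_generated_code` and `score` are hypothetical, and a child process with a timeout only stands in for a real sandbox, which would also restrict filesystem and network access.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(source: str, stdin_text: str = "", timeout_s: float = 5.0) -> str:
    """Run generated Python source in a child process and return its stdout.

    Assumption: a subprocess with a timeout is a placeholder for a real sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return ""  # a hung program produces no usable output
        return result.stdout


def score(source: str, cases: list[tuple[str, str]]) -> float:
    """Return the percentage of (stdin, expected stdout) cases that pass."""
    if not cases:
        return 0.0
    passed = sum(
        1
        for stdin_text, expected in cases
        if run_generated_code(source, stdin_text).strip() == expected.strip()
    )
    return 100.0 * passed / len(cases)


generated = "print(int(input()) * 2)"
cases = [("2", "4"), ("5", "11")]
print(score(generated, cases))  # 50.0: one of the two cases passes
```

Code that crashes or times out simply fails its cases, so it lowers the score without stopping the run.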

The grading pipeline

Each benchmark follows a consistent structure:

Prompt — a natural language description of the task
Fixture — any data, files, or context the task requires
Expected output — deterministic, testable criteria
Grader — a function that scores 0-100 based on execution results

The grader is the part that sets this approach apart. It does not apply string matching or regular expressions to the source code. It runs the code and checks its behaviour.
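One way to picture the four parts together is the sketch below. The `Benchmark` class, its field names, and the `slugify` task are all hypothetical, and `exec` is used only for brevity: it is not a sandbox. The point is that two differently written solutions get the same score because only their behaviour is checked.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Benchmark:
    """Hypothetical container mirroring prompt, fixture, expected output, grader."""

    prompt: str                        # natural language task description
    fixture: dict[str, Any]            # data or context made available to the code
    expected: list[tuple[tuple, Any]]  # (arguments, expected return value) pairs
    entry_point: str                   # function the generated code must define

    def grade(self, source: str) -> int:
        """Execute the source and score 0-100 on observed behaviour."""
        namespace: dict[str, Any] = dict(self.fixture)
        try:
            exec(source, namespace)  # NOT sandboxed; illustration only
            func = namespace[self.entry_point]
        except Exception:
            return 0
        passed = 0
        for args, want in self.expected:
            try:
                if func(*args) == want:
                    passed += 1
            except Exception:
                pass  # a crashing call counts as a failed check
        return round(100 * passed / len(self.expected)) if self.expected else 0


bench = Benchmark(
    prompt="Write slugify(title): lowercase the words and join them with hyphens.",
    fixture={},
    expected=[(("Hello World",), "hello-world"), (("  A  B ",), "a-b")],
    entry_point="slugify",
)

v1 = "def slugify(title):\n    return '-'.join(title.lower().split())\n"
v2 = (
    "def slugify(title):\n"
    "    words = [w for w in title.split(' ') if w]\n"
    "    return '-'.join(w.lower() for w in words)\n"
)
broken = "def slugify(title):\n    return title.lower()\n"

print(bench.grade(v1), bench.grade(v2), bench.grade(broken))  # 100 100 0
```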

Benchmark tiers

We organise benchmarks into difficulty tiers:

Tier 1 (Easy): Single-file bug fixes and feature implementations. Most capable models score 80-100 consistently.

Tier 2 (Hard): Multi-file fullstack tasks. Scores vary widely, depending on how well the model coordinates changes across files.

Tier 3 (Brutal): Agentic reasoning tasks — tool chaining, dependency planning, state machine construction. Even the best models rarely score above 70 on first pass.

Tier 4 (Visual): UI/CSS benchmarks verified structurally. We check for specific CSS properties, DOM structure, and responsive breakpoints.

Tier 5 (Battle-test): Production-grade multi-file tasks modelled after real SaaS features. 250+ execution tests across 15 benchmarks.
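To illustrate what a structural check for Tier 4 might look like, here is a rough sketch that uses only the Python standard library. The specific checks (a `nav` element, `display: flex`, a `max-width` media query) are invented for the example, and the regular expressions are a simplification: a production grader would more likely use a proper CSS parser.

```python
import re
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Collects tag names and the text of inline <style> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self.css: list[str] = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self.css.append(data)


def structural_checks(html: str) -> dict[str, bool]:
    """Report which of the (hypothetical) structural requirements the page meets."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    css = " ".join(collector.css)
    return {
        "has_nav": "nav" in collector.tags,
        "uses_flex": re.search(r"display\s*:\s*flex\b", css) is not None,
        "has_breakpoint": re.search(r"@media[^{]*max-width\s*:\s*\d+px", css) is not None,
    }


page = (
    "<html><head><style>"
    "nav { display: flex; } "
    "@media (max-width: 768px) { nav { display: block; } }"
    "</style></head><body><nav></nav></body></html>"
)
print(structural_checks(page))
```

Each boolean can be treated as one assertion, so the same percentage-based scoring applies to visual tasks.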

Key findings

Temperature matters more than most people think. On a single benchmark (TC001), the score rose from 24 to 93 when we changed nothing but the temperature, from 0.5 to 0.7.
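A swing like this is easier to interpret when each setting is sampled more than once. The sketch below shows one possible sweep harness. `temperature_sweep` and its `run_once` callback are hypothetical names, and `run_once` is assumed to generate code at the given temperature, grade it, and return the score.

```python
from statistics import mean
from typing import Callable


def temperature_sweep(
    run_once: Callable[[float], float],
    temperatures: list[float],
    repeats: int = 3,
) -> dict[float, dict[str, float]]:
    """Call run_once(temperature) `repeats` times per setting and summarise the scores."""
    summary: dict[float, dict[str, float]] = {}
    for temperature in temperatures:
        scores = [run_once(temperature) for _ in range(repeats)]
        summary[temperature] = {
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
    return summary


def placeholder_run(temperature: float) -> float:
    """Stand-in for a real generate-and-grade call; always returns the same score."""
    return 50.0


print(temperature_sweep(placeholder_run, [0.5, 0.7], repeats=2))
```

Reporting the minimum and maximum alongside the mean helps separate a genuine temperature effect from run-to-run noise.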

Mutation interference is real. A learning that improves the score on one benchmark can lower it on another. We now scope mutations per category, so a change tuned for one category is never applied to benchmarks in another.
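Per-category scoping can be sketched as a registry keyed by category. This is an illustration under assumptions: a mutation is modelled here as a line of text appended to the prompt, and `MutationStore` and `build_prompt` are hypothetical names.

```python
from collections import defaultdict


class MutationStore:
    """Hypothetical registry that keeps mutations separated by benchmark category."""

    def __init__(self) -> None:
        self._by_category: dict[str, list[str]] = defaultdict(list)

    def add(self, category: str, mutation: str) -> None:
        self._by_category[category].append(mutation)

    def for_category(self, category: str) -> list[str]:
        # Return a copy so callers cannot alter the stored list.
        return list(self._by_category.get(category, []))


def build_prompt(base_prompt: str, category: str, store: MutationStore) -> str:
    """Append only the mutations scoped to this benchmark's category."""
    return "\n".join([base_prompt, *store.for_category(category)])


store = MutationStore()
store.add("agentic", "Plan the tool calls before executing any of them.")

print(build_prompt("Fix the failing test.", "fullstack", store))
print(build_prompt("Chain the three tools.", "agentic", store))
```

The fullstack prompt is left unchanged, because the only stored mutation belongs to the agentic category.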

The best single-pass scores across our benchmark suite average around 73% for agentic tasks and 79% for fullstack tasks. That leaves significant room for improvement.
