The problem with vibes-based evaluation
Most AI coding tools demo well. You type a prompt, the model spits out code, and it looks plausible. But looking plausible and actually running are different things.
We watched teams evaluate AI agents by reading the output. That is the equivalent of grading a math test by checking whether the handwriting is neat. It tells you nothing about correctness.
Harness-first development
At 8gent, the harness came before the agent. We built the grading infrastructure first because without it, you cannot tell if changes to the system prompt, model selection, or retrieval pipeline actually improve output quality.
The harness does three things:
- Executes generated code in a sandboxed environment
- Runs deterministic test suites against the output
- Scores results numerically so you can track regressions
Every benchmark has a fixture, an expected output, and a grading function. No human in the loop. No subjective quality assessments.
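The fixture/expected-output/grading-function triple might look like the following sketch. The names (`Benchmark`, `grade_exact`) are assumptions for illustration, not 8gent's actual API; the point is that grading is a pure function, so no human judgment enters the loop.

```python
# Hypothetical shape of a benchmark: fixture, expected output, grading function.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Benchmark:
    name: str
    fixture: str                         # input handed to the agent
    expected: str                        # ground-truth output
    grade: Callable[[str, str], float]   # (actual, expected) -> score in [0, 1]

def grade_exact(actual: str, expected: str) -> float:
    """Deterministic grading: 1.0 on exact match, 0.0 otherwise."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

bench = Benchmark(
    name="reverse-string",
    fixture="Write a function that reverses a string; print it on 'hello'.",
    expected="olleh",
    grade=grade_exact,
)

score = bench.grade("olleh", bench.expected)  # 1.0 -- no human in the loop
```

Real grading functions can be richer (test-suite pass rates, diff-based partial credit), but as long as they are deterministic, scores stay comparable across runs.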
What this means in practice
When we change the system prompt, we run the full benchmark suite. If the average score drops, the change does not ship. This is basic software engineering discipline applied to prompt engineering.
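The shipping gate reduces to a single comparison. A minimal sketch, assuming per-benchmark scores in `[0, 1]` and a simple "no regression on the mean" rule (`should_ship` is an illustrative name):

```python
# Gate a prompt change on benchmark scores: ship only if the mean does not drop.
from statistics import mean

def should_ship(baseline_scores: list[float], candidate_scores: list[float]) -> bool:
    """Compare full-suite averages before and after a change."""
    return mean(candidate_scores) >= mean(baseline_scores)

baseline  = [0.9, 0.8, 1.0]
candidate = [0.9, 0.7, 1.0]   # one benchmark regressed

should_ship(baseline, candidate)  # False: the change does not ship
```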
The harness currently covers 39 benchmarks across 7 categories: bug-fixing, file manipulation, feature implementation, fullstack, agentic reasoning, UI design, and production battle-tests.
The agent is a replaceable component
The harness is the product. The agent — the model, the prompt, the retrieval — is a configuration of the harness. When a better model appears, we swap it in and run the benchmarks. If scores improve, we ship it. If they do not, we do not.
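One way to read "the agent is a configuration of the harness": the harness depends only on a narrow interface, so models, prompts, and retrieval pipelines can be swapped without touching the grading code. The interface and names below are an assumed sketch, not 8gent's actual design.

```python
# The harness codes against an interface; any agent satisfying it is pluggable.
from typing import Callable, Protocol

class Agent(Protocol):
    def generate(self, task: str) -> str: ...

# (fixture, expected, grading function) triples stand in for the benchmark suite.
Grader = Callable[[str, str], float]
BENCHMARKS: list[tuple[str, str, Grader]] = [
    ("Echo the word 'ok'.", "ok", lambda got, want: float(got == want)),
]

def run_suite(agent: Agent) -> float:
    """Average score across the suite; this code never changes when agents do."""
    scores = [grade(agent.generate(fixture), expected)
              for fixture, expected, grade in BENCHMARKS]
    return sum(scores) / len(scores)

class StubAgent:
    """Stand-in for a real model; swapping agents leaves run_suite untouched."""
    def generate(self, task: str) -> str:
        return "ok"

run_suite(StubAgent())  # 1.0
```

Swapping in a new model means writing one new `Agent` implementation and rerunning `run_suite`; the comparison of scores decides whether it ships.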
This is the opposite of how most AI companies operate. They build the agent first and then scramble to evaluate it. We evaluate first and let the evaluation drive what gets built.
---
View the benchmark suite | Clone 8gent-code | Follow @8gentapp on X