Kernel Fine-Tuning
Continuous reinforcement learning for local LLMs. Every coding session becomes training data, and GRPO continuously evolves a LoRA adapter on your base model. No cloud. No cost. Your model, your improvements.
Abstract
We present a self-improving kernel architecture for autonomous coding agents. Unlike static model deployments where weights never change, 8gent's kernel fine-tuning loop turns every coding session into a reinforcement learning signal. A transparent proxy captures prompt-response-outcome triples, a free judge model (Gemini Flash) scores quality asynchronously, and GRPO training runs during idle/sleep windows via the MadMax scheduler. Improved LoRA adapters are hot-swapped into Ollama with zero downtime. The result: a model that gets measurably better at your codebase, your patterns, your tooling preferences - entirely local, entirely free.
Architecture
Three-Layer Model Architecture
Eight separates model weights into three composable layers. This design keeps upstream improvements flowing to users while preserving personal adaptations - and ensures no user data ever needs to leave their machine.
Layer 1: Base Model
The upstream open-weight model, frozen and unmodified. Community weights pulled directly from Ollama. This layer is never touched by our training pipeline - it serves as the stable foundation that all fine-tuning builds upon.
Layer 2: Eight LoRA
Our centralized fine-tune, trained on internal benchmark suites and curated coding sessions. Released as versioned adapters (e.g. eight-1.2.0-q3:14b). Every user receives the same Eight LoRA - this is what makes the model "Eight" rather than vanilla Qwen.
Layer 3: Personal LoRA
Your personal fine-tune, trained on your coding patterns, tool preferences, and codebase conventions via the kernel loop. This adapter sits on top of the Eight LoRA and never leaves your machine. It is what makes Eight feel like your agent.
Layer Composition
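The three layers compose as an ordered stack, which can be sketched as data. The types and field names below are illustrative assumptions, not the actual @8gent/kernel API:

```typescript
// Hypothetical representation of the three-layer stack. Layer names
// follow the architecture above; the ModelLayer type is illustrative.
interface ModelLayer {
  name: string;
  source: "ollama" | "eight-release" | "local-kernel";
  mutable: boolean; // whether the kernel's training loop may update it
}

// Composition order: base weights first, then the shared Eight LoRA,
// then the personal LoRA trained by the kernel loop.
const layers: ModelLayer[] = [
  { name: "qwen3:14b", source: "ollama", mutable: false },
  { name: "eight-1.2.0-q3:14b", source: "eight-release", mutable: false },
  { name: "personal-lora", source: "local-kernel", mutable: true },
];

// Only the topmost (personal) layer is ever modified locally.
const trainable = layers.filter((l) => l.mutable).map((l) => l.name);
```

The ordering encodes the design guarantee: upstream and Eight releases can be swapped underneath without touching the personal adapter, and the personal adapter never leaves the machine.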
Version Upgrade Flow
When a new Eight version is released (e.g. eight-1.2.0-q3:14b → eight-1.3.0-q3:14b), users are prompted to retrain their personal module on the updated base. The kernel preserves your training data, so retraining runs automatically - typically within one MadMax sleep window.
4-Phase Rollout
Phase 1: Training Proxy
The training proxy sits between 8gent and Ollama as a transparent OpenAI-compatible proxy on port 30000. Every prompt-response pair is captured as a conversation trace without adding latency. Health checks and latency overhead monitoring ensure the proxy never degrades the coding experience.
- ✓ Start/stop training proxy process
- ✓ Health checks with configurable timeout
- ✓ Latency overhead monitoring (direct vs proxied)
- ✓ Conversation trace collection via proxy passthrough
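The latency overhead check can be sketched as a comparison of median round-trip times measured directly against Ollama and through the proxy. The 50 ms budget is an assumed threshold for illustration, not a documented value:

```typescript
// Illustrative latency-overhead check: proxied requests must not add
// more than an assumed budget over direct requests.
const OVERHEAD_BUDGET_MS = 50; // assumption, not from the kernel docs

function median(samples: number[]): number {
  const s = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function proxyOverheadOk(directMs: number[], proxiedMs: number[]): boolean {
  return median(proxiedMs) - median(directMs) <= OVERHEAD_BUDGET_MS;
}
```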
Phase 2: Judge Scoring
Gemini Flash (free via OpenRouter) scores every response asynchronously using a 4-criteria PRM: execution success (40%), code quality (20%), tool efficiency (20%), and directness (20%). Scoring never blocks the agent loop - it runs fire-and-forget in the background.
- ✓ 4-criteria PRM scoring via Gemini Flash
- ✓ Score distribution tracking (per-model, per-day)
- ✓ 7-day rolling trend analysis
- ✓ Batch scoring for async processing
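The 4-criteria weighting can be expressed directly. Only the weights come from the description above; the field names are illustrative, not the actual judge.ts schema, and each criterion is assumed to be judge-scored in [0, 1]:

```typescript
// Weighted PRM score: execution success 40%, code quality 20%,
// tool efficiency 20%, directness 20% (weights as stated above).
interface PrmScores {
  executionSuccess: number; // each criterion assumed in [0, 1]
  codeQuality: number;
  toolEfficiency: number;
  directness: number;
}

function prmScore(s: PrmScores): number {
  return (
    0.4 * s.executionSuccess +
    0.2 * s.codeQuality +
    0.2 * s.toolEfficiency +
    0.2 * s.directness
  );
}
```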
Phase 3: GRPO Training
Scored responses are filtered by score range (skip trivial and perfect) and collected into GRPO training batches. When a batch reaches capacity, LoRA fine-tuning runs via the training backend. Every checkpoint is validated against the autoresearch benchmark suite before promotion.
- ✓ Score-range filtering for training data quality
- ✓ Automatic training trigger on full batch
- ✓ Checkpoint creation and lifecycle tracking
- ✓ Benchmark validation gate + auto-rollback on regression
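Score-range filtering and the full-batch trigger can be sketched as follows. The 0.2/0.9 bounds and the batch handling are assumptions for illustration; the kernel's actual thresholds may differ:

```typescript
// Keep only mid-range responses for GRPO batches: trivially bad and
// near-perfect responses carry little gradient signal. Bounds assumed.
const MIN_SCORE = 0.2;
const MAX_SCORE = 0.9;

function keepForTraining(score: number): boolean {
  return score >= MIN_SCORE && score <= MAX_SCORE;
}

function collectBatch(scores: number[], batchSize: number): number[] | null {
  const kept = scores.filter(keepForTraining);
  // Training triggers only once a full batch is available.
  return kept.length >= batchSize ? kept.slice(0, batchSize) : null;
}
```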
Phase 4: Production Loop
The production loop ties everything together. Training is deferred to idle/sleep windows so it never interrupts active coding. Improved checkpoints are auto-promoted into the model-router experience DB. Health monitoring tracks score trends and alerts on decline.
- ✓ Deferred training scheduler (sleep 23:00-07:00, idle 30min)
- ✓ Auto-promotion into model-router experience DB
- ✓ Health monitoring with score trend alerts
- ✓ Graceful degradation when components unavailable
Base Model Selection
Start with qwen3:14b for initial RL runs (most VRAM-friendly, code-native), then graduate to qwen3.5 once the pipeline is validated.
| Model | Role | VRAM (LoRA) | Rationale |
|---|---|---|---|
| qwen3:14b | Primary | ~12 GB | Purpose-built for code, 14B sweet spot for LoRA. Most VRAM-friendly option for initial RL runs. |
| qwen3.5:latest | Graduate Target | ~18 GB | Strongest coding benchmarks, TUI default. Graduate to this once the pipeline is validated. |
| devstral:latest | Experimental | ~14 GB | Mistral code specialist. Good benchmark diversity for cross-model validation. |
Model Versioning
Every model produced by the kernel follows a strict naming convention that encodes its lineage, training generation, and parameter count. This makes checkpoints traceable, comparable, and rollback-safe across the entire fine-tuning lifecycle.
Naming Convention
| Segment | Meaning | Example |
|---|---|---|
| eight | Model family name | eight |
| major.minor.patch | SemVer training version | 1.0.0 |
| q{gen} | Base model generation (Qwen series) | q3 (Qwen 3) |
| {params} | Parameter count | 14b |
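Parsing the convention can be sketched with a small helper. This is a hypothetical utility, not part of the kernel API:

```typescript
// Parse names like "eight-1.2.0-q3:14b" into their four segments.
interface ModelName {
  family: string;  // model family, e.g. "eight"
  version: string; // SemVer training version, e.g. "1.2.0"
  baseGen: number; // base model generation, e.g. 3 for Qwen 3
  params: string;  // parameter count, e.g. "14b"
}

function parseModelName(name: string): ModelName {
  const m = name.match(/^([a-z0-9]+)-(\d+\.\d+\.\d+)-q(\d+):(\d+b)$/);
  if (!m) throw new Error(`not a kernel model name: ${name}`);
  return { family: m[1], version: m[2], baseGen: Number(m[3]), params: m[4] };
}
```

Because the name encodes lineage, two checkpoints can be compared (same family, same base generation) before their benchmark scores are compared, which is what makes rollback safe.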
Version Bump Criteria
Version increments are determined by the Gemini Flash judge (free via OpenRouter), which scores every checkpoint against the benchmark suite. The judge evaluates execution success, code quality, tool efficiency, and directness using its 4-criteria PRM.
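The document does not specify how judged scores map to major/minor/patch increments, so the mapping below is entirely hypothetical: a sketch of deciding a bump from the average benchmark delta, with invented thresholds:

```typescript
// Hypothetical SemVer bump rule based on judged benchmark improvement.
// The 0.10 and 0.02 thresholds are invented for illustration only.
type Bump = "major" | "minor" | "patch" | "none";

function suggestBump(baselineAvg: number, candidateAvg: number): Bump {
  const delta = candidateAvg - baselineAvg;
  if (delta <= 0) return "none";     // no improvement: no release
  if (delta >= 0.10) return "major"; // large, across-the-board gain
  if (delta >= 0.02) return "minor"; // clear improvement
  return "patch";                    // marginal improvement
}
```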
MadMax Scheduling
Training Windows
Training never interrupts active coding. The MadMax scheduler detects two training windows: the sleep window (23:00–07:00 local time) and the idle window (30 minutes of no user activity). When either condition is met and a training batch is ready, GRPO runs automatically.
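The two window checks can be sketched as below. The 23:00-07:00 and 30-minute figures come from the text above; the function names and idle bookkeeping are illustrative:

```typescript
// MadMax window detection: sleep window crosses midnight, idle window
// is measured from the last user activity.
const IDLE_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

function inSleepWindow(now: Date): boolean {
  const h = now.getHours();
  return h >= 23 || h < 7; // 23:00-07:00 local, crossing midnight
}

function isIdle(lastActivity: Date, now: Date): boolean {
  return now.getTime() - lastActivity.getTime() >= IDLE_THRESHOLD_MS;
}

function canTrain(now: Date, lastActivity: Date, batchReady: boolean): boolean {
  // GRPO runs only when a batch is ready AND a window is open.
  return batchReady && (inSleepWindow(now) || isIdle(lastActivity, now));
}
```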
Validation Gate
Every trained checkpoint must pass the benchmark validation gate before promotion. The autoresearch suite runs against the fine-tuned model, and scores are compared to the pre-training baseline. If any regression is detected, the checkpoint is automatically rolled back and the base model continues serving.
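The promote-or-rollback decision can be sketched as a per-benchmark comparison. Rejecting on any regression matches the text above; the shape of the score records is assumed:

```typescript
// Validation gate: a checkpoint is promoted only if it matches or beats
// the pre-training baseline on every benchmark in the suite.
type BenchScores = Record<string, number>;

function shouldPromote(baseline: BenchScores, candidate: BenchScores): boolean {
  return Object.entries(baseline).every(
    // A benchmark missing from the candidate run counts as a regression.
    ([bench, base]) => (candidate[bench] ?? -Infinity) >= base,
  );
}
```

When `shouldPromote` returns false, the checkpoint is discarded and the previously serving model continues unchanged, which is what makes the loop rollback-safe.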
Training Data Sources
Safety Rails
Integration
One-liner for the Agent Loop
The entire pipeline is exposed through a single method call. After each agent response, call processTurn() - it handles scoring, buffering, scheduling, and promotion automatically. Safe to call when disabled; returns null.
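A hypothetical usage sketch follows. The real processTurn() signature and manager construction may differ, so a stub stands in for the kernel here; only the null-when-disabled behavior comes from the text above:

```typescript
// Stub mirroring the documented contract of the kernel entry point:
// safe to call when disabled, returning null in that case. The real
// @8gent/kernel manager reads .8gent/config.json instead of a flag.
interface TurnResult {
  scored: boolean; // illustrative; the real result shape is unknown
}

class KernelManagerStub {
  constructor(private enabled: boolean) {}

  async processTurn(turn: { prompt: string; response: string }): Promise<TurnResult | null> {
    if (!this.enabled) return null; // disabled: safe no-op
    // Real kernel: judge scoring, batch buffering, scheduling, promotion.
    return { scored: turn.response.length > 0 };
  }
}
```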
Environment Setup
Enable the kernel by setting training_proxy.enabled in your project config. The proxy URL defaults to localhost:30000. When enabled, the Ollama provider automatically routes through the training proxy.
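A minimal config sketch, assuming this key layout inside .8gent/config.json; only training_proxy.enabled and the localhost:30000 default are stated in this document, and the url key name is an assumption:

```json
{
  "training_proxy": {
    "enabled": true,
    "url": "http://localhost:30000"
  }
}
```

With enabled set to false (or the key absent), the kernel is a safe no-op and the Ollama provider talks to Ollama directly.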
Package Structure
All four phases are implemented in the @8gent/kernel package.
| File | Phase | Purpose |
|---|---|---|
| proxy.ts | 1 | Training proxy lifecycle - start/stop, health checks, latency monitoring |
| judge.ts | 2 | Process reward model scoring - async evaluation, distributions, daily trends |
| training.ts | 3 | GRPO batch collection - score filtering, checkpoint validation, auto-rollback |
| loop.ts | 4 | Production loop - deferred training scheduler, auto-promotion, health monitoring |
| manager.ts | All | Unified entry point - reads .8gent/config.json, safe no-op when disabled |
| index.ts | - | Barrel exports |