Kernel Fine-Tuning

Continuous reinforcement learning for local LLMs. Every coding session becomes training data, and GRPO continuously evolves a LoRA adapter on your base model. No cloud. No cost. Your model, your improvements.

Version 1.0 · March 2026 · 8gent Research

Abstract

We present a self-improving kernel architecture for autonomous coding agents. Unlike static model deployments where weights never change, 8gent's kernel fine-tuning loop turns every coding session into a reinforcement learning signal. A transparent proxy captures prompt-response-outcome triples, a free judge model (Gemini Flash) scores quality asynchronously, and GRPO training runs during idle/sleep windows via the MadMax scheduler. Improved LoRA adapters are hot-swapped into Ollama with zero downtime. The result: a model that gets measurably better at your codebase, your patterns, your tooling preferences - entirely local, entirely free.

Architecture

8gent TUI (Bun/Ink)
   │  every session generates training triples: prompt + response + outcome
   ▼
Training Proxy (:30000)
   transparent OpenAI-compatible proxy · captures all conversation traces
   ▼
Judge PRM
   Gemini Flash (free via OpenRouter) · 4-criteria async scoring
   │  scores filtered (skip trivial + perfect)
   ▼
GRPO Trainer
   LoRA fine-tuning via training backend · runs during idle/sleep windows
   │  checkpoint validated against benchmarks
   ▼
Deploy LoRA
   hot-swap adapter into Ollama · zero downtime, auto-rollback on regression
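The pipeline above moves a single data shape end to end: the prompt-response-outcome triple captured by the proxy and scored by the judge. A minimal sketch of that shape (field names are illustrative assumptions, not the actual trace schema):

```typescript
// Illustrative shape of one training triple. Field names are
// assumptions, not the real @8gent/kernel trace schema.
interface TrainingTriple {
  sessionId: string;
  turnIndex: number;
  model: string;                               // e.g. "qwen3:14b"
  prompt: string;
  response: string;
  outcome: "success" | "failure" | "partial";  // execution result
  judgeScore?: number;                         // attached asynchronously by the judge PRM
}

// The proxy captures the triple; the judge fills in judgeScore later.
const triple: TrainingTriple = {
  sessionId: "s-001",
  turnIndex: 3,
  model: "qwen3:14b",
  prompt: "Fix the failing test in utils.ts",
  response: "(patch omitted)",
  outcome: "success",
};
```

Because scoring is fire-and-forget, `judgeScore` is optional: a triple is valid and stored before the judge ever sees it.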

Three-Layer Model Architecture

Eight separates model weights into three composable layers. This design keeps upstream improvements flowing to users while preserving personal adaptations - and ensures no user data ever needs to leave their machine.

L1 · Base Model
qwen3:14b

The upstream open-weight model, frozen and unmodified. Community weights pulled directly from Ollama. This layer is never touched by our training pipeline - it serves as the stable foundation that all fine-tuning builds upon.

L2 · Eight LoRA
eight-1.x-q3:14b

Our centralized fine-tune, trained on internal benchmark suites and curated coding sessions. Released as versioned adapters (e.g. eight-1.2.0-q3:14b). Every user receives the same Eight LoRA - this is what makes the model "Eight" rather than vanilla Qwen.

L3 · Personal LoRA
local adapter

Your personal fine-tune, trained on your coding patterns, tool preferences, and codebase conventions via the kernel loop. This adapter sits on top of the Eight LoRA and never leaves your machine. It is what makes Eight feel like your agent.

Layer Composition

Layer 1: Base Model (frozen)       qwen3:14b
   │  LoRA merge
   ▼
Layer 2: Eight LoRA (versioned)    eight-1.x-q3:14b
   │  LoRA stack
   ▼
Layer 3: Personal LoRA (local)     your adapter

Version Upgrade Flow

When a new Eight version is released (e.g. eight-1.2.0-q3:14b → eight-1.3.0-q3:14b), users are prompted to retrain their personal module on the updated base. The kernel preserves your training data, so retraining is automatic - it typically completes within one MadMax sleep window.

# On new Eight release
1. Pull new Eight LoRA from registry
2. Prompt user: "Retrain personal module?"
3. Replay stored training data on new base
4. Validate via benchmark gate
5. Hot-swap into Ollama
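The five steps above can be sketched as a single flow. All helper functions here are stubs for illustration, not the real @8gent/kernel API:

```typescript
// Hypothetical sketch of the 5-step upgrade flow. Every helper below
// is a stub standing in for the real registry/training/Ollama calls.
const pullEightLora = async (_version: string) => {};      // 1. fetch new Eight LoRA
const promptUser = async (_question: string) => true;      // 2. user consents
const replayTrainingData = async (_version: string) => {}; // 3. retrain personal LoRA
const benchmarkGatePasses = async () => true;              // 4. validation gate
const hotSwapIntoOllama = async (_version: string) => {};  // 5. zero-downtime swap

async function upgradeEight(
  newVersion: string,
): Promise<"upgraded" | "skipped" | "rolled-back"> {
  await pullEightLora(newVersion);
  if (!(await promptUser("Retrain personal module?"))) return "skipped";
  await replayTrainingData(newVersion);
  if (!(await benchmarkGatePasses())) return "rolled-back";
  await hotSwapIntoOllama(newVersion);
  return "upgraded";
}
```

Declining the prompt or failing the benchmark gate exits the flow early, so the currently deployed adapter keeps serving.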

4-Phase Rollout

01 · Proxy (skills_only)
Implemented

The training proxy sits between 8gent and Ollama as a transparent OpenAI-compatible proxy on port 30000. Every prompt-response pair is captured as a conversation trace without adding latency. Health checks and latency overhead monitoring ensure the proxy never degrades the coding experience.

  • Start/stop training proxy process
  • Health checks with configurable timeout
  • Latency overhead monitoring (direct vs proxied)
  • Conversation trace collection via proxy passthrough
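The latency-overhead check can be sketched as a pair of helpers. The endpoints and the 50 ms budget are assumptions, not the shipped defaults:

```typescript
// Sketch of the direct-vs-proxied latency comparison. The 50 ms budget
// and the /v1/models probe endpoint are assumptions for illustration.
const OVERHEAD_BUDGET_MS = 50;

// Pure check: is the proxy's added latency within budget?
function overheadOk(directMs: number, proxiedMs: number): boolean {
  return proxiedMs - directMs <= OVERHEAD_BUDGET_MS;
}

// Time n round-trips to an OpenAI-compatible endpoint and average them.
async function avgLatencyMs(baseUrl: string, runs = 3): Promise<number> {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    await fetch(`${baseUrl}/v1/models`); // lightweight list endpoint
    total += performance.now() - t0;
  }
  return total / runs;
}
```

In use, `avgLatencyMs` would be run against both the direct Ollama port and the proxy port, with `overheadOk` deciding whether to alert.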
02 · Judge Scoring
Implemented

Gemini Flash (free via OpenRouter) scores every response asynchronously using a 4-criteria PRM: execution success (40%), code quality (20%), tool efficiency (20%), and directness (20%). Scoring never blocks the agent loop - it runs fire-and-forget in the background.

  • 4-criteria PRM scoring via Gemini Flash
  • Score distribution tracking (per-model, per-day)
  • 7-day rolling trend analysis
  • Batch scoring for async processing
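The 4-criteria weighting from the text (40/20/20/20) reduces to a simple weighted sum. The 0-to-1 scale per criterion is an assumption; the weights are from this section:

```typescript
// 4-criteria PRM weights as described in the text. Per-criterion
// scores are assumed to be normalized to [0, 1].
interface PrmScores {
  executionSuccess: number; // weight 0.4
  codeQuality: number;      // weight 0.2
  toolEfficiency: number;   // weight 0.2
  directness: number;       // weight 0.2
}

function combinedScore(s: PrmScores): number {
  return (
    0.4 * s.executionSuccess +
    0.2 * s.codeQuality +
    0.2 * s.toolEfficiency +
    0.2 * s.directness
  );
}
```

Weighting execution success highest means a response that compiles and passes its checks outranks a stylistically elegant one that does not.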
03 · GRPO Training
Implemented

Scored responses are filtered by score range (skip trivial and perfect) and collected into GRPO training batches. When a batch reaches capacity, LoRA fine-tuning runs via the training backend. Every checkpoint is validated against the autoresearch benchmark suite before promotion.

  • Score-range filtering for training data quality
  • Automatic training trigger on full batch
  • Checkpoint creation and lifecycle tracking
  • Benchmark validation gate + auto-rollback on regression
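The score-range filter and full-batch trigger can be sketched as follows. The batch size of 32 comes from the MadMax config later in this document; the 0.3-0.9 band is an assumption (the text only says "skip trivial and perfect"):

```typescript
// Sketch of score-range filtering and the full-batch training trigger.
// The 0.3-0.9 band is an assumed example of "skip trivial and perfect";
// batch size 32 matches the MadMax config in this document.
const MIN_SCORE = 0.3; // below: trivial, little learning signal
const MAX_SCORE = 0.9; // above: already near-perfect, little to learn
const BATCH_SIZE = 32;

function trainable(score: number): boolean {
  return score >= MIN_SCORE && score <= MAX_SCORE;
}

// Returns a full batch when enough trainable responses exist, else null.
function collectBatch(scored: { score: number }[]): { score: number }[] | null {
  const batch = scored.filter((s) => trainable(s.score)).slice(0, BATCH_SIZE);
  return batch.length === BATCH_SIZE ? batch : null;
}
```

Filtering both ends of the distribution keeps GRPO focused on responses where there is genuine headroom to improve.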
04 · Production Loop (MadMax)
Implemented

The production loop ties everything together. The MadMax scheduler defers training to idle/sleep windows so it never interrupts active coding. Improved checkpoints are auto-promoted into the model-router experience DB. Health monitoring tracks score trends and alerts on decline.

  • Deferred training scheduler (sleep 23:00-07:00, idle 30min)
  • Auto-promotion into model-router experience DB
  • Health monitoring with score trend alerts
  • Graceful degradation when components unavailable

Base Model Selection

Start with qwen3:14b for initial RL runs (most VRAM-friendly, code-native), then graduate to qwen3.5 once the pipeline is validated.

Model            Role             VRAM (LoRA)  Rationale
qwen3:14b        Primary          ~12 GB       Purpose-built for code, 14B sweet spot for LoRA. Most VRAM-friendly option for initial RL runs.
qwen3.5:latest   Graduate target  ~18 GB       Strongest coding benchmarks, TUI default. Graduate to this once the pipeline is validated.
devstral:latest  Experimental     ~14 GB       Mistral code specialist. Good benchmark diversity for cross-model validation.

Model Versioning

Every model produced by the kernel follows a strict naming convention that encodes its lineage, training generation, and parameter count. This makes checkpoints traceable, comparable, and rollback-safe across the entire fine-tuning lifecycle.

Naming Convention

eight-{major.minor.patch}-q{gen}:{params}
Segment            Meaning                              Example
eight              Model family name                    eight
major.minor.patch  SemVer training version              1.0.0
q{gen}             Base model generation (Qwen series)  q3 (Qwen 3)
{params}           Parameter count                      14b
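The convention is regular enough to parse mechanically. A sketch (the regex is an assumption derived from the examples in this section, not shipped code):

```typescript
// Parse the eight-{major.minor.patch}-q{gen}:{params} convention.
// The regex is inferred from the examples in this document.
interface EightVersion {
  major: number;
  minor: number;
  patch: number;
  gen: number;    // base model generation, e.g. 3 for Qwen 3
  params: string; // parameter count, e.g. "14b"
}

function parseEightVersion(name: string): EightVersion | null {
  const m = /^eight-(\d+)\.(\d+)\.(\d+)-q(\d+):(\w+)$/.exec(name);
  if (!m) return null;
  return { major: +m[1], minor: +m[2], patch: +m[3], gen: +m[4], params: m[5] };
}
```

A parser like this is what makes checkpoints comparable and rollback-safe: two names can be ordered by (gen, major, minor, patch) without string heuristics.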

Version Bump Criteria

Version increments are determined by the Gemini Flash judge (free via OpenRouter), which scores every checkpoint against the benchmark suite. The judge evaluates execution success, code quality, tool efficiency, and directness using its 4-criteria PRM.

PATCH: LoRA checkpoint passes the validation gate with a score improvement < 5% over baseline. Bug fixes, minor prompt refinements.
MINOR: Benchmark score improves >= 5% on any tier, or a new domain passes for the first time. New capabilities unlocked.
MAJOR: Base model migration (e.g. q3 to q4), architecture change, or training pipeline overhaul. Breaking changes to adapter format.
# Example version progression
eight-1.0.0-q3:14b ← Initial LoRA on Qwen 3 14B
eight-1.0.1-q3:14b ← Patch: minor prompt refinement, +2% Tier 1
eight-1.1.0-q3:14b ← Minor: Tier 5 Security domain now passing
eight-2.0.0-q4:14b ← Major: migrated to Qwen 4 base model
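The bump criteria above reduce to a small decision function. The input shape is illustrative; the real pipeline derives these signals from the judge's benchmark runs:

```typescript
// Sketch of the version-bump decision using the criteria in the text.
// The input shape is an assumption; the real judge derives these
// signals from the benchmark suite.
type Bump = "major" | "minor" | "patch";

function decideBump(opts: {
  baseModelChanged: boolean;   // e.g. q3 -> q4 migration or pipeline overhaul
  newDomainPassing: boolean;   // a domain passes for the first time
  scoreImprovementPct: number; // best improvement across tiers, in percent
}): Bump {
  if (opts.baseModelChanged) return "major";
  if (opts.newDomainPassing || opts.scoreImprovementPct >= 5) return "minor";
  return "patch";
}
```

Checking the major condition first mirrors the convention: a base migration always dominates, regardless of how scores moved.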

MadMax Scheduling

Training Windows

Training never interrupts active coding. The MadMax scheduler detects two training windows: the sleep window (23:00–07:00 local time) and the idle window (30 minutes of no user activity). When either condition is met and a training batch is ready, GRPO runs automatically.

# MadMax schedule config
sleep_window: 23:00 – 07:00
idle_threshold: 30 minutes
batch_size: 32 scored responses
lora_rank: 32
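The window check itself is small; the only subtlety is that the sleep window wraps past midnight. A sketch using the config values above:

```typescript
// Sketch of MadMax window detection using the config above.
// The 23:00-07:00 sleep window wraps midnight; idle threshold is 30 min.
const SLEEP_START_H = 23;
const SLEEP_END_H = 7;
const IDLE_THRESHOLD_MS = 30 * 60 * 1000;

function inSleepWindow(now: Date): boolean {
  const h = now.getHours();
  return h >= SLEEP_START_H || h < SLEEP_END_H; // wraps past midnight
}

function isIdle(lastActivity: Date, now: Date): boolean {
  return now.getTime() - lastActivity.getTime() >= IDLE_THRESHOLD_MS;
}

// GRPO runs only when a batch is ready AND a window is open.
function canTrain(now: Date, lastActivity: Date, batchReady: boolean): boolean {
  return batchReady && (inSleepWindow(now) || isIdle(lastActivity, now));
}
```

Requiring both a ready batch and an open window means a half-full batch never wakes the trainer, and a full batch never interrupts an active session.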

Validation Gate

Every trained checkpoint must pass the benchmark validation gate before promotion. The autoresearch suite runs against the fine-tuned model, and scores are compared to the pre-training baseline. If any regression is detected, the checkpoint is automatically rolled back and the base model continues serving.

# Validation pipeline
1. Train LoRA checkpoint
2. Run autoresearch benchmark suite
3. Compare scores vs pre-training baseline
4. Regression? Auto-rollback
5. Improvement? Promote + log to CHANGELOG
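Step 3 and the rollback decision can be sketched as a per-benchmark comparison. The strict "any regression rolls back" rule follows the text; the score map shape is an assumption:

```typescript
// Sketch of the regression gate: candidate must match or beat the
// pre-training baseline on every benchmark. The Record shape is an
// assumption for illustration.
function shouldPromote(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
): boolean {
  return Object.keys(baseline).every(
    (bench) => (candidate[bench] ?? 0) >= baseline[bench],
  );
}
```

Treating a missing candidate score as 0 makes an incomplete benchmark run fail the gate rather than silently pass, which is the safe default for auto-rollback.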

Training Data Sources

Source                       Cadence / Signal      Training Data
Live coding sessions         every session         conversation traces
Autoresearch benchmark runs  batch after each run  scored solutions
Bug fix sessions             high signal           error → fix pairs
Tool call sequences          pattern learning      action planning

Safety Rails

01 · Checkpoint before every LoRA swap - always rollback-able
02 · Benchmark gate - new weights must match or beat baseline on autoresearch suite
03 · Deferred training scheduler - training never happens during active sessions
04 · LoRA isolation - base model weights never modified, only adapter layers
05 · A/B routing - model-router can split traffic between base and fine-tuned to measure real impact
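Rail 05 needs deterministic assignment so a session stays on one arm for its whole lifetime. A sketch (the hash and the 10% default split are assumptions, not the model-router's actual scheme):

```typescript
// Sketch of deterministic A/B routing between base and fine-tuned
// models. The simple 31-hash and 10% default split are assumptions,
// not the real model-router implementation.
function routeModel(
  sessionId: string,
  fineTunedShare = 0.1,
): "base" | "fine-tuned" {
  let h = 0;
  for (const c of sessionId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return (h % 100) / 100 < fineTunedShare ? "fine-tuned" : "base";
}
```

Hashing the session id (rather than rolling a random number per request) keeps every turn of a session on the same model, so judge scores can be attributed cleanly to one arm.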

Integration

One-liner for the Agent Loop

The entire pipeline is exposed through a single method call. After each agent response, call processTurn() - it handles scoring, buffering, scheduling, and promotion automatically. It is safe to call when the kernel is disabled; it simply returns null.

// After each agent response:
const score = await kernel.processTurn(
  sessionId, turnIndex, model,
  prompt, response,
);

// Check health:
const health = kernel.getHealth();
// { healthy, trend, message }

// Get active model (base or fine-tuned); renamed to avoid
// shadowing the `model` passed to processTurn above:
const activeModel = kernel.getActiveModel();

Environment Setup

Enable the kernel by setting training_proxy.enabled in your project config. The proxy URL defaults to localhost:30000. When enabled, the Ollama provider automatically routes through the training proxy.

// .8gent/config.json
{
  "training_proxy": {
    "enabled": true,
    "proxyUrl": "http://localhost:30000",
    "baseModel": "qwen3:14b"
  }
}
# Judge model (free via OpenRouter)
OPENROUTER_API_KEY=sk-or-...
# PRM: google/gemini-2.5-flash:free

Package Structure

All four phases are implemented in the @8gent/kernel package.

File         Phase  Purpose
proxy.ts     1      Training proxy lifecycle - start/stop, health checks, latency monitoring
judge.ts     2      Process reward model scoring - async evaluation, distributions, daily trends
training.ts  3      GRPO batch collection - score filtering, checkpoint validation, auto-rollback
loop.ts      4      Production loop - deferred training scheduler, auto-promotion, health monitoring
manager.ts   All    Unified entry point - reads .8gent/config.json, safe no-op when disabled
index.ts     -      Barrel exports