Kernel Fine-Tuning
Continuous reinforcement learning for local LLMs. Every coding session becomes training data, and GRPO continuously evolves a LoRA adapter on your base model. No cloud. No cost. Your model, your improvements.
Abstract
We present a self-improving kernel architecture for autonomous coding agents. Unlike static model deployments where weights never change, 8gent's kernel fine-tuning loop turns every coding session into a reinforcement learning signal. A transparent proxy captures prompt-response-outcome triples, a free judge model (Gemini Flash) scores quality asynchronously, and GRPO training runs during idle/sleep windows via the MadMax scheduler. Improved LoRA adapters are hot-swapped into Ollama with zero downtime. The result: a model that gets measurably better at your codebase, your patterns, your tooling preferences - entirely local, entirely free.
Architecture
Three-Layer Model Architecture
Eight separates model weights into three composable layers. This design keeps upstream improvements flowing to users while preserving personal adaptations - and ensures no user data ever needs to leave their machine.
Layer 1: Base Model
The upstream open-weight model, frozen and unmodified. Community weights pulled directly from Ollama. This layer is never touched by our training pipeline - it serves as the stable foundation that all fine-tuning builds upon.
Layer 2: Eight LoRA
Our centralized fine-tune, trained on internal benchmark suites and curated coding sessions. Released as versioned adapters (e.g. eight-1.2.0-q3:14b). Every user receives the same Eight LoRA - this is what makes the model "Eight" rather than vanilla Qwen.
Layer 3: Personal LoRA
Your personal fine-tune, trained on your coding patterns, tool preferences, and codebase conventions via the kernel loop. This adapter sits on top of the Eight LoRA and never leaves your machine. It is what makes Eight feel like your agent.
Layer Composition
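The three layers compose as an ordered stack, which can be sketched as data. The types and field names below are illustrative assumptions, not the actual @8gent/kernel API:

```typescript
// Hypothetical representation of the three-layer stack. Layer names
// follow the architecture above; the ModelLayer type is illustrative.
interface ModelLayer {
  name: string;
  source: "ollama" | "eight-release" | "local-kernel";
  mutable: boolean; // whether the kernel's training loop may update it
}

// Composition order: base weights first, then the shared Eight LoRA,
// then the personal LoRA trained by the kernel loop.
const layers: ModelLayer[] = [
  { name: "qwen3:14b", source: "ollama", mutable: false },
  { name: "eight-1.2.0-q3:14b", source: "eight-release", mutable: false },
  { name: "personal-lora", source: "local-kernel", mutable: true },
];

// Only the topmost (personal) layer is ever modified locally.
const trainable = layers.filter((l) => l.mutable).map((l) => l.name);
```

The ordering encodes the design guarantee: upstream and Eight releases can be swapped underneath without touching the personal adapter, and the personal adapter never leaves the machine.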
Version Upgrade Flow
When a new Eight version is released (e.g. eight-1.2.0-q3:14b → eight-1.3.0-q3:14b), users are prompted to retrain their personal module on the updated base. The kernel preserves your training data, so retraining runs automatically - typically within one MadMax sleep window.
4-Phase Rollout
Phase 1: Training Proxy
The training proxy sits between 8gent and Ollama as a transparent OpenAI-compatible proxy on port 30000. Every prompt-response pair is captured as a conversation trace without adding latency. Health checks and latency overhead monitoring ensure the proxy never degrades the coding experience.
- ✓ Start/stop training proxy process
- ✓ Health checks with configurable timeout
- ✓ Latency overhead monitoring (direct vs proxied)
- ✓ Conversation trace collection via proxy passthrough
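The latency overhead check can be sketched as a comparison of median round-trip times measured directly against Ollama and through the proxy. The 50 ms budget is an assumed threshold for illustration, not a documented value:

```typescript
// Illustrative latency-overhead check: proxied requests must not add
// more than an assumed budget over direct requests.
const OVERHEAD_BUDGET_MS = 50; // assumption, not from the kernel docs

function median(samples: number[]): number {
  const s = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function proxyOverheadOk(directMs: number[], proxiedMs: number[]): boolean {
  return median(proxiedMs) - median(directMs) <= OVERHEAD_BUDGET_MS;
}
```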
Phase 2: Judge Scoring
Gemini Flash (free via OpenRouter) scores every response asynchronously using a 4-criteria PRM: execution success (40%), code quality (20%), tool efficiency (20%), and directness (20%). Scoring never blocks the agent loop - it runs fire-and-forget in the background.
- ✓ 4-criteria PRM scoring via Gemini Flash
- ✓ Score distribution tracking (per-model, per-day)
- ✓ 7-day rolling trend analysis
- ✓ Batch scoring for async processing
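The 4-criteria weighting can be expressed directly. Only the weights come from the description above; the field names are illustrative, not the actual judge.ts schema, and each criterion is assumed to be judge-scored in [0, 1]:

```typescript
// Weighted PRM score: execution success 40%, code quality 20%,
// tool efficiency 20%, directness 20% (weights as stated above).
interface PrmScores {
  executionSuccess: number; // each criterion assumed in [0, 1]
  codeQuality: number;
  toolEfficiency: number;
  directness: number;
}

function prmScore(s: PrmScores): number {
  return (
    0.4 * s.executionSuccess +
    0.2 * s.codeQuality +
    0.2 * s.toolEfficiency +
    0.2 * s.directness
  );
}
```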
Phase 3: GRPO Training
Scored responses are filtered by score range (skip trivial and perfect) and collected into GRPO training batches. When a batch reaches capacity, LoRA fine-tuning runs via the training backend. Every checkpoint is validated against the autoresearch benchmark suite before promotion.
- ✓ Score-range filtering for training data quality
- ✓ Automatic training trigger on full batch
- ✓ Checkpoint creation and lifecycle tracking
- ✓ Benchmark validation gate + auto-rollback on regression
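Score-range filtering and the full-batch trigger can be sketched as follows. The 0.2/0.9 bounds and the batch handling are assumptions for illustration; the kernel's actual thresholds may differ:

```typescript
// Keep only mid-range responses for GRPO batches: trivially bad and
// near-perfect responses carry little gradient signal. Bounds assumed.
const MIN_SCORE = 0.2;
const MAX_SCORE = 0.9;

function keepForTraining(score: number): boolean {
  return score >= MIN_SCORE && score <= MAX_SCORE;
}

function collectBatch(scores: number[], batchSize: number): number[] | null {
  const kept = scores.filter(keepForTraining);
  // Training triggers only once a full batch is available.
  return kept.length >= batchSize ? kept.slice(0, batchSize) : null;
}
```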
Phase 4: Production Loop
The production loop ties everything together. Training is deferred to idle/sleep windows so it never interrupts active coding. Improved checkpoints are auto-promoted into the model-router experience DB. Health monitoring tracks score trends and alerts on decline.
- ✓ Deferred training scheduler (sleep 23:00-07:00, idle 30min)
- ✓ Auto-promotion into model-router experience DB
- ✓ Health monitoring with score trend alerts
- ✓ Graceful degradation when components unavailable
Base Model Selection
Start with qwen3:14b for initial RL runs (most VRAM-friendly, code-native), then graduate to qwen3.5 once the pipeline is validated.
| Model | Role | VRAM (LoRA) | Rationale |
|---|---|---|---|
| qwen3:14b | Primary | ~12 GB | Purpose-built for code, 14B sweet spot for LoRA. Most VRAM-friendly option for initial RL runs. |
| qwen3.5:latest | Graduate Target | ~18 GB | Strongest coding benchmarks, TUI default. Graduate to this once the pipeline is validated. |
| devstral:latest | Experimental | ~14 GB | Mistral code specialist. Good benchmark diversity for cross-model validation. |
Model Versioning
Every model produced by the kernel follows a strict naming convention that encodes its lineage, training generation, and parameter count. This makes checkpoints traceable, comparable, and rollback-safe across the entire fine-tuning lifecycle.
Naming Convention
| Segment | Meaning | Example |
|---|---|---|
| eight | Model family name | eight |
| major.minor.patch | SemVer training version | 1.0.0 |
| q{gen} | Base model generation (Qwen series) | q3 (Qwen 3) |
| {params} | Parameter count | 14b |
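Parsing the convention can be sketched with a small helper. This is a hypothetical utility, not part of the kernel API:

```typescript
// Parse names like "eight-1.2.0-q3:14b" into their four segments.
interface ModelName {
  family: string;  // model family, e.g. "eight"
  version: string; // SemVer training version, e.g. "1.2.0"
  baseGen: number; // base model generation, e.g. 3 for Qwen 3
  params: string;  // parameter count, e.g. "14b"
}

function parseModelName(name: string): ModelName {
  const m = name.match(/^([a-z0-9]+)-(\d+\.\d+\.\d+)-q(\d+):(\d+b)$/);
  if (!m) throw new Error(`not a kernel model name: ${name}`);
  return { family: m[1], version: m[2], baseGen: Number(m[3]), params: m[4] };
}
```

Because the name encodes lineage, two checkpoints can be compared (same family, same base generation) before their benchmark scores are compared, which is what makes rollback safe.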
Version Bump Criteria
Version increments are determined by the Gemini Flash judge (free via OpenRouter), which scores every checkpoint against the benchmark suite. The judge evaluates execution success, code quality, tool efficiency, and directness using its 4-criteria PRM.
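The document does not specify how judged scores map to major/minor/patch increments, so the mapping below is entirely hypothetical: a sketch of deciding a bump from the average benchmark delta, with invented thresholds:

```typescript
// Hypothetical SemVer bump rule based on judged benchmark improvement.
// The 0.10 and 0.02 thresholds are invented for illustration only.
type Bump = "major" | "minor" | "patch" | "none";

function suggestBump(baselineAvg: number, candidateAvg: number): Bump {
  const delta = candidateAvg - baselineAvg;
  if (delta <= 0) return "none";     // no improvement: no release
  if (delta >= 0.10) return "major"; // large, across-the-board gain
  if (delta >= 0.02) return "minor"; // clear improvement
  return "patch";                    // marginal improvement
}
```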
MadMax Scheduling
Training Windows
Training never interrupts active coding. The MadMax scheduler detects two training windows: the sleep window (23:00–07:00 local time) and the idle window (30 minutes of no user activity). When either condition is met and a training batch is ready, GRPO runs automatically.
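The two window checks can be sketched as below. The 23:00-07:00 and 30-minute figures come from the text above; the function names and idle bookkeeping are illustrative:

```typescript
// MadMax window detection: sleep window crosses midnight, idle window
// is measured from the last user activity.
const IDLE_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

function inSleepWindow(now: Date): boolean {
  const h = now.getHours();
  return h >= 23 || h < 7; // 23:00-07:00 local, crossing midnight
}

function isIdle(lastActivity: Date, now: Date): boolean {
  return now.getTime() - lastActivity.getTime() >= IDLE_THRESHOLD_MS;
}

function canTrain(now: Date, lastActivity: Date, batchReady: boolean): boolean {
  // GRPO runs only when a batch is ready AND a window is open.
  return batchReady && (inSleepWindow(now) || isIdle(lastActivity, now));
}
```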
Validation Gate
Every trained checkpoint must pass the benchmark validation gate before promotion. The autoresearch suite runs against the fine-tuned model, and scores are compared to the pre-training baseline. If any regression is detected, the checkpoint is automatically rolled back and the base model continues serving.
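The promote-or-rollback decision can be sketched as a per-benchmark comparison. Rejecting on any regression matches the text above; the shape of the score records is assumed:

```typescript
// Validation gate: a checkpoint is promoted only if it matches or beats
// the pre-training baseline on every benchmark in the suite.
type BenchScores = Record<string, number>;

function shouldPromote(baseline: BenchScores, candidate: BenchScores): boolean {
  return Object.entries(baseline).every(
    // A benchmark missing from the candidate run counts as a regression.
    ([bench, base]) => (candidate[bench] ?? -Infinity) >= base,
  );
}
```

When `shouldPromote` returns false, the checkpoint is discarded and the previously serving model continues unchanged, which is what makes the loop rollback-safe.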
Training Data Sources
Safety Rails
Integration
One-liner for the Agent Loop
The entire pipeline is exposed through a single method call. After each agent response, call processTurn() - it handles scoring, buffering, scheduling, and promotion automatically. Safe to call when disabled; returns null.
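A hypothetical usage sketch follows. The real processTurn() signature and manager construction may differ, so a stub stands in for the kernel here; only the null-when-disabled behavior comes from the text above:

```typescript
// Stub mirroring the documented contract of the kernel entry point:
// safe to call when disabled, returning null in that case. The real
// @8gent/kernel manager reads .8gent/config.json instead of a flag.
interface TurnResult {
  scored: boolean; // illustrative; the real result shape is unknown
}

class KernelManagerStub {
  constructor(private enabled: boolean) {}

  async processTurn(turn: { prompt: string; response: string }): Promise<TurnResult | null> {
    if (!this.enabled) return null; // disabled: safe no-op
    // Real kernel: judge scoring, batch buffering, scheduling, promotion.
    return { scored: turn.response.length > 0 };
  }
}
```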
Environment Setup
Enable the kernel by setting training_proxy.enabled in your project config. The proxy URL defaults to localhost:30000. When enabled, the Ollama provider automatically routes through the training proxy.
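A minimal config sketch, assuming this key layout inside .8gent/config.json; only training_proxy.enabled and the localhost:30000 default are stated in this document, and the url key name is an assumption:

```json
{
  "training_proxy": {
    "enabled": true,
    "url": "http://localhost:30000"
  }
}
```

With enabled set to false (or the key absent), the kernel is a safe no-op and the Ollama provider talks to Ollama directly.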
Package Structure
All four phases are implemented in the @8gent/kernel package.
| File | Phase | Purpose |
|---|---|---|
| proxy.ts | 1 | Training proxy lifecycle - start/stop, health checks, latency monitoring |
| judge.ts | 2 | Process reward model scoring - async evaluation, distributions, daily trends |
| training.ts | 3 | GRPO batch collection - score filtering, checkpoint validation, auto-rollback |
| loop.ts | 4 | Production loop - deferred training scheduler, auto-promotion, health monitoring |
| manager.ts | All | Unified entry point - reads .8gent/config.json, safe no-op when disabled |
| index.ts | - | Barrel exports |