This is the first daily dev log. I am going to write these every day, no matter how messy the session was. Today was not messy. Today was violent.
The overnight benchmarks
I set the autoresearch loop running before bed. Three categories: long-horizon battle-tests, agentic reasoning, and fullstack. 39 benchmarks total, iterating with mutations.
Ollama crashed at 3am. I know this because Eight logged it, restarted the daemon, and resumed from the last checkpoint. One benchmark — CB001 (circuit breaker pattern) — jumped from 17 to 96 across iterations. Another — TC001 (task coordinator) — regressed from 93 to 26 because a mutation that helped CB001 poisoned TC001's context. This is the core problem with autoresearch: mutation interference. A learning that helps one benchmark can destroy another.
Current best scores across agentic tier: TC001: 93, DP001: 56, RE001: 41, SD001: 72, AR001: 72, CB001: 96, MR001: 84. Average best: 73. Not great. Not terrible.
8gent v0.8.0
Tagged and pushed. Eight core abilities documented. README, CHANGELOG, and SOUL.md all updated. The version number is aspirational — the agent is maybe a 0.3 in terms of what I want it to be. But the harness is real, the benchmarks are real, and the scores are trackable. That is the point.
The Karpathy session
Watched Andrej Karpathy's No Priors interview. Took notes. Extracted 10 architectural patterns he described. Then Eight and I built 8 of them. Seven agents running simultaneously across the codebase:
- Ability scorecards — each skill gets a measurable score, tracked over time
- Meta-optimizer — watches benchmark runs, suggests prompt mutations
- Macro actions — compound tool calls that chain together (like Unix pipes for agents)
- Actuators — typed interfaces between the agent and external systems
- Throughput tracker — tokens/sec, latency, cost per benchmark
- Curriculum skills — benchmarks ordered by difficulty, agent levels up
- Persona mutation — system prompt personality evolves based on task type
- Telegram portal — inline keyboards, callback queries, /run and /panel commands
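The macro-actions idea can be sketched in a few lines. This is my illustration of the pipe analogy, not 8gent's actual API — `Tool`, `pipe`, and the toy `grep` are all hypothetical names:

```typescript
// Hypothetical sketch: a macro action composes single-step tools the way
// Unix pipes compose commands. A Tool is just text in, text out.
type Tool = (input: string) => string;

// pipe(a, b, c) returns one compound tool that feeds each output
// into the next stage — the "Unix pipes for agents" idea.
const pipe = (...stages: Tool[]): Tool =>
  (input) => stages.reduce((acc, stage) => stage(acc), input);

// Two toy tools standing in for `cat` and `grep`.
const cat = (s: string): string => s; // identity passthrough
const grep = (pattern: string): Tool =>
  (s) => s.split("\n").filter((line) => line.includes(pattern)).join("\n");

// A macro action built from the primitives.
const findErrors = pipe(cat, grep("ERROR"));
```

The payoff is that the agent plans one macro call instead of three round-trips, which is where most of the latency in tool loops hides.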
2,979 lines across 15 files. The run() tool alone — a single unified CLI interface inspired by the agent-clip pattern from the ex-Manus lead — handles cat, ls, grep, write, tree, and Unix pipe chains. Two-layer architecture: raw execution, then LLM presentation with binary guards and overflow protection.
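The two-layer split is the interesting part, so here is a minimal sketch of the presentation layer. All names and thresholds here are my assumptions, not the real run() internals:

```typescript
// Layer 1 executes a command and returns raw output; layer 2 (below)
// guards what the LLM actually sees. MAX_CHARS is an assumed budget.
const MAX_CHARS = 4000;

function looksBinary(raw: string): boolean {
  // Crude binary guard: a NUL byte, or too many non-printable characters.
  if (raw.includes("\u0000")) return true;
  const nonPrintable = [...raw].filter((c) => c.charCodeAt(0) < 9).length;
  return raw.length > 0 && nonPrintable / raw.length > 0.1;
}

function present(raw: string): string {
  if (looksBinary(raw)) return "[binary output suppressed]";
  // Overflow protection: truncate rather than flood the context window.
  return raw.length > MAX_CHARS
    ? raw.slice(0, MAX_CHARS) + `\n…[truncated ${raw.length - MAX_CHARS} chars]`
    : raw;
}
```

The design choice worth stealing: the raw layer never lies to you (full output is still on disk), only the presentation layer edits for the model's benefit.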
KittenTTS and the voice that changed everything
I found the KittenTTS repo while looking for child-friendly TTS options. Tested it. A voice called "Kiki" came through and it was immediately obvious — this was the voice. Warm, clear, slightly playful. Not robotic, not patronizing.
Wired Kiki into Nick OS and 8gent Jr across every communication surface: AAC output, story narration, vocabulary practice, bedtime routines. Voice adoption went from 20% to 100%. My son actually wanted the device to talk. That is the only metric that matters.
Nick OS: The SLT session that became a system
Julie, Nick's speech and language therapist, wrote a 2,400-line assessment report. Most parents read it, feel overwhelmed, and file it. I fed it to Eight.
What came out:
- Supercore 50 — the 50 most functional core words, prioritized by frequency and motor accessibility
- Fitzgerald Key colors — color-coded word categories (people = yellow, actions = green, descriptors = blue)
- Motor planning lock — consistent button positions so muscle memory builds
- Visual Scene Displays — photograph-based scenes where tapping objects produces language
- GLP stage system — Gestalt Language Processing stages 1 through 4, with protected echolalia at stages 1-2
- 500+ vocabulary items — organized by context (home, school, playground, meals)
- Emotional playlists — music matched to regulation states
The GLP engine is the piece I am most proud of. At Stage 1, a child communicates in whole phrases ("let's go!" = I want to leave). The system protects these gestalts instead of breaking them into words. At Stages 3-4, it begins offering mix-and-match word combinations. Most AAC apps get this wrong — they force single-word selection from day one. That is like teaching someone a foreign language by starting with grammar rules instead of phrases.
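The core rule fits in a few lines. This is an illustrative sketch of the stage logic, not the Nick OS implementation — `speak` and its parameters are invented for the example:

```typescript
// Stage-aware output: at Stages 1-2 a stored gestalt is returned whole
// (protected echolalia); at Stages 3-4 the system also offers the
// child's own recombinable fragments for mix-and-match.
type GlpStage = 1 | 2 | 3 | 4;

function speak(stage: GlpStage, gestalt: string, fragments: string[]): string[] {
  if (stage <= 2) return [gestalt]; // never split a protected gestalt
  return [gestalt, ...fragments];   // phrase stays available, pieces unlock
}
```

Note that Stages 3-4 still surface the whole phrase — the child loses nothing by leveling up, which is exactly what single-word-first apps get backwards.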
The intelligence layer
Three components that make the AAC system learn:
- GLP-aware sentence engine — knows which stage the child is at, generates appropriate output
- Personalization store — tracks which words are used, which are ignored, which combinations emerge
- Session logger — exports data in SLT-readable format with automated recommendations
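The personalization store is conceptually tiny. A minimal sketch of the idea, with invented names (`PersonalizationStore`, `recordTap`) rather than the real interface:

```typescript
// Counts taps per word so the board can surface what the child
// actually uses, and flag what he is offered but never touches.
class PersonalizationStore {
  private taps = new Map<string, number>();

  recordTap(word: string): void {
    this.taps.set(word, (this.taps.get(word) ?? 0) + 1);
  }

  // Words offered but never tapped: candidates for rotating off the board.
  ignored(offered: string[]): string[] {
    return offered.filter((w) => !this.taps.has(w));
  }

  // Most-used words, for the SLT dashboard view.
  topWords(n: number): string[] {
    return [...this.taps.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n)
      .map(([word]) => word);
  }
}
```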
Julie can now open a dashboard and see exactly which words Nick used this week, which visual scenes he engaged with, and where the system thinks he is on the GLP continuum. No more guessing from a 45-minute clinic session.
Physics simulator
Extracted the core concept from a Physics-Notebook repo. Rebuilt it in 211 lines of pure math. Three modes: particle playground (gravity, collision, trails), projectile launcher (angle, velocity, air resistance), and pendulum (damping, phase portraits). No dependencies. No framework. Just requestAnimationFrame and trigonometry.
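The pendulum mode is the simplest to show. This is my own sketch in the same pure-math spirit, not a excerpt from the 211 lines — the integration scheme (semi-implicit Euler) is an assumption:

```typescript
// Damped pendulum: angular acceleration = -(g/L)·sin(θ) - c·ω.
// Semi-implicit Euler: update velocity first, then position with the
// new velocity — much more stable than naive Euler for oscillators.
function pendulumStep(
  theta: number, omega: number, dt: number,
  g = 9.81, length = 1, damping = 0.1,
): [number, number] {
  const alpha = -(g / length) * Math.sin(theta) - damping * omega;
  const nextOmega = omega + alpha * dt;
  const nextTheta = theta + nextOmega * dt;
  return [nextTheta, nextOmega];
}
```

In the browser version you would call this from a requestAnimationFrame loop and plot (θ, ω) pairs to get the phase portrait: a spiral into the origin when damping is on, a closed loop when it is off.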
Nick will not use this for years. But his older cousins will, and the architecture is there for when curiosity arrives.
The real USP: hyper-personalization
This is the thesis. Every AAC app gives you the same 400 symbols. Every child gets the same grid. That is insane. My son loves Bluey, trains, and a very specific shade of orange. His AAC board should reflect that.
What we built: SchoolTube replaces YouTube (curated, safe, educational). The AI watches what content the child engages with, then generates custom AAC cards in their favorite visual style. It creates vocabulary categories from what they actually care about, not what a committee decided 10 years ago.
The ARASAAC symbol IDs were broken for about 2 hours. Half the generated cards pointed to nonexistent assets. Fixed by falling back to a local symbol cache with fuzzy matching. Also caught and killed a purple gradient that violated the design system. Small things, but they compound.
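The fallback logic is worth spelling out. A sketch of the idea under my own names (`fallbackSymbol`, a plain edit-distance matcher — the real fuzzy matcher may differ):

```typescript
// Classic Levenshtein edit distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
  return dp[a.length][b.length];
}

// When a symbol ID resolves to nothing, pick the closest name from the
// local cache (cache must be non-empty).
function fallbackSymbol(wanted: string, cache: string[]): string {
  return cache.reduce((best, candidate) =>
    editDistance(wanted, candidate) < editDistance(wanted, best) ? candidate : best,
  );
}
```

The point is graceful degradation: a slightly-wrong symbol beats a broken-image icon on a child's communication board every time.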
FoodstackOS SEO primitives
Side quest. Extracted 5 SEO skills from research and turned them into deterministic TypeScript primitives — keyword density, meta tag generation, structured data, internal linking, readability scoring. Wired into the FoodstackOS context injection system so every page gets baseline SEO without manual intervention.
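"Deterministic" here just means no LLM in the loop. One of the five, sketched the way I would write it — the exact signature in FoodstackOS may differ:

```typescript
// Keyword density: occurrences of the keyword over total word count.
// Deterministic and pure, so it can run in context injection on every
// page with zero variance between builds.
function keywordDensity(text: string, keyword: string): number {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  if (words.length === 0) return 0;
  const target = keyword.toLowerCase();
  const hits = words.filter((w) => w === target).length;
  return hits / words.length;
}
```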
This blog
It was broken. The data layer was not wired up, the [slug] routes were returning 404s, and the MDX parser was choking on frontmatter arrays. Fixed all of it. Wrote 4 seed posts. You are reading the output.
The numbers
| Metric | Count |
|--------|-------|
| Lines shipped | ~10,000+ |
| Repos touched | 4 |
| Agents run in parallel | 7 (peak) |
| Benchmark categories | 3 overnight |
| TTS voices tested | 6 (Kiki won) |
| SLT report lines processed | 2,400 |
| Vocabulary items generated | 500+ |
| Physics simulator modes | 3 |
| SEO primitives extracted | 5 |
| Blog posts written | 5 (including this one) |
| Hours | ~18 |
Tomorrow
Per-category mutation scoping for autoresearch — mutations should only affect benchmarks in their own category. The interference problem is the biggest bottleneck. Also: Nick OS field testing with Julie's new vocabulary set, and the first real deployment of the Telegram portal to see if inline keyboards feel right for agent control.
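One possible shape for the scoping fix — purely a sketch of the plan, nothing built yet, all names invented:

```typescript
// Mutations carry a category tag; a benchmark's context only ever
// includes mutations from its own category, so a learning that helps
// CB001 can no longer leak into TC001's prompt and poison it.
interface Mutation {
  category: string; // e.g. "agentic", "fullstack", "long-horizon"
  text: string;
}

function scopedContext(benchCategory: string, mutations: Mutation[]): string[] {
  return mutations
    .filter((m) => m.category === benchCategory)
    .map((m) => m.text);
}
```

The open question is whether some mutations are genuinely cross-category (style fixes usually are), which might argue for a small shared pool plus scoped pools rather than hard isolation.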
Day 1 done. The overnight session shipped more than some teams ship in a sprint. That is not a brag — it is a proof point. One builder and one very opinionated AI can move fast when the harness keeps them honest.