What Five Agents Actually Do All Day
I’ve been running AI agents at home for a few months. No formal training, no lab budget, no peer review process. What I do have is a lot of logs.
This is what happens when you stop eyeballing whether things are working and actually measure them.
The Setup
Five always-on agents — Kato, CJ, Mike, Morty, Sabrina — a Beelink EQi12 home server running everything, and until two weeks ago, a CPU handling all local inference. The model throughout: Gemma 4 26B Q4_K_M on llama.cpp.
I’m not going to pretend I knew what I was doing when I started. I stood these things up piece by piece, mostly late at night, mostly by reading error messages. The benchmarks came later, when I got tired of guessing.
Inference Speed: The CPU Era Was Borderline Usable
Seventeen daily logs from the home server, April 2–18. The numbers:
Average generation: 10.49 tok/s
Range: 5.0 – 11.7 tok/s, with low outliers on high-load days
Time to first token: 616ms on a good day, over 2 seconds when the machine was sweating
Swap used: 4.6 – 7.7 GiB every single day
That last one is the problem. A 32GB machine running a 15GB model and still hitting 7GB of swap means something else was also using memory — other agents, OS overhead, the fact that this box runs everything. It worked, but barely. On bad days you could feel it.
Then the RTX 5060 Ti arrived.
One benchmark run today: 74 tok/s. Same model, same weights, 15.4GB in VRAM, 51°C under load, zero swap. That’s roughly 7x the CPU average. The slow days on CPU feel prehistoric now.
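If you want to reproduce the measurement, the crude version is to time one request against the server and divide tokens generated by wall-clock seconds. A sketch, assuming the model sits behind llama-server's OpenAI-compatible endpoint on localhost; adjust the URL and prompt to your setup:

```python
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumption: llama-server default port
payload = {
    "model": "gemma",  # llama-server serves one model; the name here is mostly cosmetic
    "messages": [{"role": "user", "content": "Summarize the plot of Moby-Dick."}],
    "max_tokens": 512,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

# Wall-clock time includes prompt processing, so this slightly
# understates pure generation speed.
generated = resp["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} tok/s")
```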
Tonight I’m installing a second 5060 Ti. Matched pair, 32GB VRAM combined. Expected throughput: 140–150 tok/s. I’ll update the benchmarks page when I have real numbers.
Mike’s Memory: Honest About What Broke
Mike is my research agent. Always on, running autonomous research threads, posting to IRC and Moltbook, maintaining a long-term memory of everything we’ve talked about. Testing whether that memory actually works meant running LongMemEval — a benchmark that stuffs 500+ messages of synthetic conversation history into an agent and asks factual recall questions about them.
The results have a story arc.
Baseline run (no context injection): 0/25. I ran the benchmark without actually injecting the history first. Mike answered “I don’t know” to every single question, which is correct behavior — he had no data. Not a failure, just what 0% looks like when you haven’t fed the system yet.
Best context-window run: 21/25 (84%). When the full 550-turn history is loaded into context, Mike gets 4 out of 5 right consistently. The failures were mostly facts buried deep in the conversation — the model retrieved the wrong session or just couldn’t find the needle in 490k characters of text.
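For concreteness, context-window mode is about as unsophisticated as it sounds: the entire transcript goes in front of the question and the model sorts it out. A sketch, with the turn format and the LLM call left as placeholders rather than Mike's actual code:

```python
from typing import Callable

def build_prompt(history: list[dict], question: str) -> str:
    # history: one dict per turn; the "speaker" and "text" keys are an assumption.
    transcript = "\n".join(f"{t['speaker']}: {t['text']}" for t in history)
    return (
        "Here is the full conversation history:\n\n"
        f"{transcript}\n\n"
        "Answer using only the history above.\n"
        f"Question: {question}"
    )

def answer(history: list[dict], question: str, llm: Callable[[str], str]) -> str:
    # llm() is one completion call; the request snippet above would do.
    return llm(build_prompt(history, question))
```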
The extraction pipeline: completely broken. The smarter approach — extract facts first, store them, then answer — never worked cleanly. The extraction model (glm-4.7-flash) returned empty JSON responses when given 80k+ character batches. Silently. No error thrown, no facts stored, just nothing. The context-window mode works because it sidesteps this entirely: it brute-forces the full history directly. That’s fine at 500 turns. At 50,000 turns it won’t be.
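I haven't rebuilt the pipeline yet, but the shape of the fix is easy to sketch: never hand the extractor an 80k-character batch, and treat an empty or unparseable response as a hard failure instead of a quiet nothing. The helper names and the chunk size below are placeholders, not Mike's real code:

```python
import json

MAX_CHUNK_CHARS = 8_000  # a guess, not a tuned value

class ExtractionError(RuntimeError):
    pass

def chunks(text: str, size: int = MAX_CHUNK_CHARS) -> list[str]:
    # Naive character slicing; a real version would split on turn boundaries.
    return [text[i : i + size] for i in range(0, len(text), size)]

def extract_all(history_text: str, extract_facts) -> list[dict]:
    # extract_facts(str) -> str is the call to the extraction model,
    # expected to return a JSON array of fact objects.
    facts: list[dict] = []
    for i, piece in enumerate(chunks(history_text)):
        raw = extract_facts(piece)
        try:
            parsed = json.loads(raw)
        except (TypeError, json.JSONDecodeError) as e:
            raise ExtractionError(f"chunk {i}: unparseable response") from e
        if not parsed:
            # The failure mode that bit me: empty JSON, no error raised.
            raise ExtractionError(f"chunk {i}: empty extraction result")
        facts.extend(parsed)
    return facts
```

Failing loudly is the point; a memory pipeline that loses facts quietly is worse than one that crashes.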
Average query latency in context-window mode: 84–117 seconds. That’s not a typo. Loading 500 turns of conversation and reasoning over it takes time. On the new GPU it’ll be faster. On two GPUs, faster still. But the extraction problem is architectural, not hardware. No amount of VRAM fixes a pipeline that silently drops data.
Orchestra: 110/110
Orchestra is a tool I built for capturing and compiling AI conversations into a wiki. The test suite runs in 0.13 seconds and passes clean. This one isn’t interesting to write about because it works, which is exactly what I want from infrastructure.
Frank: The Benchmark That Hasn’t Run Yet
I built a benchmarking harness called Frank. Five test categories, tool-use chains, self-correction tests, persona smoke tests. It’s done. I haven’t run it yet because the CPU numbers weren’t worth publishing. Testing tool-use latency at 10 tok/s tells you nothing useful. The first real run happens after the dual-GPU install tonight.
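For the record, a Frank test case is roughly this shape (field names are illustrative, not the harness's actual schema):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FrankCase:
    category: str                 # e.g. "tool-use", "self-correction", "persona"
    prompt: str                   # what gets sent to the agent
    check: Callable[[str], bool]  # pass/fail judgment on the agent's reply

def run(cases: list[FrankCase], agent: Callable[[str], str]) -> dict[str, int]:
    # Tally passes per category; latency tracking and tool-call inspection
    # are left out of this sketch.
    passed: dict[str, int] = {}
    for case in cases:
        ok = case.check(agent(case.prompt))
        passed[case.category] = passed.get(case.category, 0) + int(ok)
    return passed
```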
I’m publishing this page before that run exists because that’s the point. Building in public means showing the gaps too.
Raw Data
Everything is linked from the benchmarks page: daily inference logs as TSV and JSON, LongMemEval JSONL files for both the baseline (0/25) and best run (21/25). Pull them, run your own analysis, tell me if I’m reading my own numbers wrong.
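To save you the first ten minutes, here's the kind of summary pass I mean. It's a sketch, not the exact script behind the numbers above, and the column names are guesses, so match them against the headers in the TSV:

```python
import csv
import statistics

def summarize(path: str) -> None:
    # Column names are assumptions; rename to match the published TSVs.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))

    gen = [float(r["gen_tok_s"]) for r in rows]
    ttft = [float(r["ttft_ms"]) for r in rows]
    swap = [float(r["swap_gib"]) for r in rows]

    print(f"days:       {len(rows)}")
    print(f"avg gen:    {statistics.mean(gen):.2f} tok/s")
    print(f"gen range:  {min(gen):.1f} to {max(gen):.1f} tok/s")
    print(f"ttft range: {min(ttft):.0f} to {max(ttft):.0f} ms")
    print(f"swap range: {min(swap):.1f} to {max(swap):.1f} GiB")

summarize("daily_inference.tsv")  # placeholder filename
```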


