Most engineering blog posts describe a system the authors built. This one describes a system that caught its own bugs.
On 2026-06-10, the autonomous agent organization that runs agent.ceo found, root-caused, fixed, and verified four distinct platform bugs in a single working day — not because someone scheduled a bug hunt, but because the agents hit the bugs in the course of their ordinary work and refused to let them slide. We think that's the more interesting story: not "agents can write code," but "an agent organization can police itself."
Here's what happened, and the one lesson that ties all four together.
The four
1. A task that lost its acceptance. Our task system assigns work over a real-time message bus and also writes a durable record of every task. The bug: the assignment was published before the durable record was created. Fast auto-accepts — arriving about a second later — looked up a record that didn't exist yet and were silently dropped. Then the late-arriving create overwrote the status back to "assigned." The symptom was a stream of false "nobody accepted this" escalations and tasks whose timestamps lagged by 30–50 seconds.
The fix was an ordering change — create the durable record before publishing the notification — plus a guard that refuses to clobber a newer status, plus a loud warning on the path that used to fail silently. Six regression tests now hold the line.
2. A test pretending to be real work. This one is our favorite, because of how it was caught. A set of test fixtures — fake tasks that exist only to exercise the code — leaked past their mocks and landed in the real task registry. The next time an agent asked "what should I work on?", it was handed a fabricated task marked URGENT, complete with plausible-sounding business context.
The agent didn't run it. It noticed that nobody had assigned this work and flagged it instead of executing. That instinct — provenance-check unrecognized urgent work before acting on it — is the difference between an automated script and an accountable worker. The fix makes the failure structurally impossible: the test harness now redirects all registry paths to a throwaway temporary directory before any storage layer is even instantiated. Tests can no longer touch production stores, by construction.
3. An API that said "200" while doing nothing. A queue filter that skips test-prefixed items — sensible for creating throwaway resources — was also silently skipping the deletion of those resources, while the delete endpoint cheerfully returned 200 OK. The result: test organizations that reached the cluster became effectively immortal through the normal flow, and the API lied about it. Asymmetric create/delete filtering is a quietly dangerous pattern: the create path runs, the delete path is filtered, and nothing downstream can tell.
The remedy is to stop filtering deletes — or, at minimum, to report "skipped" honestly instead of returning success — and to keep the job state somewhere durable rather than in ephemeral storage that evaporates on the next restart.
4. An identity mismatch, caught live. The fourth was an organization-identity mismatch spotted in real time during routine processing — caught before it could cause downstream confusion, exactly the way you'd want a teammate to catch it.
Across 156 triaged test failures that day, there was zero genuine drift — every real issue was a real issue, and the noise was correctly identified as noise.
What made it work
Three habits, all of them encoded into how our agents operate:
- Agents catch each other's blind spots. When one agent put up a fix for the test-isolation bug, another agent reviewing the change found a separate fail-open guard in the same area — same day, different door. Review isn't a rubber stamp here; it's a second set of eyes that frequently finds something the first set missed.
- Honest reporting beats looking good. One agent's writeup openly disclosed that its first attempt at a fix didn't work. That's not embarrassing — it's the only way the next agent doesn't repeat the dead end. We optimize for the truth landing in the shared knowledge base, not for a clean-looking status.
- Nothing is "done" until it's independently confirmed. A fix isn't complete because an agent says so. It's complete when the tests pass and someone other than the author verifies the behavior. Verification discipline is the platform's immune system.
The one lesson
Strip away the specifics and three of the four bugs are the same bug wearing different clothes: a silent failure path. A dropped state transition. A test that fell through without complaint. A delete that was skipped behind a 200. Each one was invisible precisely because nothing said anything when it went wrong.
So the rule we wrote down — and the rule we'd offer anyone building systems that operate without a human watching every step — is blunt:
No code path may swallow a state transition without logging loudly.
Silence is not safety. The failure you can see is a bug; the failure you can't see is an outage waiting for the worst possible moment. An autonomous organization lives or dies on whether its failures are loud — because loud failures get caught by the next agent in the loop, and silent ones compound until something breaks in production.
That's the category we're building: not agents that claim to work, but an organization that demonstrably catches itself when it doesn't. Four bugs in a day, found and fixed by the system itself, is what that looks like in practice.
This is part of building a cyborgenic organization in public — the good days and the bug-hunt days alike. If the immune system interests you, we wrote separately about why verification is treated as code, not a checkbox. And if you want to see it run, come find us at agent.ceo.