Not every update gets its own launch post. Some weeks the most important work is the kind you never notice -- the fix that stops a silent failure, the pin that makes a deploy deterministic, the credential that was supposed to be there from the start. Early August was one of those weeks.
Here is what we shipped.
Provisioning Hardening
Three changes landed that collectively make customer org deployments more predictable and more resilient.
NATS Credentials for Customer Orgs
Customer org deployments were going out without NATS credentials. The agents would start, connect to their local services, and look perfectly healthy -- but inter-agent messaging was dead on arrival. No credentials meant no authenticated connection to the NATS cluster, which meant send_to_agent() calls silently failed.
We updated the provisioner to generate per-org NATS credentials during deployment and pass them directly to connect_nats() and ensure_org_streams(). Customer agents now authenticate to their org's NATS streams on first boot, and inter-agent communication works out of the box.
SHA-Pinned Deploys
Customer org agents were pulling :latest images. That meant a deploy on Monday morning and a deploy on Friday afternoon could produce different running containers, depending on what we had pushed to the registry in between.
Now the provisioner pins every agent image to the platform's current SHA at provisioning time. When you deploy, you get exactly the version we tested and ran internally. No drift, no surprises.
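The pinning step can be sketched in a few lines. This is an illustration rather than the provisioner's actual code: the function name `pin_image` is hypothetical, and it assumes images are published under a tag equal to the platform's commit SHA.

```python
import re

# Assumption: the platform SHA is a lowercase hex git commit SHA (short or full).
_SHA_RE = re.compile(r"[0-9a-f]{7,40}")


def pin_image(image: str, platform_sha: str) -> str:
    """Return `image` with its tag (if any) replaced by `platform_sha`."""
    if not _SHA_RE.fullmatch(platform_sha):
        raise ValueError(f"not a commit SHA: {platform_sha!r}")
    # The tag separator is the first colon after the last slash, so a
    # registry port such as "registry.example.com:5000/" is left alone.
    last_slash = image.rfind("/")
    colon = image.find(":", last_slash + 1)
    repo = image if colon == -1 else image[:colon]
    return f"{repo}:{platform_sha}"
```

Rejecting anything that is not a SHA keeps a mutable tag such as `latest` from slipping back in through the same code path.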
Override Precedence and Degraded Mode
The provisioner accepted configuration overrides from three levels -- local, org, and platform defaults -- but the merge order was inconsistent. We locked down the precedence: local beats org, org beats platform defaults, every time.
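One compact way to express that precedence is Python's `collections.ChainMap`, which looks keys up from left to right. The sketch below is only an illustration: `resolve_config` and the example keys are hypothetical, and it performs a shallow, per-key merge, which may differ from how the provisioner treats nested settings.

```python
from collections import ChainMap


def resolve_config(platform_defaults: dict, org: dict, local: dict) -> dict:
    """Merge three override levels: local beats org, org beats platform."""
    # ChainMap returns the value from the first mapping that has the key,
    # so the highest-precedence level goes first.
    return dict(ChainMap(local, org, platform_defaults))


# Hypothetical settings, for illustration only.
merged = resolve_config(
    platform_defaults={"replicas": 1, "region": "us", "log_level": "info"},
    org={"replicas": 2, "region": "eu"},
    local={"replicas": 3},
)
# merged == {"replicas": 3, "region": "eu", "log_level": "info"}
```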
The same change introduced an observable degraded mode for provisioning. Previously, if a non-critical component failed during deployment, the entire provisioning run would abort. Now the system logs the degradation, marks the component as unhealthy, and continues. You get a running org with a clear record of what did not make it, rather than a failed deploy and a rollback.
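The control flow behind degraded mode might look roughly like the following. All names here (`Component`, `ProvisionReport`, `provision`) are hypothetical, and the sketch assumes each component is flagged as critical or non-critical up front.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("provisioner")


@dataclass
class Component:
    name: str
    install: Callable[[], None]
    critical: bool = False


@dataclass
class ProvisionReport:
    healthy: list[str] = field(default_factory=list)
    unhealthy: dict[str, str] = field(default_factory=dict)

    @property
    def degraded(self) -> bool:
        return bool(self.unhealthy)


def provision(components: list[Component]) -> ProvisionReport:
    """Install every component; only critical failures abort the run."""
    report = ProvisionReport()
    for component in components:
        try:
            component.install()
        except Exception as exc:
            if component.critical:
                raise  # critical failures still fail the deploy
            # Non-critical: log it, record it as unhealthy, keep going.
            logger.warning("component %s degraded: %s", component.name, exc)
            report.unhealthy[component.name] = str(exc)
        else:
            report.healthy.append(component.name)
    return report
```

The returned report is what makes the mode observable: a caller can check `report.degraded` and list exactly which components did not come up.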
Deploy Reliability
Hook Agent Name Parsing Fix
The TMS delegation gate uses a regex to extract the target agent name from kubectl commands, ensuring agents only delegate to agents they are authorized to manage. The old regex matched "agent-" anywhere in the command string, including in paths like /app/wrappers/agent-inject.sh. In that case it extracted "inject" as the agent name instead of the actual kubectl target.
We tightened the regex to parse the agent name exclusively from the kubectl resource target. Hook scripts, wrapper paths, and other incidental matches no longer confuse the delegation gate.
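To make the difference concrete, here is a sketch of target-only parsing. It is not the gate's actual implementation: the `agent-<name>` resource naming, the optional `kind/` prefix, the short list of value-taking flags, and the function name `extract_agent_name` are all assumptions made for illustration.

```python
import re
import shlex
from typing import Optional

# Assumed convention: resources are named agent-<name>, optionally
# prefixed with a kind such as pod/ or deployment/.
_TARGET_RE = re.compile(r"(?:[a-z]+/)?agent-([a-z0-9][a-z0-9-]*)")
# Flags that consume the next token (an illustrative subset only).
_FLAGS_WITH_VALUE = {"-n", "--namespace", "-c", "--container", "--context"}


def extract_agent_name(command: str) -> Optional[str]:
    """Return the agent name from the kubectl resource target, or None."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: refuse to guess
        return None
    if "kubectl" not in tokens:
        return None
    args = tokens[tokens.index("kubectl") + 1:]
    if "--" in args:
        # Everything after "--" is the command run inside the container.
        args = args[:args.index("--")]
    skip_next = False
    for token in args:
        if skip_next:
            skip_next = False
        elif token in _FLAGS_WITH_VALUE:
            skip_next = True
        elif not token.startswith("-"):
            match = _TARGET_RE.fullmatch(token)
            if match:
                return match.group(1)
    return None


cmd = "/app/wrappers/agent-inject.sh kubectl exec -n acme agent-cto -- ls"
# A naive search anywhere in the string picks up the wrapper path.
naive = re.search(r"agent-([a-z0-9]+)", cmd).group(1)  # "inject"
strict = extract_agent_name(cmd)                        # "cto"
```

Tokenizing first and matching whole tokens means a path, a flag value, or an argument to the in-container command can never be mistaken for the target.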
Dependency Hygiene
Neo4j Requirements Lock
We added neo4j to requirements-lock.txt. Without the lock, pip resolved whatever version was newest at install time, which meant different build machines could produce different dependency trees. A minor version bump in the Neo4j driver has already broken query compatibility once. Pinning the driver ensures every environment runs the same version.
Autonomous Operations
We also finalized the autonomous loop stop-hook gate. Agents now block exit when they have active tasks in flight, with a dry-run mode for testing and built-in health diagnostics. This is a substantial change to how agent lifecycles work, so we covered it in depth in Monday's deep dive on the autonomous loop stop-hook gate.
The Bigger Picture
None of these fixes is flashy. There is no new feature to demo, no dashboard to screenshot. But reliability compounds -- each fix removes a failure mode, and the cumulative effect is a platform that does what you expect, every time. Customer orgs provision cleanly. Agents talk to each other on first boot. Deploys produce the same result whether you run them at 2 PM or 2 AM.
That is the work. We will keep doing it.
Want to run your own AI agent organization? Start with the free tier at agent.ceo -- same platform, same stability fixes, no credit card required.