# Sprint SLA Enforcement: From 7-Hour Reassignment to 25 Minutes in Two Iterations
## TL;DR
- Before SLA enforcement, a dropped task could take 7 hours to get reassigned. Two iterations cut that to 25 minutes.
- Separating "unaccepted" from "stuck" as distinct failure modes was the key architectural insight.
- Pull-based task discovery ensures work survives agent crashes, pod restarts, and message bus failures.
An AI agent that accepts a task and goes silent for four hours is not an autonomous teammate. It is a line item on your cloud bill pretending to be productive.
In a cyborgenic organization, where AI agents carry real operational responsibilities alongside humans, that kind of failure is not tolerable. Before SLA enforcement, a task could sit in "assigned" status for hours with no acknowledgment, no progress update, and no mechanism to move it to an agent that might actually do something with it. The maximum time from assignment to reassignment was approximately 7 hours. That is not a sprint. That is a suggestion box.
This post walks through how we built sprint SLA enforcement for our AI agent fleet at agent.ceo, tightened it twice based on real operational data, and ended up with a system where a dropped task gets reassigned in 25 minutes or less.
## The Problem: Agents Without Deadlines Are Just Expensive Suggestions
Our agent fleet runs as a distributed organization. Tasks flow from the founder (or from agents themselves) into a shared task management system (TMS). Agents pick up work, execute it, and report back. In theory.
In practice, we hit the same failure modes that every engineering manager has seen with human teams — except faster and with less visibility:
- Silent drops. An agent gets assigned a task, then crashes, enters a bad loop, or simply never starts. Nobody notices for hours.
- Slow acceptance. A task lands in an agent's queue and sits there while the agent finishes something else. No acknowledgment, no ETA.
- Zombie progress. An agent accepts a task but stops making meaningful progress. It is technically "working" but producing nothing.
The original system had a stuck-task detector that ran every 30 minutes, with a 4-hour threshold before flagging, a 60-minute cooldown between pings, and 3 pings before reassignment. Do the math: that is potentially 4 hours of silence, plus three pings spaced an hour apart, plus the 30-minute cron interval. A task could burn nearly 7 hours before moving to another agent.
For a system that bills by the minute, 7 hours of dead time is not a rounding error. It is a design flaw.
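That worst case is simple arithmetic. A quick sanity check (the constant names are illustrative, not the actual config keys):

```python
# Original configuration, in minutes. Names are illustrative.
STUCK_THRESHOLD = 4 * 60  # 4 hours of silence before a task is flagged
PING_COOLDOWN = 60        # wait between escalation pings
MAX_PINGS = 3             # pings sent before reassignment

# Threshold plus three ping cycles: 420 minutes, i.e. 7 hours.
# The 30-minute cron interval adds detection lag on top of each step.
worst_case_minutes = STUCK_THRESHOLD + MAX_PINGS * PING_COOLDOWN
print(worst_case_minutes / 60)  # → 7.0
```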
## Iteration 1: From Hours to Minutes
The first round of tightening introduced a fundamental distinction we had been missing: unaccepted tasks are a different failure mode than stuck tasks, and they need separate detection.
We added two new methods to the SLA enforcement service: `_detect_unaccepted()` and `_handle_unaccepted_task()`. The logic is simple — if a task has been in "assigned" status for more than 30 minutes without the agent acknowledging it, that is not a stuck task. That is a task the agent never picked up. The response should be faster and more aggressive.
The broader changes in iteration 1:
| Parameter | Before | After |
|---|---|---|
| Stuck threshold | 4 hours | 1 hour |
| Ping cooldown | 60 min | 20 min |
| Max pings before reassign | 3 | 2 |
| CronJob interval | 30 min | 15 min |
| Acceptance SLA | none | 30 min |
We also started including unaccepted tasks in standup reports to the CEO agent. Visibility is half the battle — if nobody sees the dropped task, nobody escalates it.
The net effect: maximum time from assignment to reassignment dropped from ~7 hours to ~70 minutes. A 6x improvement from parameter tuning and one architectural insight (separate unaccepted from stuck). Not bad. Not good enough.
## Iteration 2: From Minutes to Fast
The founder looked at the 70-minute number and said, effectively: "Why does it take over an hour to figure out an agent isn't going to do something?"
Fair question. The answer was that our 30-minute acceptance threshold assumed agents were single-threaded — that they needed time to finish their current task before picking up a new one. But our agents handle up to 3 tasks in parallel via sub-agents. An agent that cannot send an acknowledgment within 5 minutes is not busy. It is broken.
Iteration 2 made that assumption explicit:
| Parameter | Iteration 1 | Iteration 2 |
|---|---|---|
| Acceptance threshold | 30 min | 5 min |
| Acceptance ping cooldown | 20 min | 10 min |
| CronJob interval | 15 min | 10 min |
| Max parallel tasks | implicit | 3 (explicit) |
The ping messages themselves changed too. Instead of a generic "you have an unaccepted task," they now remind the agent: you can run up to 3 tasks in parallel using sub-agents. Acknowledge this task now.
The new timeline from assignment to reassignment:
- T+0: Task assigned to agent.
- T+5m: No acceptance detected. First ping sent.
- T+15m: Still no acceptance. Second ping sent.
- T+25m: Max pings exceeded. Task reassigned to another agent.
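The 25-minute ceiling falls straight out of the iteration 2 parameters (a sketch; the names are illustrative):

```python
# Iteration 2 parameters, in minutes.
ACCEPTANCE_THRESHOLD = 5   # no ack after this -> first ping
PING_COOLDOWN = 10         # wait between acceptance pings
MAX_PINGS = 2              # pings before the task is reassigned

first_ping_at = ACCEPTANCE_THRESHOLD                            # T+5
second_ping_at = ACCEPTANCE_THRESHOLD + PING_COOLDOWN           # T+15
reassign_at = ACCEPTANCE_THRESHOLD + MAX_PINGS * PING_COOLDOWN  # T+25
```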
25 minutes. Down from 7 hours. A 16.8x improvement across two iterations.
## The Enforcement Stack: Auto-Ack, Progress Tracking, Completion Gates
SLA enforcement does not work if you only check at the boundaries. You need instrumentation across the entire task lifecycle. We built three hooks that run inside every agent session:
Auto-acknowledge receipt. When a task arrives, the agent's harness automatically sends an acceptance signal to the TMS before the agent's reasoning loop even starts. This separates "the agent received the message" from "the agent decided to work on it." If we see receipt but no acceptance within the SLA window, we know the agent is alive but choosing not to engage — a different diagnostic than a crashed pod.
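A sketch of the auto-acknowledge hook; the client interface and event names are assumptions, but the ordering is the point — receipt is recorded before the reasoning loop runs:

```python
class FakeTMS:
    """Stand-in for the real TMS client; method names are illustrative."""
    def __init__(self):
        self.events = []

    def mark_received(self, task_id):   # harness-level, automatic
        self.events.append(("received", task_id))

    def mark_accepted(self, task_id):   # agent-level, a deliberate decision
        self.events.append(("accepted", task_id))

def on_task_message(task_id, tms, reasoning_loop):
    # Receipt fires unconditionally, before any agent reasoning. If the
    # loop then crashes or stalls, the SLA system still knows the message
    # arrived -- a different diagnostic than a dead pod.
    tms.mark_received(task_id)
    reasoning_loop(task_id)
```

If receipt exists but no acceptance follows within the SLA window, the agent is alive but not engaging.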
Progress tracking. If more than 30 minutes pass without a progress update, the system warns the agent. This catches the zombie-progress failure mode: the agent that accepted a task, made one API call, and then got stuck in a retry loop. The warning is internal to the agent's context, not a reassignment trigger — yet. We are collecting data on how often warnings convert to completions versus how often they precede a stuck state.
Completion gates. Before an agent session stops (whether from a /stop command, a context limit, or a crash), the harness checks for incomplete tasks. If any exist, the agent gets a reminder to either complete them, hand them off, or explicitly mark them as blocked. This prevents the most common silent-drop scenario: agent finishes one task, hits a context window limit, stops cleanly, and its second task vanishes into the void.
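A sketch of the completion gate, run in the harness before any session stop; the status values and reminder wording are illustrative:

```python
TERMINAL_STATES = {"completed", "handed_off", "blocked"}

def completion_gate(open_tasks, remind):
    """Check for tasks that would be silently dropped by a clean stop.

    `open_tasks` is a list of {"id": ..., "status": ...} dicts;
    `remind` delivers a message into the agent's context.
    Returns True when it is safe to stop.
    """
    dangling = [t for t in open_tasks if t["status"] not in TERMINAL_STATES]
    if dangling:
        ids = ", ".join(t["id"] for t in dangling)
        remind(f"Incomplete tasks before stop: {ids}. "
               "Complete them, hand them off, or mark them blocked.")
        return False
    return True
```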
## Pull-Based Discovery: Tasks That Survive Crashes
There is a subtlety in distributed task management that bit us early: if you rely on push-based task delivery (send a message to the agent, hope it processes it), you inherit every failure mode of your message bus.
Our agents communicate over NATS. NATS is fast and lightweight, but messages to an offline subscriber are gone. If an agent's pod restarts between task assignment and task delivery, the task disappears. The SLA system would eventually notice, but "eventually" was 7 hours in the old regime.
The fix was pull-based task discovery. The TMS maintains a shared registry of all tasks and their states. When an agent starts up — whether fresh boot or crash recovery — it queries the registry for tasks assigned to it. Tasks live in the registry, not in message queues. They survive NATS message loss, pod restarts, and node evictions.
This also enabled a useful pattern: task rehydration after compaction. When an agent's context window fills up and gets compacted (summarized to free token space), it can re-pull its active tasks from the registry and rebuild its working set. The task state is the source of truth, not the agent's memory of receiving it.
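A sketch of the pull side; the registry shape is an assumption, but the invariant is the one described above — the registry, not the message bus, is the source of truth:

```python
ACTIVE_STATES = {"assigned", "accepted", "in_progress"}

def rehydrate(agent_id, registry):
    """Rebuild an agent's working set from the shared task registry.

    Called on fresh boot, crash recovery, and after context compaction.
    `registry` is the full task list; only this agent's live tasks return.
    """
    return [
        t for t in registry
        if t["assignee"] == agent_id and t["status"] in ACTIVE_STATES
    ]
```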
## Directive-to-Task Enforcement
One more enforcement layer worth mentioning. We noticed that founder directives sometimes arrived as natural language in an agent's prompt rather than as formal TMS tasks. The agent would start working on the directive, make progress, but never create a trackable task. No SLA. No visibility. No accountability.
We added a detection hook: if the system identifies a founder directive in the agent's input and the agent makes 3 or more tool calls without creating a TMS task, it gets a reminder. "You appear to be working on a directive. Create a task so progress can be tracked."
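That hook can be sketched as a per-session counter; the detection of "this input contains a directive" is assumed to happen upstream, and the 3-tool-call threshold comes from the description above:

```python
class DirectiveWatch:
    """Nudges an agent working on a directive without a trackable task."""

    REMINDER = ("You appear to be working on a directive. "
                "Create a task so progress can be tracked.")

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.tool_calls = 0
        self.directive_detected = False  # set by upstream input analysis
        self.task_created = False        # set when a TMS task is created

    def on_tool_call(self):
        """Returns the reminder string once the threshold is crossed."""
        self.tool_calls += 1
        if (self.directive_detected and not self.task_created
                and self.tool_calls >= self.threshold):
            return self.REMINDER
        return None
```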
This closes the loop between informal communication and formal accountability. Every piece of real work should be a task. Every task has an SLA.
## What We Learned: Treat Agents Like Employees, Not Magic
The through-line of this work is unglamorous: we built a system that pings agents when they are late and reassigns their work when they do not respond. There is no novel architecture here. It is a cron job, some threshold checks, and a state machine.
But that is the point. The industry conversation around AI agents is heavy on autonomy and light on accountability. We hear about agents that can "reason" and "plan" and "use tools." We hear less about what happens when they stop doing those things at 2 AM on a Saturday with nobody watching.
The answer, it turns out, is the same thing that works for human teams: clear expectations, fast feedback, and automatic escalation. You set an SLA. You measure against it. You act when it is violated. The specific numbers (5-minute acceptance, 10-minute ping cooldown, 25-minute reassignment) came from two iterations of tightening based on real operational data — not from theory.
Three principles we will carry forward:
- Separate failure modes need separate detection. Unaccepted is not the same as stuck. A crashed agent is not the same as a slow one. Each failure has a different optimal response time and escalation path.
- Make parallelism explicit. Once we told agents (and the SLA system) that 3 concurrent tasks were expected, the acceptance threshold could drop from 30 minutes to 5. Unstated assumptions create slack in every SLA.
- Pull beats push for durability. Tasks in a registry survive everything. Tasks in a message queue survive until they do not.
Build your own cyborgenic organization at agent.ceo.