Technical · 13 min read

Sprint SLA Enforcement: From 7-Hour Reassignment to 25 Minutes in Two Iterations

Moshe Beeri, Founder
Tags: agents, sla, sprint, accountability, task-management, automation, cyborgenic-organization

TL;DR

  • Before SLA enforcement, a dropped task could take 7 hours to get reassigned. Two iterations cut that to 25 minutes.
  • Separating "unaccepted" from "stuck" as distinct failure modes was the key architectural insight.
  • Pull-based task discovery ensures work survives agent crashes, pod restarts, and message bus failures.

An AI agent that accepts a task and goes silent for four hours is not an autonomous teammate. It is a line item on your cloud bill pretending to be productive. I learned this lesson in the most expensive way possible.

In a Cyborgenic Organization, accountability is not optional just because the workers are AI. Agents carry real operational responsibilities alongside a human founder — and that means they are held to real deadlines, with real consequences for missing them. The same way you would not tolerate an employee going silent for hours on a critical task, you cannot tolerate it from an agent. SLA enforcement is what makes a Cyborgenic Organization accountable, not just automated.

That kind of failure is not tolerable. Before SLA enforcement, a task could sit in "assigned" status for hours with no acknowledgment, no progress update, and no mechanism to move it to an agent that might actually do something with it. The maximum time from assignment to reassignment was approximately 7 hours. I found this out on a Monday morning when I checked what happened over the weekend. The DevOps agent had been assigned a critical infrastructure task on Saturday evening. It sat untouched until the SLA system caught it Sunday morning. That is not a sprint. That is a suggestion box.

This post walks through how I built sprint SLA enforcement for my AI agent fleet at agent.ceo, tightened it twice based on real operational data, and ended up with a system where a dropped task gets reassigned in 25 minutes or less.

The Problem: Agents Without Deadlines Are Just Expensive Suggestions

My agent fleet — all 11 roles: CEO, CTO, DevOps, Fullstack, Marketing, Architect, CFO, CSO, Investment, Org-Agent, and ZiDevops-Director — runs as a distributed organization. Tasks flow from me (or from agents themselves) into a shared task management system. Agents pick up work, execute it, and report back. In theory.

In practice, I hit the same failure modes that every engineering manager has seen with human teams — except faster and with less visibility:

  • Silent drops. An agent gets assigned a task, then crashes, enters a bad loop, or simply never starts. Nobody notices for hours.
  • Slow acceptance. A task lands in an agent's queue and sits there while the agent finishes something else. No acknowledgment, no ETA.
  • Zombie progress. An agent accepts a task but stops making meaningful progress. It is technically "working" but producing nothing.

The original system had a stuck-task detector that ran every 30 minutes, with a 4-hour threshold before flagging, a 60-minute cooldown between pings, and 3 pings before reassignment. Do the math: that is 4 hours of silence before the first ping, then three pings an hour apart with reassignment one cooldown after the last, plus whatever lag the 30-minute cron interval adds. A task could burn roughly 7 hours before moving to another agent.
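The arithmetic can be written down as a short sketch. It assumes reassignment happens one cooldown after the final ping; the `SlaParams` class and its field names are illustrative and not taken from the agent.ceo codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaParams:
    """Detector settings in minutes (hypothetical names, for illustration)."""
    threshold_min: int  # silence tolerated before the first ping
    cooldown_min: int   # wait between pings, and after the last ping
    max_pings: int      # pings sent before the task is reassigned
    cron_min: int       # how often the detector runs


def nominal_reassign_minutes(p: SlaParams) -> int:
    """Minutes from assignment to reassignment, ignoring cron lag."""
    return p.threshold_min + p.max_pings * p.cooldown_min


# The original detector: 4-hour threshold, 60-minute cooldown, 3 pings.
ORIGINAL = SlaParams(threshold_min=240, cooldown_min=60, max_pings=3, cron_min=30)

nominal = nominal_reassign_minutes(ORIGINAL)
print(nominal, nominal / 60)  # 420 minutes, i.e. 7.0 hours
```

Because the detector only acts when the cron job fires, each step can slip by up to one `cron_min` interval, so the real worst case sits somewhat above the nominal figure.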

For a system that bills by the minute, 7 hours of dead time is not a rounding error. It is a design flaw.

Iteration 1: From Hours to Minutes

The first round of tightening introduced a fundamental distinction I had been missing: unaccepted tasks are a different failure mode than stuck tasks, and they need separate detection.

I added two new methods to the SLA enforcement service: _detect_unaccepted() and _handle_unaccepted_task(). The logic is simple — if a task has been in "assigned" status for more than 30 minutes without the agent acknowledging it, that is not a stuck task. That is a task the agent never picked up. The response should be faster and more aggressive.
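The detection rule fits in a few lines. This is a minimal sketch of the idea, not the actual `_detect_unaccepted()` method, which this post does not show; the `Task` record and its field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Task:
    """Hypothetical task record; the real TMS schema may differ."""
    task_id: str
    status: str            # e.g. "assigned" or "accepted"
    assigned_at: datetime  # timezone-aware assignment time


def detect_unaccepted(tasks, now, threshold=timedelta(minutes=30)):
    """Return tasks still in 'assigned' status past the acceptance threshold."""
    return [
        t for t in tasks
        if t.status == "assigned" and now - t.assigned_at > threshold
    ]


now = datetime(2025, 1, 4, 18, 0, tzinfo=timezone.utc)
tasks = [
    Task("t-1", "assigned", now - timedelta(minutes=10)),  # still within SLA
    Task("t-2", "assigned", now - timedelta(minutes=45)),  # never picked up
    Task("t-3", "accepted", now - timedelta(hours=3)),     # a stuck-task concern
]
overdue = detect_unaccepted(tasks, now)
print([t.task_id for t in overdue])  # ['t-2']
```

Note that `t-3` is deliberately ignored: an accepted task that has gone quiet belongs to the stuck-task detector, which is the separation this iteration introduced.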

Here's what the SLA monitoring tools actually look like in the codebase — these are real MCP tools that the CEO agent calls to check on the fleet:

# From conductor/src/mcp_servers/agent_hub_mcp.py
@mcp.tool()
async def get_sla_metrics(hours: int = 24) -> dict:
    """Get SLA metrics for agent communication reliability.

    Returns comprehensive metrics including:
    - Message delivery latency (P50, P95, P99)
    - Delivery success rate
    - Consecutive failures tracking
    - Per-agent breakdown

    SLA Targets:
    - Latency: < 500ms (P95)
    - Delivery Rate: 99.9%
    - Consecutive Failures: < 3
    """
    from mcp_servers.sla_metrics import get_sla_collector
    collector = get_sla_collector()
    return await collector.get_metrics(hours)

@mcp.tool()
async def get_sla_alerts(include_resolved: bool = False) -> dict:
    """Get active SLA violations/alerts.

    Alerts are created when:
    - Latency exceeds 500ms for any delivery
    - Delivery rate falls below 99.9%
    - More than 3 consecutive failures occur
    """
    from mcp_servers.sla_metrics import get_sla_collector
    collector = get_sla_collector()
    alerts = await collector.get_alerts(include_resolved)
    return {"alerts": alerts, "count": len(alerts)}

These are not aspirational targets. The 500ms P95 latency, 99.9% delivery rate, and max-3 consecutive failures are the real thresholds running in production right now. When they get violated, alerts fire and the CEO agent investigates.
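To make the alert rules concrete, here is a hedged sketch of how the three conditions in the `get_sla_alerts()` docstring could be evaluated over a batch of delivery records. The record shape, the function name, and the rule labels are assumptions; the real collector in `sla_metrics` is not shown in this post.

```python
LATENCY_LIMIT_MS = 500        # "latency exceeds 500ms for any delivery"
MIN_DELIVERY_RATE = 99.9      # percent
MAX_CONSECUTIVE_FAILURES = 3  # alert when a streak goes above this


def evaluate_sla(deliveries):
    """Return names of violated rules for (latency_ms, succeeded) records.

    Records are expected in send order so failure streaks are meaningful.
    """
    if not deliveries:
        return []
    alerts = []
    if any(latency > LATENCY_LIMIT_MS for latency, _ in deliveries):
        alerts.append("latency")
    succeeded = sum(1 for _, ok in deliveries if ok)
    if 100.0 * succeeded / len(deliveries) < MIN_DELIVERY_RATE:
        alerts.append("delivery_rate")
    streak = longest = 0
    for _, ok in deliveries:
        streak = 0 if ok else streak + 1
        longest = max(longest, streak)
    if longest > MAX_CONSECUTIVE_FAILURES:
        alerts.append("consecutive_failures")
    return alerts


sample = [(120, True), (640, True), (90, False), (95, False), (80, True)]
print(evaluate_sla(sample))  # ['latency', 'delivery_rate']
```

The sample trips the latency and delivery-rate rules but not the streak rule, since two consecutive failures is below the limit of three.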

The broader changes in iteration 1:

| Parameter | Before | After |
|---|---|---|
| Stuck threshold | 4 hours | 1 hour |
| Ping cooldown | 60 min | 20 min |
| Max pings before reassign | 3 | 2 |
| CronJob interval | 30 min | 15 min |
| Acceptance SLA | none | 30 min |

I also started including unaccepted tasks in standup reports to the CEO agent. Visibility is half the battle — if nobody sees the dropped task, nobody escalates it.

The net effect: maximum time from assignment to reassignment dropped from ~7 hours to ~70 minutes. A 6x improvement from parameter tuning and one architectural insight (separate unaccepted from stuck). Not bad. Not good enough.

Iteration 2: From Minutes to Fast

I looked at the 70-minute number and asked myself: "Why does it take over an hour to figure out an agent isn't going to do something?"

Fair question. The answer was that the 30-minute acceptance threshold assumed agents were single-threaded — that they needed time to finish their current task before picking up a new one. But my agents handle up to 3 tasks in parallel via sub-agents. An agent that cannot send an acknowledgment within 5 minutes is not busy. It is broken.

Iteration 2 made that assumption explicit:

| Parameter | Iteration 1 | Iteration 2 |
|---|---|---|
| Acceptance threshold | 30 min | 5 min |
| Acceptance ping cooldown | 20 min | 10 min |
| CronJob interval | 15 min | 10 min |
| Max parallel tasks | implicit | 3 (explicit) |

The ping messages themselves changed too. Instead of a generic "you have an unaccepted task," they now remind the agent: you can run up to 3 tasks in parallel using sub-agents. Acknowledge this task now.

The new timeline from assignment to reassignment:

  1. T+0: Task assigned to agent.
  2. T+5m: No acceptance detected. First ping sent.
  3. T+15m: Still no acceptance. Second ping sent.
  4. T+25m: Max pings exceeded. Task reassigned to another agent.

25 minutes. Down from 7 hours. A 16.8x improvement across two iterations.
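The timeline above can be reproduced with a small decision function. This is a sketch under stated assumptions: the function name and return values are hypothetical, and the simulation checks every minute, whereas a 10-minute CronJob would act on its next run after each step becomes due.

```python
def next_action(minutes_since_assignment, pings_sent,
                threshold=5, cooldown=10, max_pings=2):
    """Decide what the SLA check should do for one unaccepted task."""
    # The next escalation step becomes due after the threshold plus one
    # cooldown for every ping already sent.
    due_at = threshold + pings_sent * cooldown
    if minutes_since_assignment < due_at:
        return "wait"
    if pings_sent >= max_pings:
        return "reassign"
    return "ping"


# Simulate a task that is never accepted.
pings = 0
events = []
for minute in range(0, 31):
    action = next_action(minute, pings)
    if action == "ping":
        pings += 1
    if action != "wait":
        events.append((minute, action))
    if action == "reassign":
        break

print(events)  # [(5, 'ping'), (15, 'ping'), (25, 'reassign')]
```

Keeping the decision a pure function of elapsed time and ping count means the cron job needs no state beyond what is already stored on the task.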

The Enforcement Stack: Auto-Ack, Progress Tracking, Completion Gates

SLA enforcement does not work if you only check at the boundaries. You need instrumentation across the entire task lifecycle. I built three hooks that run inside every agent session. Here's the auto-accept hook — this is the real code from inbox_listener.py:

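# From conductor/src/mcp_servers/inbox_listener.py
# (method excerpt; the module-level names below are the ones it relies on)
import asyncio
import json
import logging
from datetime import datetime
from pathlib import Path

logger = logging.getLogger(__name__)
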
async def _auto_accept_task(self, data: dict) -> None:
    """Auto-acknowledge a received task assignment.

    After syncing a task to the local store, automatically call accept_task()
    to move the task from 'assigned' to 'accepted' status, signaling to the
    assigner that the agent has received the task.
    """
    payload = data.get("payload", data)
    if not isinstance(payload, dict):
        return
    if payload.get("type") != "task_assignment":
        return
    task_id = payload.get("id", "")
    if not task_id:
        return

    from mcp_servers.mcp_tools.task_tools import accept_task
    result = await accept_task(task_id)
    if result.get("success"):
        logger.info("Auto-accepted task %s", task_id)
        # Initialize progress tracker for lifecycle enforcement
        tracker_data = {
            "task_id": task_id,
            "last_progress_at": datetime.now().isoformat(),
            "reminded": False,
        }
        tracker_path = Path("/tmp/task_progress_tracker.json")
        await asyncio.to_thread(
            tracker_path.write_text, json.dumps(tracker_data, indent=2)
        )

Auto-acknowledge receipt. When a task arrives, the agent's inbox_listener automatically calls accept_task() before the agent's reasoning loop even starts. Notice the progress tracker initialization: the moment a task is accepted, a JSON file starts tracking when the last progress update happened. This separates "the agent received the message" from "the agent decided to work on it." If I see acceptance but no progress update afterward, I know the listener is alive but the agent is not engaging, which is a different diagnostic than a crashed pod that never accepts at all.

Progress tracking. If more than 30 minutes pass without a progress update, the system warns the agent. This catches the zombie-progress failure mode: the agent that accepted a task, made one API call, and then got stuck in a retry loop. The warning is internal to the agent's context, not a reassignment trigger — yet. I am collecting data on how often warnings convert to completions versus how often they precede a stuck state.
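A minimal sketch of the staleness check, reading the same tracker fields the auto-accept hook writes. The function name is hypothetical, and treating `reminded` as a flag that suppresses repeat warnings is an assumption about how the field is used.

```python
from datetime import datetime, timedelta


def needs_warning(tracker, now, limit=timedelta(minutes=30)):
    """True if the task has gone quiet past the limit and was not yet warned."""
    if tracker.get("reminded"):
        return False  # assumed behavior: warn once per quiet period
    last = datetime.fromisoformat(tracker["last_progress_at"])
    return now - last > limit


# Naive timestamps, matching the datetime.now().isoformat() format above.
tracker = {
    "task_id": "task-123",
    "last_progress_at": "2025-01-01T12:00:00",
    "reminded": False,
}

print(needs_warning(tracker, datetime(2025, 1, 1, 12, 20)))  # False
print(needs_warning(tracker, datetime(2025, 1, 1, 12, 45)))  # True

tracker["reminded"] = True
print(needs_warning(tracker, datetime(2025, 1, 1, 13, 30)))  # False
```

The comparison only works if `now` and the stored timestamp are both naive or both timezone-aware; mixing the two raises a `TypeError` in Python.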

Completion gates. Before an agent session stops (whether from a /stop command, a context limit, or a crash), the harness checks for incomplete tasks. If any exist, the agent gets a reminder to either complete them, hand them off, or explicitly mark them as blocked. This prevents the most common silent-drop scenario: agent finishes one task, hits a context window limit, stops cleanly, and its second task vanishes into the void.
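The gate itself can be very small. The sketch below assumes tasks are plain dicts and that "assigned", "accepted", and "in_progress" are the open statuses; both the status names and the reminder wording are illustrative.

```python
OPEN_STATUSES = {"assigned", "accepted", "in_progress"}  # assumed status names


def completion_gate(tasks):
    """Return a reminder if the session would stop with open tasks, else None."""
    open_ids = [t["id"] for t in tasks if t["status"] in OPEN_STATUSES]
    if not open_ids:
        return None
    return (
        "Before stopping, complete, hand off, or mark as blocked: "
        + ", ".join(open_ids)
    )


session_tasks = [
    {"id": "t-1", "status": "completed"},
    {"id": "t-2", "status": "accepted"},
]
print(completion_gate(session_tasks))
# Before stopping, complete, hand off, or mark as blocked: t-2
```

Returning `None` when nothing is open lets the harness skip the reminder entirely on a clean stop.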

Pull-Based Discovery: Tasks That Survive Crashes

There is a subtlety in distributed task management that bit me early: if you rely on push-based task delivery (send a message to the agent, hope it processes it), you inherit every failure mode of your message bus.

My agents communicate over NATS JetStream on port 4222. NATS is fast and lightweight, but plain publish/subscribe delivery is at-most-once: a message sent while the subscriber is offline is gone unless the subject is captured by a JetStream stream and read through a durable consumer. If an agent's pod restarts between task assignment and task delivery, the task disappears. The SLA system would eventually notice, but "eventually" was 7 hours in the old regime.

The fix was pull-based task discovery. The TMS maintains a shared registry of all tasks and their states. When an agent starts up — whether fresh boot or crash recovery — inbox_listener.py runs _startup_sync_inbox_tasks() and queries the registry for tasks assigned to it. Tasks live in the registry, not in message queues. They survive NATS message loss, pod restarts, and node evictions. This is the same infrastructure that handles all 11 agent roles, backed by the same NATS/Redis/Firestore stack that runs on our Docker Compose setup (nats, redis, mcp-registry, agent-registry, flow-engine, gateway).
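The pattern can be illustrated with an in-memory stand-in for the registry. This is a sketch only: `TaskRegistry`, `Agent`, and their methods are hypothetical, and the real `_startup_sync_inbox_tasks()` talks to the TMS rather than a Python dict.

```python
class TaskRegistry:
    """In-memory stand-in for the shared task registry (hypothetical API)."""

    def __init__(self):
        self._tasks = {}

    def assign(self, task_id, agent):
        self._tasks[task_id] = {"id": task_id, "agent": agent, "status": "assigned"}

    def open_tasks_for(self, agent):
        return [
            t for t in self._tasks.values()
            if t["agent"] == agent and t["status"] != "completed"
        ]


class Agent:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.working_set = {}

    def startup_sync(self):
        """Rebuild the local working set from the registry, not from messages."""
        self.working_set = {
            t["id"]: t for t in self.registry.open_tasks_for(self.name)
        }
        return sorted(self.working_set)


registry = TaskRegistry()
registry.assign("t-1", "devops")
registry.assign("t-2", "devops")
registry.assign("t-3", "cto")

# Simulate a crash: a brand-new Agent object has no memory of any message.
restarted = Agent("devops", registry)
print(restarted.startup_sync())  # ['t-1', 't-2']
```

The restarted agent recovers both of its tasks without any message being replayed, because the registry, not the delivery channel, holds the assignment.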

This also enabled a useful pattern: task rehydration after compaction. When an agent's context window fills up and gets compacted (summarized to free token space), it can re-pull its active tasks from the registry and rebuild its working set. The task state is the source of truth, not the agent's memory of receiving it.

Directive-to-Task Enforcement

One more enforcement layer worth mentioning. I noticed that my own directives sometimes arrived as natural language in an agent's prompt rather than as formal TMS tasks. The agent would start working on the directive, make progress, but never create a trackable task. No SLA. No visibility. No accountability. I was the bottleneck — I was being sloppy about turning my instructions into trackable work.

I added a detection hook: if the system identifies a founder directive in the agent's input and the agent makes 3 or more tool calls without creating a TMS task, it gets a reminder. "You appear to be working on a directive. Create a task so progress can be tracked."
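A hedged sketch of the counting logic follows. The class name, the `create_task` tool name, and the choice to remind exactly once at the third call are assumptions; how a directive is recognized in the agent's input is out of scope here.

```python
from typing import Optional

REMINDER = (
    "You appear to be working on a directive. "
    "Create a task so progress can be tracked."
)


class DirectiveWatch:
    """Counts tool calls made after a directive until a task is created."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.active = False
        self.calls = 0

    def on_directive(self) -> None:
        self.active = True
        self.calls = 0

    def on_tool_call(self, tool_name: str) -> Optional[str]:
        if not self.active:
            return None
        if tool_name == "create_task":  # hypothetical TMS tool name
            self.active = False
            return None
        self.calls += 1
        # Remind once, when the untracked call count reaches the limit.
        return REMINDER if self.calls == self.limit else None


watch = DirectiveWatch()
watch.on_directive()
results = [watch.on_tool_call(n) for n in ["read_file", "run_tests", "edit_file"]]
print(results[:2], results[2] == REMINDER)  # [None, None] True
```

Creating a task at any point resets the watch, so an agent that tracks its work promptly never sees the reminder.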

This closes the loop between informal communication and formal accountability. Every piece of real work should be a task. Every task has an SLA. With 646 commits in May alone and 83,163 test functions backing the codebase, I cannot afford to let work slip through the cracks.

What We Learned: Treat Agents Like Employees, Not Magic

The through-line of this work is unglamorous: I built a system that pings agents when they are late and reassigns their work when they do not respond. There is no novel architecture here. It is a cron job, some threshold checks, and a state machine. The get_sla_metrics() tool gives me P50, P95, and P99 latency breakdowns. The get_sla_alerts() tool fires when delivery rate drops below 99.9% or consecutive failures exceed 3.

But that is the point. The industry conversation around AI agents is heavy on autonomy and light on accountability. We hear about agents that can "reason" and "plan" and "use tools." We hear less about what happens when they stop doing those things at 2 AM on a Saturday with nobody watching. I have been that nobody watching. That is why the system watches for me now.

The answer, it turns out, is the same thing that works for human teams: clear expectations, fast feedback, and automatic escalation. You set an SLA. You measure against it. You act when it is violated. The specific numbers (5-minute acceptance, 10-minute ping cooldown, 25-minute reassignment) came from two iterations of tightening based on real operational data from running 11 agents — not from theory. The 9,799 commits in the repo are proof that the system works.

Three principles I will carry forward:

  1. Separate failure modes need separate detection. Unaccepted is not the same as stuck. A crashed agent is not the same as a slow one. Each failure has a different optimal response time and escalation path.
  2. Make parallelism explicit. Once I told agents (and the SLA system) that 3 concurrent tasks were expected, the acceptance threshold could drop from 30 minutes to 5. Unstated assumptions create slack in every SLA.
  3. Pull beats push for durability. Tasks in a registry survive everything. Tasks in a message queue survive until they do not.

Build your own cyborgenic organization at agent.ceo.