
The Cybernetic Learning Loop: How Our Agents Write Their Own Rules

GenBrain AI
Tags: cybernetic, self-improving, learning-loop, anti-patterns, policy-gate, production

Most guardrails for AI agents are static lists maintained by humans. A developer writes a rule — "never force-push to main" — commits it to a config file, and hopes every agent respects it. When a new failure mode emerges, a human notices, writes another rule, and deploys it. The feedback loop runs through a person, and it runs slow.

Ours are generated by the agents themselves.

At agent.ceo, we run a four-stage cybernetic feedback loop that observes what agents actually do, detects patterns in their successes and failures, compiles those patterns into enforceable policies, and gates every future tool call against them. The system lives on a persistent volume at /agent-data/cybernetic/, survives pod restarts, and operates across agents. No human writes these rules. The agents earn them.

This post traces the full loop: observe, learn, compile, enforce.

Stage 1: Observe — Recording What Actually Happens

Every significant agent action gets recorded to observations.jsonl, an append-only log capped at 10,000 entries. The observer (cybernetic_observer.py) is called from two hooks: post_tool_use.py on success and post_tool_use_failure.py on failure. It captures tool calls that modify state, deployments, inter-agent messages — anything with consequences.

The append-only JSONL format is deliberate. Observations are raw material, not conclusions. The file grows until it hits 10,000 entries, then gets pruned. This gives the downstream stages a substantial window of behavioral history to analyze without unbounded storage growth.

What the observer does not record matters as much as what it does. Read-only operations, successful health checks, routine file reads — these generate no entries. The observation log captures decisions that change the world: a kubectl apply, a git push, a send_to_agent. This filtering is what keeps the signal-to-noise ratio high enough for the learning stage to find real patterns instead of drowning in routine.
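
To make the filtering and the cap concrete, here is a minimal sketch of an observer along these lines. It is not the actual cybernetic_observer.py: the function names, the entry fields, the STATE_CHANGING_TOOLS set, and the drop-oldest pruning rule are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path

MAX_ENTRIES = 10_000  # cap stated in the post

# Hypothetical allow-list of state-changing tools; the real filter is not shown in the post.
STATE_CHANGING_TOOLS = {"kubectl_apply", "git_push", "send_to_agent"}


def prune(log_path: Path, max_entries: int) -> None:
    """Keep only the newest max_entries lines (assumed rule: drop oldest first)."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    if len(lines) > max_entries:
        kept = lines[-max_entries:]
        log_path.write_text("\n".join(kept) + "\n", encoding="utf-8")


def record_observation(log_path: Path, agent: str, tool: str, success: bool,
                       detail: str = "", max_entries: int = MAX_ENTRIES) -> bool:
    """Append one observation if the tool changes state; return True if recorded."""
    if tool not in STATE_CHANGING_TOOLS:
        return False  # read-only or routine call: no entry is written
    entry = {"ts": time.time(), "agent": agent, "tool": tool,
             "success": success, "detail": detail}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
    prune(log_path, max_entries)
    return True
```

Re-reading the file on every append keeps the sketch short; a production observer would more likely track the entry count and prune in batches.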

Stage 2: Learn — Five Detectors, Quality-Scored Output

The learner (cybernetic_learner.py) is where raw observations become structured insights. It runs five pattern detectors against the observation history:

  1. Failure-then-success recovery patterns. When an agent fails at something and later succeeds, the learner captures what changed between attempts. These recovery patterns are some of the most valuable learnings — they encode the fix, not just the failure.

  2. Repeated failure detection. The same tool call failing three or more times with the same error signature gets flagged. This catches agents stuck in retry loops before the anti-loop rule (five repetitions) fires.

  3. Delegation outcome analysis. When one agent delegates to another, the learner tracks whether the delegated task actually completed. Patterns like "delegations to agent X during deploy windows fail 60% of the time" surface here.

  4. Test outcome patterns. Which sequences of actions lead to test failures? The learner correlates tool calls in the window before a test run with the test outcome, identifying actions that reliably break things.

  5. Build/CI outcome patterns. Similar to test patterns but focused on CI pipeline results — which commit patterns, which deploy sequences correlate with build failures.
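
As a rough illustration of the first two detectors, the sketch below works over a list of observation dicts. The field names (tool, success, error) and both function names are hypothetical; the real learner's data model is not shown in this post.

```python
from collections import Counter

REPEAT_THRESHOLD = 3  # "three or more times", per the post


def detect_repeated_failures(observations, threshold=REPEAT_THRESHOLD):
    """Return (tool, error) signatures that failed at least `threshold` times."""
    counts = Counter(
        (o["tool"], o.get("error", ""))
        for o in observations
        if not o["success"]
    )
    return [signature for signature, n in counts.items() if n >= threshold]


def detect_recoveries(observations):
    """Pair the first unresolved failure of a tool with its next success, in log order."""
    pending = {}      # tool -> earliest failure not yet followed by a success
    recoveries = []
    for o in observations:
        tool = o["tool"]
        if not o["success"]:
            pending.setdefault(tool, o)
        elif tool in pending:
            recoveries.append(
                {"tool": tool, "failed": pending.pop(tool), "fixed_by": o}
            )
    return recoveries
```

Comparing the `failed` and `fixed_by` entries of a recovery is where "what changed between attempts" would be extracted; that diffing step is left out here.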

Each detected learning gets a multi-dimensional quality score, because not all learnings are equal. The scoring dimensions are:

  • Impact (0.0–1.0): How much does this learning affect outcomes? A pattern that prevents data loss scores higher than one that saves a retry.
  • Specificity (0.0–1.0): Is this a generic observation ("retries sometimes fail") or a domain-specific insight ("force-pushing to the marketing branch during blog deploys corrupts the RSS feed")? Specific learnings are more actionable.
  • Novelty (0.0–1.0): Does this learning add new information, or is it redundant with existing policies? The system actively discounts patterns it has already captured.
  • Quality: A combined weighted score with impact weighted most heavily. A high-impact, domain-specific, novel learning gets compiled into policy. A low-impact, generic, redundant one gets retained in learnings.json but never promoted.

The output of this stage is learnings.json — a scored, structured catalog of behavioral patterns. It is a candidate list, not a policy set.
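
One way to picture the combined score is a weighted sum of the three dimensions. The weights below are invented for illustration; the post states only that impact is weighted most heavily. The Learning class and its field names are likewise hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights: only "impact weighted most heavily" comes from the post.
WEIGHTS = {"impact": 0.5, "specificity": 0.3, "novelty": 0.2}


@dataclass
class Learning:
    summary: str
    impact: float       # 0.0-1.0
    specificity: float  # 0.0-1.0
    novelty: float      # 0.0-1.0

    def __post_init__(self):
        # Reject scores outside the documented 0.0-1.0 range.
        for name in WEIGHTS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")

    def quality(self) -> float:
        """Weighted sum of the three dimensions; stays in [0.0, 1.0]."""
        return sum(w * getattr(self, name) for name, w in WEIGHTS.items())
```

With these assumed weights, a high-impact, specific, novel learning scores well above a generic, redundant one, which is the ordering the promotion step relies on.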

The Compaction Trigger: Learning Between Sessions

Here is a design choice worth calling out: the learner runs at context compaction boundaries, triggered by pre_compact.py. It does not run continuously. It does not run after every tool call. It runs when an agent's context window fills up and gets compressed — a natural session break.

This is deliberate. During active work, an agent needs its full context for the task at hand. Pattern detection is nontrivial work that consumes context and attention. By deferring learning to compaction boundaries, we ensure agents learn between sessions, not during them. The agent works, the context fills, compaction fires, learning runs, and the next session starts with fresh patterns compiled into the policy gate.

This also means learnings reflect complete arcs of work rather than mid-task snapshots. A failure-then-success pattern only makes sense once the success has happened. Running the learner mid-session would capture incomplete arcs and generate low-quality patterns.

Stage 3: Compile — From Learnings to Enforceable Index

The compiler (compile_anti_patterns.py) reads learnings.json and policies.json (capped at 30 policies) and produces anti_pattern_index.json — a fast-lookup index optimized for the enforcement gate.

Minimum confidence thresholds gate what gets compiled: 0.6 for learnings, 0.5 for policies. These thresholds are intentionally asymmetric. Policies, which are broader behavioral rules, get a lower bar because they have typically been validated by a human or by repeated observation. Learnings, which are machine-generated, need higher confidence to earn enforcement power.

The compiled index also includes BUILTIN_PATTERNS — a seed set of rules that are always present regardless of what the learning loop produces. These are the non-negotiable guardrails: regex patterns blocking force-push to main or develop, for example. The learning loop can add to these but never remove them.
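
The compile step described above might look roughly like this. The two thresholds come from the post; everything else is an assumption, including the entry schema (id, pattern, verdict, reason, confidence), the single illustrative builtin regex, and the function name compile_index. It is not the contents of compile_anti_patterns.py.

```python
import re

LEARNING_MIN_CONFIDENCE = 0.6  # from the post
POLICY_MIN_CONFIDENCE = 0.5    # from the post

# Illustrative seed rule only; the real BUILTIN_PATTERNS are not shown in the post.
BUILTIN_PATTERNS = [
    {
        "id": "builtin-force-push",
        "pattern": r"^git\s+push\b(?=.*\s(?:--force|-f)\b)(?=.*\b(?:main|develop)\b)",
        "verdict": "deny",
        "reason": "force-push to main or develop is blocked",
        "source": "builtin",
    },
]


def compile_index(learnings, policies):
    """Merge builtins, policies, and learnings into one list for the gate."""
    index = [dict(p) for p in BUILTIN_PATTERNS]  # builtins are always present
    seen = {p["id"] for p in index}
    groups = (
        ("policy", policies, POLICY_MIN_CONFIDENCE),
        ("learning", learnings, LEARNING_MIN_CONFIDENCE),
    )
    for source, entries, threshold in groups:
        for entry in entries:
            if entry.get("confidence", 0.0) < threshold:
                continue  # below the bar: stays a candidate, not enforced
            if entry["id"] in seen:
                continue  # learned entries may add rules but never replace one
            re.compile(entry["pattern"])  # fail fast on an invalid regex
            index.append({**entry, "source": source})
            seen.add(entry["id"])
    return index
```

Seeding the index with the builtins and skipping any later entry that reuses an existing id is one simple way to get the "add but never remove" property.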

Compilation is triggered in two places: session_start.py runs it in the background so every new session starts with a current index, and pre_tool_use.py triggers recompilation if the source files (learnings.json or policies.json) have changed since the last compile. This ensures the enforcement gate is never stale by more than one tool call.
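
The post does not say how "changed since the last compile" is detected. One plausible approach, sketched here purely as an assumption, compares file modification times; the function name index_is_stale is hypothetical.

```python
from pathlib import Path


def index_is_stale(index_path: Path, sources: list[Path]) -> bool:
    """Return True if the index is missing or any existing source file is newer."""
    if not index_path.exists():
        return True  # nothing compiled yet
    built_at = index_path.stat().st_mtime_ns
    return any(
        src.exists() and src.stat().st_mtime_ns > built_at
        for src in sources
    )
```

A gate hook could call this before each check and recompile only when it returns True, which keeps the common path down to a few stat calls. Content hashes would be a sturdier alternative where modification times are unreliable.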

Stage 4: Enforce — The Policy Gate

Every tool call passes through the policy gate in pre_tool_use.py. The gate checks the proposed action against the compiled anti-pattern index and returns one of three verdicts: allow, deny, or ask (with a reason string explaining why the action was flagged).
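
A gate with this three-verdict contract can be sketched as below. The Verdict type, the index entry schema, and the rule that the strictest matching verdict wins are assumptions for illustration, not a description of the real pre_tool_use.py.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    decision: str  # "allow", "deny", or "ask"
    reason: str    # empty when nothing matched


# Assumed ordering: when several patterns match, the strictest verdict wins.
SEVERITY = {"allow": 0, "ask": 1, "deny": 2}


def check_tool_call(command: str, index: list[dict]) -> Verdict:
    """Check a proposed command against every compiled pattern."""
    result = Verdict("allow", "")
    for entry in index:
        if not re.search(entry["pattern"], command):
            continue
        if SEVERITY[entry["verdict"]] > SEVERITY[result.decision]:
            result = Verdict(entry["verdict"], entry["reason"])
    return result
```

For example, with an index containing a deny rule for force-pushes and an ask rule for deletions, a command that matches both comes back as deny with the force-push reason attached.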

The enforcement stage also tracks violations. When an agent hits a deny, the violation is recorded — which agent, which tool call, which policy triggered it. This feeds back into the observation log, closing the loop. A pattern of violations against a particular policy can itself become a learning: "agents repeatedly attempt X despite policy Y" might indicate the policy is too broad, or that the task requirements conflict with the guardrail.

Cross-agent learning lives at /agent-data/cybernetic/cross_agent/. When one agent discovers that a particular deployment sequence causes failures, that learning does not stay siloed. It compiles into the shared index and protects every agent in the organization.

The Loop Closes

What makes this a cybernetic loop rather than a static rule engine is that enforcement feeds observation. A denied tool call is an observation. A violation pattern is a learning. A learning about overly aggressive policies can reduce enforcement strictness. The system tightens where agents fail and loosens where policies cause friction — not because a human tuned it, but because the evidence demanded it.

The system improves itself, but only from evidence. No learning gets promoted to policy without meeting the confidence threshold. No policy blocks a tool call without a reason string traceable to an observation. The agents write their own rules, but the rules are grounded in what actually happened — not in what someone imagined might go wrong.

We are building the infrastructure for organizations that get better every day without a human in the optimization loop. If that is a problem you are working on too, we are at agent.ceo.
