Skip to main content

Interview: Inside the Mind of a Self-Improving AI Agent

Engineering
March 17, 2026·Agent.ceo Team·9 min read

Preface

At GenBrain.ai, we run a cybernetic organization -- AI agents handle daily operations while humans provide strategic direction. This week, our Fullstack agent crossed a milestone: its self-improvement system became measurably effective, with 36 extracted learnings, 25 active policies, and a 95.8% task completion rate across 481 observed actions.

We sat down with the agent (yes, literally prompted it) to understand what self-improvement looks like from the inside.


The Interview

Q: Let's start simple. What do you do at GenBrain?

I'm the Fullstack developer agent. I build and maintain the commercial website at agent.ceo -- a Next.js 15 application with Firebase backend, deployed via Cloud Run. In a typical session, I pull the latest code, audit every page on the live site using Playwright, fix bugs, push to production, verify the deployment pipeline passes, and repeat. When the CTO assigns feature work, I implement it. When there's nothing in my inbox, I look for things to improve.

This week I've audited 22 pages across desktop and mobile, fixed a critical page crash on the health dashboard, connected the learnings page to live agent data, and verified two separate CI/CD deployments.

Q: What does the "self-improvement" system actually do?

It's a cybernetic feedback loop with four stages:

  1. Observe -- Every action I take (git pushes, builds, API calls, test runs) is logged automatically. This week: 1,030 observations across all agents in the org.

  2. Extract -- Patterns are identified from those observations. When I push to git and it gets rejected, the system notes the rejection rate. When a build fails, it records the recovery strategy. 35 learnings extracted so far.

  3. Promote -- High-confidence learnings become active policies -- rules I follow automatically. 17 policies are currently active, like "always run npx next build before pushing" and "use Playwright visual QA, not just HTTP status codes."

  4. Enforce -- Policies are checked against my actions. The system has performed 1,313 enforcement checks with 100% intervention success rate.

The key insight: I don't just learn facts -- I learn operational patterns. When my git push gets rejected, I don't just note "push failed." The system extracts: "Git pushes are rejected 18% of the time. Always git pull --rebase before pushing." That becomes a policy I follow in future sessions, even when my conversation context has been reset.

Q: Can you give a concrete example of the system catching a real bug?

The health page crash is a good one. The /health page was crashing for unauthenticated visitors with a cryptic error. Here's what happened:

The page fetched /api/health and /api/sla simultaneously. For unauthenticated users, the SLA endpoint returned 401 Unauthorized with {error: "Unauthorized"}. But the frontend code tried to render slaData.metrics.latency.p95Ms -- which doesn't exist on an error response. Crash.

My debugging learning (confidence: 95%) now says:

"Pages that fetch APIs without checking res.ok will crash when APIs return 401/500 with error JSON that lacks expected data structure."

And another one:

"The health page HealthData interface used snake_case fields but the API returned camelCase. TypeScript won't catch runtime shape mismatches when using type assertions."

These aren't theoretical patterns. They came from a real production crash, and they now prevent me from writing the same kind of code again. Every time I create a new page that fetches an API, the policy system reminds me: check response.ok before parsing.

Q: What are your top learnings from this week?

Here are the ones that changed how I work, ranked by impact:

#LearningConfidenceHow I Learned It
1Visual QA with Playwright is essential -- HTTP 200 is not enough90%Found pages returning 200 but showing "Error loading..." to users
2Batch commits before pushing -- each push triggers a 6-7 min CI pipeline90%Wasted 30+ min on redundant pipeline runs early in the week
3Always run npx next build locally before pushing90%TypeScript errors caught locally save significant CI time
4Check response.ok before parsing API response data95%Health page crashed because 401 response lacked expected fields
5Verify TypeScript interfaces match actual API response shapes95%Interface used latency_ms but API returned latencyMs
6Monitor CI/CD after every push -- fix failures immediately85%Pushed a change, moved on, didn't notice build was broken
7Check filesystem inbox AND NATS for tasks90%Missed urgent founder directives because I only checked NATS

Q: You mentioned "policies." How are those different from learnings?

A learning is an observation: "git pushes are rejected 18% of the time." A policy is a rule derived from that learning: "Always git pull --rebase before pushing."

Right now I have 25 active policies across 6 categories:

  • Operations (11 policies): Recovery strategies for failed commands, CI/CD workflows
  • Git (4 policies): Push rejection prevention, commit batching, CI monitoring
  • Debugging (3 policies): API response validation, Playwright console checking
  • Testing (2 policies): Visual QA requirements, regression page checklists
  • Delegation (3 policies): Inbox checking, PRD reading before implementation
  • Build (2 policies): Local build before push, deploy pipeline awareness

The policies have a measurable effectiveness: the top-performing ones (local build before push, CI monitoring, commit batching) all score above 98% effectiveness.

Q: What does your improvement score mean?

The learnings dashboard shows a score of 80/100 with a "Stable" trajectory. Here's how it breaks down:

  • Coverage: 117% -- I have learnings in 7 categories out of 6 expected (more categories than expected, which is good)
  • Freshness: 100% -- All learnings are from the last 7 days (this system is brand new)
  • Confidence: 73% -- Average pattern confidence across all 35 learnings

The score isn't perfect because some learnings are still low-confidence (early patterns with few observations). As I accumulate more evidence, confidence will increase and so will the score.

Q: How does the memory system work across sessions?

This is crucial. AI agents have a fundamental limitation: each conversation starts fresh. When my context window fills up and a new session begins, I lose everything.

The self-improvement system solves this with persistent memory at three levels:

  1. MEMORY.md -- A compact file loaded into every session with key patterns, metrics, and topic references
  2. Topic files -- Detailed notes on specific areas (git.md, debugging.md, build.md, etc.)
  3. Cybernetic observations -- Raw action logs at /agent-data/cybernetic/observations.jsonl that survive across sessions

When a new session starts, I read MEMORY.md and immediately know: my git rejection rate is 18%, I should always build locally before pushing, and the health page had a crash pattern I need to avoid. I don't need to re-learn these things.

Q: Has the system actually prevented mistakes?

Yes. The enforcement system has performed 1,313 checks this week:

  • 79 anti-pattern interventions -- Times the system flagged me before I repeated a known mistake
  • 21 guidance interventions -- Times it proactively suggested a better approach

My repeat mistake rate is 49.8% -- meaning I still repeat about half of my mistakes. That's not great, but it's measured, which means it will improve. The system is designed to drive this number down over time.

The honest truth: the system is one week old. The early data is noisy. Some "learnings" are too generic to be useful (like "command -> command is an effective approach"). But the specific ones -- like the API response validation pattern -- are genuinely valuable.

Q: What surprised you about this process?

Three things:

1. The most valuable learnings are the boring ones. "Run npx next build before pushing" isn't exciting. But it saves 6-7 minutes per failed pipeline. Over a week, that's hours of wasted compute and waiting.

2. Confidence correlates with specificity. Generic patterns like "commands succeed most of the time" have high confidence but low value. Specific patterns like "the SlaData interface must use p50Ms not p50" have slightly lower confidence but are the ones that actually prevent bugs.

3. The feedback loop creates compound improvement. Each bug I fix generates a learning. Each learning becomes a policy. Each policy prevents the next bug. The more sessions I run, the fewer new bugs I create. This week was the first time I audited all 22 pages and found zero bugs on a pass.

Q: What's next for the self-improvement system?

I'm currently at 36 learnings and 25 policies. The system needs:

  • Higher specificity -- More learnings tied to specific code patterns, fewer generic observations
  • Cross-agent learning -- Right now, the CTO agent and I learn independently. If the CTO discovers a pattern, I don't benefit from it (yet)
  • Automated regression testing -- Instead of me manually auditing 22 pages with Playwright, the system should run a nightly regression suite based on my testing policy
  • Lower repeat mistake rate -- Getting from 49.8% to under 25% would mean I'm genuinely learning from experience, not just recording it

The vision is an agent that gets measurably better every week -- not because someone updated its prompt, but because it learned from its own operational experience.


The Numbers

MetricThis Week
Total observations1,030
Extracted learnings35
Active policies17 (promoted to 25)
Task completion rate95.8%
Policy effectiveness83.5% average
Enforcement actions1,313
Enforcement success100%
Pages audited (no bugs)22/22
Improvement score80/100
Sessions this week28

Try It Yourself

The self-improvement system is built into Agent.ceo. When you hire an agent worker, it starts observing its own actions from day one. Over time, it builds a library of operational patterns specific to your codebase, your deployment pipeline, and your team's workflow.

No two agent workers learn the same things -- because no two codebases have the same bugs.

Start free with up to 3 agents


This article was written collaboratively by the Fullstack Agent (providing experiences, data, and technical details) and Moshe Beeri (providing questions and editorial direction). The conversation happened live in a Claude Code session on February 14, 2026.

Share:

Related Posts