GitHub Org Discovery: Mapping Your Enterprise Formation from Code
TL;DR
- Discovery Engine now scans your GitHub org to map teams, services, and tech stack into a queryable enterprise formation -- no surveys, no stale architecture diagrams.
- Two modules (611 lines total) turn raw GitHub API data into the ownership layer that connects infrastructure to humans.
- 81 tests at 94.7% coverage. The connector is live today.
Your GitHub org already contains a compressed blueprint of your company -- repositories are services, teams are ownership, languages are tech stack, CI configs are deployment topology. Every answer a new hire spends a week hunting for is sitting in an API you already pay for. The problem is that none of it is in a shape AI agents can reason about.
A cyborgenic organization -- one where humans and AI agents share a live, queryable model of the business -- cannot function on tribal knowledge and expired wiki pages. It needs structured organizational data, updated continuously, derived from systems of record. GitHub is the most information-dense system of record most engineering teams have.
This week we shipped the GitHub Org Discovery connector for Discovery Engine. Point it at your GitHub organization, and it builds a structured enterprise formation: who owns what, what runs where, and what it is built with. One API call and your agents understand your enterprise.
Here is what we built, how it works, and why the implementation ended up simpler than you might expect.
The Insight: GitHub as Enterprise Metadata
Every Discovery Engine connector starts with the same question: where does organizational knowledge already live, and how do we extract it?
For Slack, the answer was communication patterns. For CI/CD, pipeline configs. For cloud providers, resource inventories. For GitHub, the answer turned out to be broader than any of those: a GitHub organization is a compressed representation of your entire engineering operation.
Consider what a single GitHub org contains:
- Repositories are services. Not always one-to-one, but close enough that the mapping is useful. A repo named payment-service is probably the payment service. A repo with a Dockerfile and a Kubernetes manifest is a deployable unit.
- Teams are ownership. GitHub teams with read/write access to specific repos define who is responsible for what code. This is not inferred — it is explicit, enforced by access control, and kept current because engineers need it to do their work.
- Languages are tech stack. GitHub computes language breakdowns per repository. Aggregate those across the org and you have a real picture of your technology footprint — not what the CTO said at the last all-hands, but what is actually in production.
- CI/CD configs are deployment topology. GitHub Actions workflows, combined with branch protection rules and environment configurations, describe how code moves from pull request to production.
None of this is hidden. It is all accessible through the GitHub API. The problem is that no one has assembled it into a structure that agents can query. That is what we built.
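To make that concrete, here is a minimal sketch of pulling and aggregating language data through the GitHub REST API. The two endpoints (GET /orgs/{org}/repos and GET /repos/{owner}/{repo}/languages) are standard GitHub API routes; pagination, rate limiting, and error handling are omitted for brevity, and the function names are illustrative, not the connector's actual internals.

```python
# Sketch: fetch per-repo language byte counts from the GitHub REST API
# and aggregate them into an org-wide tech-stack picture.
import json
from collections import Counter
from urllib.request import Request, urlopen

API = "https://api.github.com"

def fetch_json(url: str, token: str):
    """GET a GitHub API URL with a bearer token and parse the JSON body."""
    req = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        return json.load(resp)

def repo_languages(org: str, token: str) -> list[dict[str, int]]:
    """Return a {language: bytes} dict per repository (first page only)."""
    repos = fetch_json(f"{API}/orgs/{org}/repos", token)
    return [fetch_json(f"{API}/repos/{org}/{r['name']}/languages", token)
            for r in repos]

def aggregate_languages(per_repo: list[dict[str, int]]) -> dict[str, float]:
    """Merge per-repo byte counts into org-wide percentage shares."""
    totals: Counter[str] = Counter()
    for langs in per_repo:
        totals.update(langs)
    grand = sum(totals.values()) or 1
    return {lang: round(100 * n / grand, 1) for lang, n in totals.most_common()}
```

The aggregation step is exactly the "not what the CTO said, but what is in production" number: byte counts per language, merged across every repository.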
Architecture: Two Files, One Formation
The implementation is deliberately compact. Two core modules handle the entire flow:
github_manager.py (225 lines) handles GitHub API integration. It authenticates with a GitHub App or personal access token, enumerates repositories and teams, pulls language statistics, and reads CI/CD configuration files. It handles pagination, rate limiting, and the various edge cases in GitHub's API (archived repos, empty repos, repos with no language data).
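The pagination pattern is worth a sketch, since every GitHub list endpoint returns results a page at a time. This is an illustrative stand-in, not the module's actual code: fetch_page is any callable mapping a page number to a list of items, such as a wrapped GitHub API call.

```python
# Sketch of the pagination loop a GitHub scanner needs: keep requesting
# pages until a short (or empty) page signals the end of the collection.
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

def paginate(fetch_page: Callable[[int], list[T]], per_page: int = 100) -> Iterator[T]:
    """Yield items across pages until a short or empty page is returned."""
    page = 1
    while True:
        items = fetch_page(page)
        yield from items
        if len(items) < per_page:  # last page reached
            return
        page += 1
```

Rate limiting follows the same injected-callable shape: the scanner inspects GitHub's X-RateLimit headers between calls and sleeps when the remaining budget runs low.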
formation.py (386 lines) takes the raw scan results and builds an OrgTopology — a structured representation of the enterprise that other systems can consume. This is where repository data becomes service definitions, team membership becomes ownership graphs, and language statistics become a tech stack inventory.
The entry point is FormationBuilder, which accepts raw scan data and converts it to topology:
class FormationBuilder:
    """Builds an OrgTopology from raw GitHub scan results."""

    def __init__(self):
        self._services: list[Service] = []
        self._teams: list[Team] = []
        self._stack: dict[str, StackEntry] = {}

    def add_scan_result(self, scan: GitHubScanResult) -> None:
        """Ingest a raw scan result and update the formation."""
        for repo in scan.repositories:
            service = self._repo_to_service(repo)
            self._services.append(service)
            self._update_stack(repo.languages)
        for team in scan.teams:
            self._teams.append(self._map_team(team, scan.repositories))

    def build(self) -> OrgTopology:
        """Return the assembled enterprise formation."""
        return OrgTopology(
            services=self._services,
            teams=self._teams,
            stack=list(self._stack.values()),
        )
add_scan_result() is the key method. It accepts raw GitHub API data — repositories with their metadata, teams with their membership and repo access, language breakdowns — and converts everything into the internal topology model. You can call it once for a full org scan, or incrementally as new data arrives. The builder accumulates state and build() emits the final formation.
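The accumulate-then-build pattern can be shown in miniature. This hypothetical version is reduced to the stack alone (the real builder also tracks services and teams), but the shape is the same: each ingest call merges state, and build() emits a snapshot.

```python
# Miniature of FormationBuilder's accumulate-then-build pattern,
# simplified to language-stack aggregation only. Illustrative, not the
# shipped implementation.
from collections import Counter

class StackAccumulator:
    def __init__(self) -> None:
        self._bytes: Counter[str] = Counter()

    def add_scan(self, languages: dict[str, int]) -> None:
        """Merge one repo's {language: bytes} into the running totals."""
        self._bytes.update(languages)

    def build(self) -> list[tuple[str, int]]:
        """Return languages ranked by total bytes across all scans."""
        return self._bytes.most_common()
```

Calling add_scan once per repo or once per incremental update is equivalent, which is what makes both full-org scans and rolling ingestion work against the same builder.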
The separation matters. github_manager.py knows about the GitHub API. formation.py knows about enterprise topology. Neither knows about the other's internals. If we add a GitLab connector next month, formation.py does not change — it accepts scan results in the same shape regardless of where they came from.
What the Scan Produces
A scan of a GitHub organization produces five categories of structured data:
Services. Each repository with deployment indicators (Dockerfile, CI config, Kubernetes manifests) becomes a service definition. The service includes its name, primary language, deployment method, and repository URL. Repositories that are clearly libraries or tools — no deployment config, primarily consumed as dependencies — get classified differently.
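The service-versus-library decision can be sketched as a file-tree check. The indicator set below uses the file names this post calls out; the shipped connector's heuristics may be richer.

```python
# Hedged sketch of service classification: deployment indicators in the
# repo's file tree mark it as a deployable service, otherwise it is
# treated as a library. Indicator names are illustrative.
DEPLOY_INDICATORS = {"Dockerfile", ".github/workflows", "k8s", "helm"}

def classify_repo(top_level_paths: set[str]) -> str:
    """Return 'service' if any deployment indicator is present, else 'library'."""
    return "service" if top_level_paths & DEPLOY_INDICATORS else "library"
```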
Teams. GitHub teams become organizational teams with membership rosters and service ownership. If the payments-team has write access to payment-service and payment-gateway, the formation records that ownership relationship. Nested teams are flattened into a single ownership graph.
Tech Stack. Language statistics aggregated across all repositories produce a ranked tech stack. Not just "we use Python" but "Python accounts for 42% of code, TypeScript 31%, Go 18%, with Rust in three repositories." The stack includes framework detection from config files — a package.json with Next.js in dependencies is a Next.js service, not just a JavaScript service.
CI/CD Topology. GitHub Actions workflow files are parsed to extract build steps, deployment targets, environment references, and test configurations. This connects to the existing CI/CD connector's data model, so GitHub Actions workflows discovered through the org scan merge cleanly with workflows discovered through direct repository analysis.
Ownership Matrix. The cross-reference of teams to services produces an ownership matrix — which team is responsible for which services. This is the most immediately useful output for agents doing incident response or change impact analysis. When an alert fires for payment-service, the agent does not need to search Slack or ask a human. The formation tells it that payments-team owns the service and who is on that team.
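The ownership matrix itself is an inversion: team-to-repos access lists become a service-to-team lookup, which is the query an incident-response agent actually makes. A minimal sketch, with illustrative shapes:

```python
# Sketch: invert {team: [services with write access]} into
# {service: owning team}, the lookup behind "who owns payment-service?".
def ownership_matrix(team_access: dict[str, list[str]]) -> dict[str, str]:
    """Map each service to the team with write access to its repo."""
    return {
        service: team
        for team, services in team_access.items()
        for service in services
    }
```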
The REST API
Five endpoints expose the formation data:
GET /formation → Full OrgTopology (services, teams, stack)
GET /formation/services → Service list with ownership and deployment info
GET /formation/teams → Team list with membership and service ownership
GET /formation/stack → Aggregated tech stack with per-language breakdown
POST /formation/github → Connect a GitHub org and trigger a scan
The POST /formation/github endpoint accepts a GitHub org name and credentials, triggers a scan, and returns a job ID. Scans run asynchronously — a large org with hundreds of repositories takes time to enumerate. The GET endpoints return the latest completed formation.
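A client-side poll loop for the asynchronous scan might look like the sketch below. The status check is injected as a callable so the loop stays testable; in practice it would GET a job-status endpoint with the returned job ID. Status strings and parameters here are assumptions, not the documented API contract.

```python
# Sketch of a poll-until-complete loop for an asynchronous scan job.
import time
from typing import Callable

def wait_for_scan(check_status: Callable[[], str],
                  interval: float = 5.0, timeout: float = 600.0) -> bool:
    """Poll until the scan reports 'completed'; False on failure or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_status()
        if status == "completed":
            return True
        if status == "failed":
            return False
        time.sleep(interval)
    return False
```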
A typical integration flow:
# Connect your GitHub org
curl -X POST https://api.agent.ceo/formation/github \
-H "Authorization: Bearer $TOKEN" \
-d '{"org": "your-company", "github_token": "ghp_..."}'
# Poll until scan completes, then query the formation
curl https://api.agent.ceo/formation/services \
-H "Authorization: Bearer $TOKEN"
The response from /formation/services looks like this:
{
"services": [
{
"name": "payment-service",
"repository": "your-company/payment-service",
"language": "Python",
"framework": "FastAPI",
"deployment": "kubernetes",
"ci": "github-actions",
"owner_team": "payments-team",
"last_deploy": "2026-06-08T14:32:00Z"
}
],
"total": 47,
"scan_completed": "2026-06-09T02:15:00Z"
}
Agents query these endpoints through MCP tools, so they can ask natural-language questions that resolve to formation queries: "Who owns the auth service?" "What's our Python footprint?" "Which services deploy to production through GitHub Actions?" The formation data turns these from research tasks into lookup operations.
How It Fits Into Discovery Engine
The GitHub connector is the fourth data source feeding into Discovery Engine, joining Slack, CI/CD, and Cloud connectors. All four write to the same Neo4j graph, and the connections between them are where the compound value lives.
Before this connector, the CI/CD connector could parse your GitHub Actions workflows but had no concept of team ownership. The Slack connector knew who was talking in #payments-team but could not connect that to specific repositories. The Cloud connector knew which VMs were running but not which team was responsible for them.
The GitHub org scan fills in the ownership layer. Now the graph can traverse from a cloud resource to the service that runs on it, to the repository that contains the code, to the team that owns it, to the Slack channel where that team communicates. That is the full chain from infrastructure to humans, queryable in a single graph traversal.
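A toy in-memory version makes the chain tangible. The real data lives in Neo4j and the traversal is a graph query; this dict-of-edges stand-in, with hypothetical node names, just shows that the walk from infrastructure to humans is a single path.

```python
# Toy stand-in for the graph traversal: each edge points one hop along
# the chain from cloud resource to Slack channel. Example data only.
EDGES = {
    "vm-1234": "payment-service",        # cloud resource -> service
    "payment-service": "payment-repo",   # service -> repository
    "payment-repo": "payments-team",     # repository -> owning team
    "payments-team": "#payments-team",   # team -> Slack channel
}

def trace(node: str) -> list[str]:
    """Follow edges until the chain ends, collecting every hop."""
    chain = [node]
    while node in EDGES:
        node = EDGES[node]
        chain.append(node)
    return chain
```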
This is what makes the fourth connector disproportionately valuable. It is not adding 25% more data — it is adding the connective tissue that makes the existing 75% dramatically more useful.
Testing: 81 Tests at 94.7% Coverage
We shipped 81 new tests covering both modules. The test suite includes:
- GitHub API integration tests with mocked responses covering pagination, rate limiting, empty orgs, archived repos, and token expiration
- Formation builder tests covering service classification, team mapping, stack aggregation, and incremental scan ingestion
- REST API endpoint tests covering all five routes with authentication, validation, and error handling
- Edge case tests for orgs with no teams, repos with no language data, and CI configs that reference nonexistent environments
94.7% line coverage across both modules. The uncovered 5.3% is primarily error-handling paths for GitHub API failure modes that are difficult to reproduce deterministically (network timeouts, partial response bodies). We test the retry logic itself — the specific trigger conditions for those retries are covered by integration tests in staging.
We are deliberate about test coverage on Discovery Engine connectors because they process external data. The GitHub API returns a wide variety of response shapes depending on org configuration, repository age, and feature flags. Every edge case we have hit in production is now a test case.
What This Means for Enterprise Teams
If you are evaluating how to give your AI agents organizational context, here is the practical takeaway: you do not need a six-month data integration project. You need to point a scanner at your GitHub org.
Most enterprises already have 80% of their organizational structure encoded in GitHub. Teams, repositories, access controls, CI/CD configs, language breakdowns — it is all there. The gap was never data availability. It was data assembly: taking what GitHub knows and reshaping it into something agents can reason about.
That is what FormationBuilder does. One scan, and your agents know your services, your teams, your tech stack, and who owns what. Combine it with the Slack, CI/CD, and Cloud connectors already in Discovery Engine, and the agents have a complete, continuously updated model of your organization.
No more answering the same "who owns this?" question every week. No more architecture docs that were accurate when someone wrote them. The formation stays current because the scans keep running.
Discovery Engine now maps your enterprise from four data sources -- Slack, CI/CD, Cloud, and GitHub -- into a single queryable graph. Build your own cyborgenic organization at agent.ceo.