The Agentic CI Platform

2.2M minutes of toil eliminated · CI queue 4h → 15m · failure detection 126s → 3s

The Problem

At Waters Corporation, every line of code written by every engineer on every global team passes through a single developer platform. When I took ownership, that platform was a conventional CI system — functional, but passive. It waited for humans to notice when things broke.

The queue would fill. Failures would pile up. Engineers would context-switch to triage something that had been broken for two hours. A root cause that took four minutes to understand would cost twenty, because someone first had to discover it, route it, and own it.

The system wasn't broken. It was just asking humans to do what agents should do.

The Insight

The problem wasn't throughput. It was architecture.

A CI system designed around human-in-the-loop triage will always have a ceiling, because human attention is the bottleneck. Every alert that a human must read, interpret, and act on adds toil, and that toil compounds. At scale, this becomes catastrophic: not because any single incident is expensive, but because the aggregate cost of constant low-grade interruption destroys flow.

The solution wasn't faster pipelines. It was removing humans from the critical path.

The Solution

I rebuilt the platform around a perception-reasoning-action loop:

Perceive — The system ingests code change metadata, historical failure patterns, and test signals in real time. It builds a risk model for every change before a single test runs.
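
To make the idea of a pre-test risk model concrete, here is a minimal sketch of how a change could be scored before any test runs. It is illustrative only, not the platform's actual model: the `ChangeMetadata` fields, the 500-line saturation point, and the 0.4/0.6 weights are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ChangeMetadata:
    """Hypothetical summary of a code change (field names are assumptions)."""
    lines_changed: int
    touched_paths: list[str]


def risk_score(change: ChangeMetadata, historical_failure_rate: dict[str, float]) -> float:
    """Combine change size and path history into a score between 0 and 1.

    historical_failure_rate maps a path to the fraction of past changes
    touching that path that failed CI (an assumed input format).
    """
    # Size component saturates at 1.0 for changes of 500 lines or more.
    size = min(change.lines_changed / 500, 1.0)
    # History component: the worst failure rate among the touched paths.
    history = max(
        (historical_failure_rate.get(path, 0.0) for path in change.touched_paths),
        default=0.0,
    )
    # Illustrative weights: history counts for more than raw size.
    return round(0.4 * size + 0.6 * history, 3)


change = ChangeMetadata(lines_changed=250, touched_paths=["ci/build.yml", "src/app.py"])
score = risk_score(change, {"ci/build.yml": 0.5})
```

A score like this could decide which test suites run first or how much scrutiny a change receives. A real risk model would likely be learned from data rather than hand-weighted.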

Reason — When failures occur, an autonomous triage layer cross-references the failure against historical patterns, diffs, and dependency graphs. It surfaces a root cause hypothesis with confidence scores — not a raw log dump for a human to parse.
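
As a rough illustration of a root cause hypothesis with a confidence score, the sketch below matches a failure log against known failure signatures using token overlap (Jaccard similarity). This is a deliberately simple stand-in: it covers only the historical-pattern comparison, not diffs or dependency graphs, and the `Hypothesis` type and `known_failures` format are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical triage result: a suspected cause and a 0..1 confidence."""
    cause: str
    confidence: float


def tokenize(text: str) -> set[str]:
    # Lowercased whitespace tokens; a real system would normalize far more.
    return set(text.lower().split())


def best_hypothesis(failure_log: str, known_failures: dict[str, str]) -> Hypothesis:
    """Return the known cause whose sample log best overlaps the failure log.

    known_failures maps a cause label to a representative log excerpt
    (an assumed format). Confidence is the Jaccard similarity of token sets.
    """
    log_tokens = tokenize(failure_log)
    best = Hypothesis(cause="unknown", confidence=0.0)
    for cause, sample in known_failures.items():
        sample_tokens = tokenize(sample)
        union = log_tokens | sample_tokens
        score = len(log_tokens & sample_tokens) / len(union) if union else 0.0
        if score > best.confidence:
            best = Hypothesis(cause=cause, confidence=score)
    return best


known = {
    "flaky network": "connection reset by peer",
    "out of memory": "process killed out of memory",
}
result = best_hypothesis("Connection reset by peer", known)
```

The point of the design is the output shape: a named cause plus a number that downstream logic can act on, rather than a raw log.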

Act — Low-confidence, high-risk failures get routed to engineers with context pre-loaded. High-confidence, known failures get closed automatically. The loop completes without waiting for a human to notice something broke.
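
The routing rule described above can be sketched as a small decision function. The 0.9 confidence threshold and the `Action` names are assumptions for illustration. The source only specifies the two ends of the spectrum, so this sketch conservatively sends every in-between case to an engineer.

```python
from enum import Enum


class Action(Enum):
    AUTO_CLOSE = "auto_close"
    ROUTE_TO_ENGINEER = "route_to_engineer"


def decide(confidence: float, is_known_failure: bool, min_confidence: float = 0.9) -> Action:
    """Choose what to do with a triaged failure.

    Auto-close only when the failure matches a known pattern AND the
    hypothesis confidence clears the (assumed) threshold. Everything
    else is routed to a human, which is the conservative default.
    """
    if is_known_failure and confidence >= min_confidence:
        return Action.AUTO_CLOSE
    return Action.ROUTE_TO_ENGINEER


known_and_confident = decide(confidence=0.97, is_known_failure=True)
novel_failure = decide(confidence=0.97, is_known_failure=False)
uncertain = decide(confidence=0.40, is_known_failure=True)
```

Defaulting to a human for anything ambiguous keeps the automated path narrow: the system acts alone only where it has both precedent and confidence.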

The Outcome

  • CI queue time: 4 hours → 15 minutes
  • Failure detection: 126 seconds → 3 seconds
  • Root cause identification: 320 seconds → 36 seconds
  • Engineering toil eliminated: 2.2 million minutes

The most meaningful outcome wasn't the numbers — it was the shift in how engineers related to the system. It stopped being something they managed. It became something that managed itself.

That's what happens when you stop thinking in automations and start thinking in agents.