Developer Toil Is a Product Problem

Every quarter, engineering organizations spend thousands of hours on work that produces no value.

I don't mean meetings or process overhead — though those compound too. I mean the raw mechanical labor of software delivery: triaging a failed build, waiting for a queue to clear, re-running a flaky test, writing a post-mortem for an incident that will happen again next week.

This is toil. And most companies treat it as a fixed cost.

The Headcount Trap

The standard response to developer toil is more people. Hire another SRE. Hire a platform engineer. Build a team to manage the CI system. Scale headcount with the complexity of the system.

This feels rational. It isn't.

What you've actually done is hire humans to do work that scales linearly with code volume, in a system whose complexity compounds. Every engineer you add to manage toil is an engineer who isn't building product. And the toil doesn't go away; it just gets distributed.

The correct response to toil isn't headcount. It's to stop generating the toil.

What Agents Change

The reason toil persisted so long as a headcount problem is that the alternative, automation, required too much human investment to maintain. You'd build a script to auto-triage failures, then the failure patterns would change, and someone would have to update the script. You had replaced one form of toil with another.

Agents change this.

An agent that understands failure patterns doesn't need to be updated when those patterns change. It updates itself. An agent that reasons about risk doesn't need a new rule when a new risk emerges. It applies its existing understanding to the new context.

This is not the same as automation. Automation encodes what you know. Agents extend beyond what you knew when you built them.
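To make "automation encodes what you know" concrete, here is a minimal sketch of the kind of auto-triage script described earlier. The rule patterns and labels are hypothetical, invented for illustration rather than taken from any real CI system. The point is only that every category the script can recognize had to be written down by a person in advance.

```python
import re

# Hypothetical rules: each pattern and label is an illustrative assumption.
# This list is the "what you know" that the script encodes.
RULES = [
    (re.compile(r"Connection (reset|refused)"), "network-flake"),
    (re.compile(r"OutOfMemoryError|Killed"), "resource-limit"),
    (re.compile(r"AssertionError"), "test-failure"),
]


def triage(log_line: str) -> str:
    """Return the label of the first matching rule, or 'unknown'."""
    for pattern, label in RULES:
        if pattern.search(log_line):
            return label
    # Anything the rule authors did not anticipate lands here.
    return "unknown"


if __name__ == "__main__":
    print(triage("java.lang.OutOfMemoryError: Java heap space"))
    print(triage("error: lockfile out of date, run install again"))
```

The second failure falls through to "unknown" and stays there until someone adds a rule for it. That recurring edit is the maintenance toil the script was supposed to remove.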

The Product Framing

Here is what I've found to be true: developer toil is always a symptom of a product that was designed for a human-in-the-loop world.

The CI system that requires human triage was designed that way. The deployment pipeline that requires human approval on routine changes was designed that way. The incident response process that pages a human before gathering context was designed that way.

These are product decisions. And like all product decisions, they can be revisited.

The question I ask about every system I own: what in this system is a human doing that an agent should do?

The answer is almost always more than you'd expect.