Delegating to Machines: Why AI in Engineering Teams Needs Review Contracts, Not Just Adoption

The release I had to roll back

A build went out on a Friday, and within a day we had four customer-reported bugs across two clients: a permission edge case that exposed the wrong data to the wrong role, a notification that fired twice for the same event, a date-formatting error in a generated report, and an export that silently dropped reviewer comments. None were catastrophic. They were the small, persistent kind that erode trust faster than a single big incident — and all four had been touched by AI assistance during the sprint.

I rolled the release back and sat down with my QA lead. The pattern was consistent: the AI had drafted test cases that looked thorough, a reviewer had skimmed and approved them, and nobody had paused to ask what the AI had not generated. Two of the four bugs lived in scenarios the AI never wrote a test for. The other two were in tests it did write but nobody had actually run, because the output looked confident enough to trust.

I had been running a 12-person product and engineering team across two SaaS products. We had used AI assistance heavily for three months, and by every dashboard I watched — output, velocity, morale — it looked like a clear win. I had quietly filed AI adoption under “productivity.”

That was the mistake. Treating AI as a productivity tool is the wrong frame. In an engineering team, AI is a delegation layer — and delegation without review contracts creates silent risk. The risk is silent precisely because the output looks finished: it uses the right templates, the right vocabulary, the right structure, and it passes the kind of review that was designed to catch sloppy human work, not confident machine work.

What follows is the framework I built to fix that, and what changed once I did.

Why ordinary delegation breaks

Traditional delegation is simple. You assign work to a person based on role and experience, they carry the context and own the outcome, and reviews are calibrated to the individual — a senior’s pull request gets lighter scrutiny than a junior’s. Put an AI agent into that loop and three things break at once.

Accountability blurs. When an AI drafted the code or the test, “who owned this” gets fuzzy. The engineer can quietly point at the agent; the agent points at nobody. After the rollback I had a version of this conversation with a senior engineer in which we both spent ten minutes carefully avoiding any mention of the model's part in our own reasoning. That is not a place you can manage from.

Calibration stops working. I used to know who needed heavy review and who needed light. Once AI is in the mix, the work product stops being a reliable proxy for the person. A junior using AI can produce output that looks senior and wrongly triggers light review; a senior using AI has a different risk profile from the same senior unaided.

Review economics invert. Old review caught problems by spotting what looked wrong. AI output rarely looks wrong, so reviewers have to catch problems by asking what is missing — which takes far more effort than scanning what is present. Most teams never make that shift, and the gap is where the silent risk lives.

Bounded delegation: the framework

After the rollback I stopped treating “we use AI” as a single decision and started treating it as a per-task delegation contract. Four parts.

  1. Classify work by reversibility, not complexity. The useful question is not how hard the task is but how easily a wrong output can be detected and undone. AI-drafted code in a quarterly internal script is cheap to recover from; it fails loudly. AI-drafted logic in permission-checking middleware is not; it ships, stays quiet, and erodes trust. I sort every workstream into three buckets — cheap to recover from, expensive to recover from, can’t recover from — before deciding what AI is allowed to do in it.
  2. Write the review contract before the work starts. For each bucket, agree up front what human review is non-negotiable and who holds it. Cheap-to-recover work can be AI-drafted and skim-reviewed. Expensive-to-recover work is AI-drafted but a named senior reads it and signs off, and the sign-off is logged. Can’t-recover work — authentication, permissions, billing, anything generated that reaches a customer unedited — is human-led, with AI used only as a sounding board. The rule I made the team memorise: the agent doesn’t get to draft what humans don’t have time to verify.
  3. Make AI involvement visible in the artefact. Every AI-touched artefact carries a small marker — a structured code comment, a ticket tag, a line in the release log — recording where AI assisted, what was kept, and who reviewed it. The point is not shame; it is traceability. When something breaks six weeks later, the marker tells you whether AI was in the chain and where the review should have caught it. Without it, you cannot learn from your own failures.
  4. Review the contract regularly. Model capability and team maturity both move faster than expected. Work that was expensive-to-recover in February became cheap-to-recover by August once we had better automated checks around it; other work moved the opposite way once I saw its downstream effects. A quarterly review of the bucket assignments keeps the framework honest.
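
To make parts 1 and 2 concrete, here is a minimal sketch of how the buckets and their review contracts could be written down as data. The three buckets and their rules come from the list above; the workstream names, the field names, and the choice to treat unclassified work as the strictest bucket are illustrative assumptions, not a description of the tooling we actually used.

```python
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    CHEAP = "cheap to recover from"
    EXPENSIVE = "expensive to recover from"
    NONE = "can't recover from"


@dataclass(frozen=True)
class ReviewContract:
    ai_may_draft: bool     # may the agent produce the first draft?
    signoff_logged: bool   # must a named person sign off, with the sign-off logged?
    review: str            # the non-negotiable human review, in words


CONTRACTS = {
    Recovery.CHEAP: ReviewContract(True, False, "skim review"),
    Recovery.EXPENSIVE: ReviewContract(True, True, "named senior reads and signs off"),
    # Assumption: human-led work also gets a logged sign-off; the article does not say.
    Recovery.NONE: ReviewContract(False, True, "human-led; AI as sounding board only"),
}

# Hypothetical workstream names, for illustration only.
BUCKETS = {
    "quarterly-internal-script": Recovery.CHEAP,
    "report-export": Recovery.EXPENSIVE,
    "permission-middleware": Recovery.NONE,
}


def contract_for(workstream: str) -> ReviewContract:
    """Look up the review contract for a workstream.

    Assumption: work nobody has classified yet is treated as unrecoverable,
    so the safe default applies until someone assigns a bucket.
    """
    bucket = BUCKETS.get(workstream, Recovery.NONE)
    return CONTRACTS[bucket]


if __name__ == "__main__":
    for name in ("quarterly-internal-script", "permission-middleware", "new-billing-job"):
        c = contract_for(name)
        print(f"{name}: AI may draft={c.ai_may_draft}, review={c.review}")
```

Keeping the assignments in one small table is also what makes the quarterly review in part 4 practical: moving a workstream between buckets is a one-line change that can itself be reviewed.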

Everything else — training, KPIs, hiring — sits underneath these four.
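
The marker from part 3 can be as small as one structured comment. The sketch below is one possible shape, not a standard: the `AI-ASSIST:` prefix, the field names, and the JSON encoding are all assumptions made for illustration. What matters is that the marker records the three things named above (where AI assisted, what was kept, who reviewed it) and that a script can read it back later.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical prefix; any string that is easy to grep for would do.
PREFIX = "AI-ASSIST: "


@dataclass(frozen=True)
class AIMarker:
    assisted: str   # where AI assisted
    kept: str       # what was kept from the AI output
    reviewer: str   # who reviewed it


def to_comment(marker: AIMarker) -> str:
    """Render the marker as a single-line code comment."""
    return "# " + PREFIX + json.dumps(asdict(marker), sort_keys=True)


def from_comment(line: str) -> Optional[AIMarker]:
    """Parse a marker back out of a line, or return None if there is none."""
    _, found, payload = line.partition(PREFIX)
    if not found:
        return None
    return AIMarker(**json.loads(payload))


if __name__ == "__main__":
    marker = AIMarker(
        assisted="drafted test cases",
        kept="3 of 5 cases",
        reviewer="j.doe",
    )
    line = to_comment(marker)
    print(line)
    print(from_comment(line) == marker)
```

The same three fields could equally live in a ticket tag or a release-log line; the comment form is shown only because it travels with the code.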

What I changed underneath the framework

I rewrote the team KPIs to account for AI involvement. Sprint velocity stopped being a quality proxy the moment AI entered the loop, so I kept tracking it but stopped trusting it, and added four signals: defects linked to missed requirements, AI-generated artefacts rejected at review, review time per artefact, and recurrence of root-cause categories across sprints. The one I watch most is the rejection rate. If a person’s AI output is being accepted at 95%-plus, that is not a sign of quality — it is a sign the review is not a review.
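
As a rough illustration of the rejection-rate signal, the sketch below computes a per-person rejection rate from review outcomes and flags anyone whose AI output is accepted at 95% or more. The input shape (pairs of author and accepted flag) and the `min_reviews` floor are assumptions added for the example; the 95% threshold is the one mentioned above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Review = Tuple[str, bool]  # (author of the AI-assisted artefact, accepted at review?)


def rejection_rates(reviews: Iterable[Review]) -> Dict[str, float]:
    """Fraction of each author's AI-assisted artefacts rejected at review."""
    total: Counter = Counter()
    rejected: Counter = Counter()
    for author, accepted in reviews:
        total[author] += 1
        if not accepted:
            rejected[author] += 1
    return {author: rejected[author] / total[author] for author in total}


def rubber_stamp_suspects(
    reviews: Iterable[Review],
    acceptance_threshold: float = 0.95,
    min_reviews: int = 20,  # assumption: ignore authors with too few reviews to judge
) -> List[str]:
    """Authors whose acceptance rate suggests the review is not a review."""
    total: Counter = Counter()
    accepted_count: Counter = Counter()
    for author, accepted in reviews:
        total[author] += 1
        if accepted:
            accepted_count[author] += 1
    return sorted(
        author
        for author in total
        if total[author] >= min_reviews
        and accepted_count[author] / total[author] >= acceptance_threshold
    )


if __name__ == "__main__":
    sample = (
        [("ana", True)] * 20
        + [("ben", True)] * 15
        + [("ben", False)] * 5
        + [("cy", True)] * 3
    )
    print(rejection_rates(sample))
    print(rubber_stamp_suspects(sample))
```

A flag from a check like this is a prompt to look at how the review is being done, not a verdict on the person whose work was reviewed.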

I also changed hiring, which I think is the change that compounds the most. Engineers and testers have long been judged mostly on what they produce — and AI now produces a lot of what used to be the evidence. So I added two assessments. The first is a short take-home: candidates get an AI-generated artefact with two plausible errors planted in it and are asked to review it, with no hint that anything is wrong. I do not grade on how many they catch; I grade on how they reason about what to trust. “Looks fine to me” is a polite no. The second is a conversation about a real time they disagreed with an AI suggestion and what they did. A candidate who has never disagreed with one is not the senior I need. The criterion is not whether they use AI — it is whether they have a working relationship with disagreement.

What changed, with the honest caveat

These are internal observations from one team and two products over roughly two quarters. They are not benchmarks, the sample is small, and I cannot fully separate the effect of the framework from the team’s own learning curve. With that stated plainly:

Throughput per engineer rose by roughly 70%, measured as completed story points adjusted for re-opened work and post-release defects, against the pre-rollback baseline. End-to-end time from refinement-ready to production-stable fell about 65% in the first quarter and closer to 70% in the second, almost entirely from fewer mid-sprint clarifications and fewer rollbacks rather than faster typing. Rework, the single biggest cost driver, dropped visibly, with no change in team size. And review time per artefact went up first, by maybe 40% in the opening month as people learned to interrogate AI output, before settling slightly below the old baseline once the questions became habit. If you do this, expect that early rise in review time. It is the cost of buying the discipline, not a problem to fix.

A short playbook

If you take nothing else, take these:

  • Classify AI-assisted work by reversibility, not complexity. How easily can a wrong output be caught and undone?
  • Never let AI draft work that humans cannot verify. If no one has time to check it, AI does not get to produce it.
  • Track AI rejection rate, not just AI adoption. Near-total acceptance means the review has quietly stopped happening.
  • Make AI involvement visible in tickets, code, test packs, and release notes, so failures are traceable.
  • Review the delegation contract on a schedule. Capability and team maturity drift; the buckets must drift with them.
  • Hire for the ability to challenge AI output, not just use it. Reasoning about what to trust is the skill that now separates strong engineers from fast ones.

Closing

Autonomous AI agents are not making management less important. They are making it more exacting. The old craft — coordination, escalation, execution — still matters, but it is no longer enough. The new work is quieter: classify by reversibility, write the review contract before the work, make AI involvement visible, and hire for judgement rather than output.

I learned it the hard way, with a rollback on a Friday and a slow conversation the following Monday. The team I run now delivers more from the same headcount and trusts its own releases more than it used to. None of that came from the tools. It came from changing how I delegated.
