AI Writes Code 10x Faster. Your Team Reviews It at 1x. Now What?
AI coding agents generate thousands of lines in minutes. But someone still has to review it all. Code review — not code generation — is now the bottleneck. Three strategies are emerging to deal with it.
The 10,000-Line PR Nobody Wants to Review
A developer I know recently shared a screenshot: a pull request with over 10,000 lines added. The reaction from everyone who saw it was the same — part horror, part resignation. "How are you supposed to review that?"
The honest answer from the team? "You either grab the person offline and do it together, or you don't review it at all."
This isn't a story about one bad PR. It's the new normal. AI coding agents — Claude Code, Codex, Cursor, Copilot — generate code at a pace that makes traditional review workflows feel like trying to drink from a fire hose. The bottleneck in software development has shifted. It's no longer about writing code. It's about everything that happens after the code is written.
The Speed Mismatch Problem
Here's the math that breaks your workflow:
A CodeRabbit study found that AI-authored pull requests produce 10.83 issues per PR compared to 6.45 for human-written ones — with 75% more logic and correctness errors. A separate study found that senior engineers spend an average of 4.3 minutes reviewing AI-generated suggestions, compared to 1.2 minutes for human-written code.
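As a rough back-of-envelope illustration (rough because the per-suggestion review time isn't strictly a per-issue figure): 10.83 / 6.45 ≈ 1.7 times as many issues per PR, and 4.3 / 1.2 ≈ 3.6 times as long per evaluation, which compounds to roughly 1.7 × 3.6 ≈ 6 times the review effort per PR, before accounting for the larger diffs themselves.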
Think about what this means in practice. AI generates more code, that code has more issues, and each issue takes longer to evaluate. Meanwhile, CircleCI's analysis of 28 million CI workflows shows that while teams are creating far more code than ever, less of it is actually making it into production. The pipeline is clogged — and code review is the clog.
As LogRocket put it: code review — not code generation — is now the bottleneck to shipping.
Why Traditional Review Can't Scale
Traditional code review was designed for a world where a developer writes 100-300 lines of meaningful code per day. In that world, reviewing a colleague's 200-line PR is a 15-minute task. You know the codebase, you know the developer's patterns, you can reason about intent.
AI-generated PRs break every one of these assumptions:
Volume. An AI agent working on a feature can touch hundreds of files in a single session. Not because it needs to, but because it treats "while I'm here, let me fix this too" as the default behavior. A feature that a human would implement across 3-4 files becomes a 50-file PR.
Traceability. When a human writes code, the reviewer can infer intent from the changes. AI-generated code often lacks this legibility. Why did it restructure this function? Was it necessary for the feature, or was the agent "improving" things on its own? You can review an architecture plan, but there's no way to verify whether the generated code faithfully implements that plan without reading every line.
Duplication. AI agents are notorious for generating redundant code. They'll create a utility function in one file while a nearly identical one already exists elsewhere. Analysis of Claude Code's own codebase reportedly found significant amounts of duplicated code. When the author doesn't understand the codebase holistically, it shows.
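A hypothetical illustration of that pattern (both functions, and the file paths in the comments, are invented for the example):

```typescript
// src/utils/currency.ts: existing, reviewed, already used across the codebase
export function formatCurrency(amount: number, currency = "USD"): string {
  return new Intl.NumberFormat("en-US", { style: "currency", currency }).format(amount);
}

// src/features/invoices/helpers.ts: added by the agent in the same PR because it
// "needed" a formatter and never looked for the one that already exists
export function formatMoney(value: number, currencyCode = "USD"): string {
  return new Intl.NumberFormat("en-US", { style: "currency", currency: currencyCode }).format(value);
}
```

Neither function is wrong on its own, which is exactly why a reviewer skimming a large diff rarely flags it.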
Confidence. The code looks right. It follows conventions, has proper variable names, includes comments. But looking right and being right are different things. AI-generated code can pass a surface-level review while hiding subtle logic errors that only manifest in edge cases. This creates a false sense of security that's arguably worse than obviously bad code.
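A contrived sketch of that failure mode (the function and its bug are invented for illustration): the code reads cleanly, the names are sensible, and it works for typical inputs, but it quietly drops the final partial page.

```typescript
// Looks reasonable at a glance, passes a skim review, and works whenever
// items.length happens to be a multiple of pageSize.
function paginate<T>(items: T[], pageSize: number): T[][] {
  const pages: T[][] = [];
  const pageCount = Math.floor(items.length / pageSize); // subtle bug: should be Math.ceil
  for (let i = 0; i < pageCount; i++) {
    pages.push(items.slice(i * pageSize, (i + 1) * pageSize));
  }
  return pages;
}

// paginate([1, 2, 3, 4, 5], 2) returns [[1, 2], [3, 4]] and silently loses the 5.
```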
Three Strategies Emerging
Teams are converging on three distinct approaches to this problem. Most organizations will end up combining all three, but the mix varies based on risk tolerance and infrastructure maturity.
Strategy 1: Harness Engineering — Prevent Bad Code from Being Written
The hottest term in developer tooling right now is "harness engineering" — the discipline of designing systems that constrain AI agents before they produce output, rather than catching mistakes after.
As Martin Fowler's team describes it, a harness is the complete infrastructure that governs how an agent operates: the tools it can access, the guardrails that keep it safe, the feedback loops that help it self-correct, and the observability layer that lets humans monitor its behavior.
In practice, this looks like:
- CLAUDE.md/AGENTS.md files that encode project conventions, architectural boundaries, and forbidden patterns directly in the repository. The AI reads these rules before writing any code.
- Pre-commit hooks and linters that catch violations automatically — not as suggestions, but as hard blocks (see the sketch after this list).
- Structured workflows where the agent must produce a plan, get approval, then execute — with automated checks at each stage.
- Worktree isolation so each agent session operates on its own branch, preventing concurrent agents from conflicting.
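To make the "hard block" idea concrete, here is a minimal sketch of a pre-commit check: a script that scans staged files for patterns the team has banned and exits nonzero so the commit is rejected outright. The specific rules, and the assumption that it's wired up via a pre-commit hook, are illustrative rather than any particular tool's API.

```typescript
// check-forbidden-patterns.ts
// Intended to run from a pre-commit hook; a nonzero exit blocks the commit.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Hypothetical project rules; a real team would encode its own.
const FORBIDDEN: { pattern: RegExp; reason: string }[] = [
  { pattern: /\bany\b/, reason: "no `any` in production code" },
  { pattern: /console\.log\(/, reason: "use the project logger instead" },
  { pattern: /from ["']\.\.\/\.\.\/\.\.\//, reason: "deep relative imports cross architectural boundaries" },
];

const stagedFiles = execSync("git diff --cached --name-only --diff-filter=ACM", { encoding: "utf8" })
  .split("\n")
  .filter((file) => /\.(ts|tsx)$/.test(file));

let violations = 0;
for (const file of stagedFiles) {
  const source = readFileSync(file, "utf8");
  for (const rule of FORBIDDEN) {
    if (rule.pattern.test(source)) {
      console.error(`${file}: ${rule.reason}`);
      violations += 1;
    }
  }
}

if (violations > 0) {
  console.error(`Blocked: ${violations} violation(s). Fix the code, or change the harness rules on purpose.`);
  process.exit(1);
}
```

The agent hits the same wall a human would, and the rule lives in the repository where it can be reviewed once instead of re-litigated in every PR.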
OpenAI's harness engineering guide for Codex emphasizes that better models make harness engineering more important, not less. More capable agents get more autonomy, and more autonomy demands better guardrails.
The philosophy is proactive: when an agent produces bad output, you don't just fix the output — you fix the harness so it can't happen again.
Strategy 2: Rollback-Friendly Infrastructure — Accept Failure, Recover Fast
The second strategy takes the opposite philosophical stance: instead of preventing all bad code from shipping, make it trivially easy to undo bad deployments.
This approach requires:
- Feature flags to gate new functionality and kill it instantly if problems appear.
- Blue-green or canary deployments that limit blast radius.
- Automated rollback triggers based on error rate spikes or performance degradation (a minimal sketch follows this list).
- Comprehensive observability so you know when something breaks, even if you didn't catch it in review.
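Here is that minimal sketch of an automated rollback trigger, assuming hypothetical metrics and feature-flag clients rather than any specific vendor's SDK: poll the error rate behind a newly flagged feature and disable the flag the moment it crosses a threshold.

```typescript
// Hypothetical interfaces; in practice these would wrap your observability and flag vendors' SDKs.
interface MetricsClient {
  errorRate(feature: string, windowMinutes: number): Promise<number>; // errors / total requests
}
interface FlagClient {
  disable(flag: string): Promise<void>;
}

const ERROR_RATE_THRESHOLD = 0.02; // 2% of requests failing triggers a rollback
const CHECK_INTERVAL_MS = 60_000;

function watchAndRollBack(flag: string, metrics: MetricsClient, flags: FlagClient): void {
  const timer = setInterval(async () => {
    const rate = await metrics.errorRate(flag, 5);
    if (rate > ERROR_RATE_THRESHOLD) {
      clearInterval(timer);
      await flags.disable(flag); // kill the feature instantly; humans investigate afterwards
      console.error(`Rolled back "${flag}": error rate ${(rate * 100).toFixed(1)}% over the last 5 minutes`);
    }
  }, CHECK_INTERVAL_MS);
}
```

The particular loop matters less than the principle: the decision to undo doesn't wait for a human to notice.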
Teams following this strategy essentially say: "We trust the tests and monitoring more than we trust human review at scale." The bet is that a well-instrumented production environment catches more real bugs than a human skimming through a 5,000-line diff.
This isn't reckless — it's the same philosophy that drove continuous deployment adoption years ago. But it requires mature infrastructure that many teams don't have yet.
Strategy 3: Policy-Based Review — Redefine What Gets Reviewed
The most pragmatic (and controversial) approach: change what your team agrees to review.
Some teams are explicitly deciding that certain categories of AI-generated changes don't require traditional review:
- Auto-formatting and style changes — trust the linter.
- Test additions — if they pass CI, ship them.
- Dependency updates — automated tools handle this already.
- Boilerplate and scaffolding — the AI follows templates; the templates were reviewed.
The review effort is concentrated on what humans are actually good at evaluating: architectural decisions, security boundaries, business logic correctness, and API contract changes.
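One way to keep that agreement honest is to encode it as code in CI. The sketch below buckets a PR's changed paths into "auto-mergeable with green CI" versus "needs a human"; the categories and path patterns are hypothetical, and the point is only that the policy lives in the repository rather than in tribal knowledge.

```typescript
// review-policy.ts: classify a PR's changed files against the team's review policy.
type ReviewLevel = "auto" | "human";

// Hypothetical policy; every team will draw (and argue about) its own boundaries.
const POLICY: { match: RegExp; level: ReviewLevel; why: string }[] = [
  { match: /\.(test|spec)\.tsx?$/, level: "auto", why: "test additions: trust CI" },
  { match: /^(package-lock\.json|pnpm-lock\.yaml)$/, level: "auto", why: "dependency updates: automated tooling" },
  { match: /^src\/auth\//, level: "human", why: "security boundary" },
  { match: /^src\/api\/.*\.contract\.ts$/, level: "human", why: "API contract change" },
];

export function reviewLevel(changedFiles: string[]): ReviewLevel {
  // A single file requiring human review escalates the whole PR; unknown paths default to "human".
  const needsHuman = changedFiles.some(
    (file) => (POLICY.find((rule) => rule.match.test(file))?.level ?? "human") === "human"
  );
  return needsHuman ? "human" : "auto";
}

// reviewLevel(["src/utils/date.test.ts", "package-lock.json"])   -> "auto"
// reviewLevel(["src/auth/session.ts", "src/utils/date.test.ts"]) -> "human"
```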
This requires team alignment. As one developer put it: "It's a policy problem. If the team agrees not to review certain changes, it's solvable." But getting that agreement — and defining the boundaries precisely — is harder than it sounds.
The Tooling Response
The tooling ecosystem is scrambling to address this gap. JetBrains Air, launched in March 2026, positions itself as an "Agentic Development Environment" where multiple AI agents (Codex, Claude, Gemini CLI, Junie) can run independent task loops with built-in worktree management and review workflows. The pitch: if the IDE manages agent orchestration, it can also manage the review surface.
AI-powered code review tools like CodeRabbit, Greptile, and GitHub's Copilot-powered reviews are trying to use AI to review AI-generated code. Atlassian reported 30.8% faster PR processing with their AI reviewer, but speed isn't the same as quality. These tools are good at catching surface-level issues — the same things a linter catches — but struggle with the deeper judgment calls that make review valuable.
The fundamental irony: we're using AI to solve a problem that AI created.
The Testing Question
Buried in all of this is an equally important question about testing. When AI writes both the code and the tests, what confidence do the tests actually provide?
The emerging consensus among teams I've talked to is surprisingly minimal:
- Unit tests for stateful logic (reducers, state machines, complex utilities) — yes, always.
- Unit tests for simple components or straightforward functions — often skipped.
- E2E tests — rarely written, and when they are, they're usually for critical user paths only.
- Integration tests against real databases — preferred over mocked tests, since mocks can mask real failures.
The reasoning: if an AI wrote both the implementation and the tests, the tests probably encode the same assumptions (and same bugs) as the implementation. Tests are most valuable when they encode human understanding of correct behavior, not when they're a rubber stamp from the same system that wrote the code.
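In that spirit, here is a small sketch of the kind of test the consensus keeps: a unit test for a hypothetical cart reducer, where the assertion states a human expectation about behavior rather than restating the implementation. The reducer, the action names, and the test/expect globals (Jest or Vitest style) are all illustrative.

```typescript
// A hypothetical piece of stateful logic: the kind of code the consensus says to always unit test.
type CartState = { items: { sku: string; qty: number }[] };
type CartAction = { type: "add"; sku: string } | { type: "remove"; sku: string };

function cartReducer(state: CartState, action: CartAction): CartState {
  switch (action.type) {
    case "add": {
      const existing = state.items.find((item) => item.sku === action.sku);
      return existing
        ? { items: state.items.map((item) => (item.sku === action.sku ? { ...item, qty: item.qty + 1 } : item)) }
        : { items: [...state.items, { sku: action.sku, qty: 1 }] };
    }
    case "remove":
      return { items: state.items.filter((item) => item.sku !== action.sku) };
  }
}

// The assertion encodes what a human believes "correct" means for this feature.
test("adding the same SKU twice increments quantity instead of duplicating the line", () => {
  const once = cartReducer({ items: [] }, { type: "add", sku: "A1" });
  const twice = cartReducer(once, { type: "add", sku: "A1" });
  expect(twice.items).toEqual([{ sku: "A1", qty: 2 }]);
});
```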
The Individual vs. Shared Tooling Debate
There's another tension playing out in teams adopting AI coding tools: should the AI configuration be shared or individual?
On one side, there's a push for standardization. A shared CLAUDE.md or AGENTS.md in the repo ensures every team member's AI follows the same conventions. Shared skills and agent configurations reduce duplicated effort and ensure consistency.
On the other side, many developers resist this. One senior engineer's take: "Even if the team publishes a shared config, I'd delete it and use my own." The reasoning isn't contrarian for its own sake — each developer has a different workflow, different strengths they want the AI to complement, and different weaknesses they want it to cover.
The layered approach that's emerging seems practical: shared project-level conventions in the repo (what the codebase requires), individual user-level preferences on each machine (how each developer works), and organizational policies for security and compliance. What shouldn't be shared is workflow — the sequence of steps, the level of autonomy, the review preferences. These are personal, and forcing standardization often means everyone gets a mediocre setup instead of one that's optimized for how they actually work.
What Actually Works Right Now
After researching this extensively and observing real teams navigate it, here's what I think works today:
Keep PRs small. This hasn't changed. The same AI tools that create monster PRs can be configured to work in smaller increments. Break features into sub-PRs against a feature branch. The tools that enforce small changes (like Graphite) are winning for a reason.
Invest in the harness before investing in review tools. It's cheaper to prevent bad code from being generated than to review it after the fact. A well-written CLAUDE.md with architectural boundaries, forbidden patterns, and testing requirements does more than any AI review tool.
Build rollback capability regardless. Whether you review thoroughly or not, you need the ability to undo. Feature flags and canary deployments aren't optional anymore — they're table stakes.
Don't skip review — change what you review. Human review should focus on intent, architecture, and security. Let automated tools handle style, formatting, and basic correctness. This isn't lowering the bar — it's moving human attention where it actually matters.
Let individuals configure their own workflow. Share conventions, not workflows. The developer who wants full plan-mode review and the developer who wants autonomous execution can both follow the same architectural rules while working in completely different ways.
The Deeper Question
There's something uncomfortable about this whole situation. We built AI tools to make us faster, and the first thing they did was expose how much of our process was held together by human bandwidth that we never quantified.
Code review worked when the review load was manageable. Testing strategies worked when humans wrote the tests with domain understanding. Architecture decisions worked when the person writing the code understood the system.
AI removed the constraint of writing speed and revealed that the real constraints were always comprehension, judgment, and trust. Those are the things that don't scale with faster models or bigger context windows.
The teams that navigate this well won't be the ones with the best AI tools. They'll be the ones that honestly assess which parts of their process require human judgment and which don't — then build infrastructure that matches that assessment.
The 10,000-line PR isn't the problem. It's the symptom. The problem is that we optimized for output velocity without upgrading the systems that handle that output. And that's a human engineering problem, not an AI one.