I've been using Claude Code heavily at work. Like, eight hours a day, six chatbot sessions open at once, each working a different ticket. And the more I use it, the more convinced I am of something that sounds absolutely "illegal" to say out loud:
We're doing code review wrong. Not a little wrong. Structurally wrong.
Not because code review is bad, but because our default assumption is outdated:
"A human wrote every line, so a human should read every line."
That world is fading. And if you keep reviewing like it's 2016 while code is being produced like it's 2026, your process is going to break. Whether you're an engineer, an engineering manager, or leading a team that ships software, this affects how you build, staff, and measure.

We Stopped Reading Assembly, But This Isn't Quite The Same Move
We've done this before. When programming moved from assembly to higher-level languages, nobody said "wait, we need to review the assembly output." We trusted the abstraction. We moved up the stack. We focused on intent, behavior, and outcomes.
AI-generated code is a similar shift, just way, way bigger, and with a critical difference.
We already tolerate opaque internals constantly. You're not reading every npm package you install. You're not auditing React internals before shipping. Nobody reviews the JavaScript that TypeScript compiles down to. Nobody inspects the server configs AWS provisions when you write Terraform. You write a SQL query and let the query planner decide how to execute it.
We trust abstractions every single day. AI-generated code is the next one. But here's the honest caveat: compilers are deterministic and, after decades of use, extremely well tested. LLMs are not. Two identical prompts can produce different code. Worse, AI agents actively drift from their own instructions as the context window grows. The system prompt says "don't insert default values" and the agent does it anyway three thousand tokens later. This isn't a bug that gets fixed with better models; it's fundamental to how these systems work.
That doesn't mean the shift isn't happening. It means the guardrails need to be stronger, not weaker. The burning question is: what should code review become?

Review The Plan + The Evidence, Then Skim The Diff For Boundary Mistakes
If implementation is increasingly generated, the highest-leverage review shifts earlier: plans, constraints, interfaces, architecture, risks, tests.
Claude Code has a plan mode where it proposes changes before executing them. It reads CLAUDE.md files containing your project's conventions, architecture decisions, and coding standards: the root one loads at session start, and nested ones load on demand as it works through your codebase. Engineers are already writing intent and constraints rather than code.
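For concreteness, a minimal CLAUDE.md might look something like this; the specific conventions below are illustrative, not a recommendation:

```markdown
# Project conventions

## Architecture
- HTTP handlers live in `src/api/`; business logic in `src/services/`.
- Handlers never talk to the database directly.

## Coding standards
- Strict TypeScript; no `any`.
- Surface errors explicitly; never silently substitute default values.

## Testing
- Every bug fix ships with a regression test.
```

The point is less the rules themselves than that they live in the repo, versioned alongside the code they govern.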
But "review the plan, not the diff" is an overcorrection. The better frame:
Review the plan. Review the evidence. Then skim the diff for boundary mistakes.
For most changes, you're asking: Is the plan sound? Are the constraints right? Are the edge cases covered? Do the tests prove the behavior? That's higher-leverage than line-by-line nitpicking.
But for auth, payments, PII handling, and permission checks, you still want targeted line-reading of the critical paths, even when the plan looks solid. Agents drift. Plans don't capture every implementation detail. "Boundary code" always deserves human eyes on the diff.
That's not "no review." That's reviewing the right layer for the right kind of change.
The Two-Agent Review Pattern
The Claude Code team at Anthropic is already doing this: Claude A writes the plan, Claude B reviews it as a skeptical "staff engineer," looking for edge cases and questioning assumptions.
When things go sideways during implementation, the team's rule is to stop pushing forward: go back to plan mode and re-plan from scratch. A fresh plan with knowledge of what failed consistently beats incremental patching.
The Engineer's Role Is Expanding
When implementation is no longer the bottleneck, you have bandwidth to think more holistically. You're not heads-down coding for 6 hours to get a single feature working. You're spending 30 minutes in one terminal getting Claude Code to implement it, and the rest of your time thinking about whether this is the RIGHT thing to build, whether the user flow makes sense, and whether the architecture will scale.
Engineers are going to become way more business- and product-minded. You'll need to work with product managers more closely. You'll need to understand the WHY behind features at a deeper level. Part UX thinker, part product strategist, part architect, but still mostly engineer, because you're the one making hundreds of micro-decisions about how it should all work.
The Review Bottleneck Nobody Wants To Admit

Here's the dirty secret: code is being written WAY faster than it can be reviewed.
Faros AI analyzed telemetry across 10,000+ developers and found teams with high AI adoption merged 98% more PRs, but PR review time increased 91%. You're producing twice the output but reviews take almost twice as long.
And that's before parallelization. Engineers running 3-5 simultaneous Claude Code sessions across git worktrees are generating multiple PR streams at once. One team described their situation bluntly: a small bug fix sitting in merge-request purgatory for two days, a new feature taking three days just to get reviewed (not built, just reviewed). Code review had stopped being their quality gate and started being their bottleneck.
Teams handling 10-15 PRs a week are now staring at 50-100. If review stays "carefully read everything" while AI turns every engineer into a PR factory, your org becomes a firehose with a human-sized funnel. Something has to give.
A Review Stack That Actually Scales

Tiered review systems.
Tier 1: fully automated (linting, static analysis, unit tests, security scanning, type checking). No human involved.
Tier 2: peer review for behavior, correctness, and "does this match the intent?"
Tier 3: senior/security review for critical paths (auth, payments, PII, system boundaries). Most changes should never need Tier 3.
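Tier routing like this can be done mechanically. Here's a minimal sketch in Python that maps a PR's changed files to a tier; the path patterns are assumptions and would differ for every codebase:

```python
from fnmatch import fnmatch

# Illustrative patterns; every codebase's risk map is different.
TIER_3_PATTERNS = ["*auth*", "*payment*", "*billing*", "*permission*"]
TIER_1_ONLY_PATTERNS = ["*.md", "docs/*", "*.lock", "*.snap"]

def review_tier(changed_files: list[str]) -> int:
    """Route a change to the strictest tier any of its files demands."""
    if any(fnmatch(f, p) for f in changed_files for p in TIER_3_PATTERNS):
        return 3  # senior/security eyes on the critical paths
    if changed_files and all(
        any(fnmatch(f, p) for p in TIER_1_ONLY_PATTERNS) for f in changed_files
    ):
        return 1  # docs/lockfile-only change: automated checks suffice
    return 2  # default: peer review for behavior and intent
```

Wired into CI, the result can automatically decide which reviewers get requested, so the rubric doesn't depend on anyone remembering it.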
AI-assisted review at every layer, not just CI. Most teams think of AI review as something that happens after a PR is opened. That's too late. The key insight: don't let the agent grade its own homework. A separate subagent catches semantic issues before the code ever reaches a human, in under a minute.
On the CI side, teams are building pipelines that pull context from JIRA tickets and merge request descriptions, route to tech-stack-specific review agents, and post inline comments directly on the MR. Reviews that took 2-3 days dropped to hours. Layer it: AI review during development via hooks, AI review at the PR level via CI, then humans for judgment calls.
PR size caps. If AI can generate 2,000 lines in minutes, you need mechanical limits. Cap PRs at around 400 lines, tuned to your team's preferences. Force decomposition. Smaller PRs are easier to review, reason about, and roll back. Include an escape hatch for pure mechanical refactors (formatting, renames) that are auto-verified and reviewed differently.
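A sketch of what enforcing the cap mechanically might look like; the 400-line default and the `mechanical-refactor` label are assumptions, not a standard:

```python
def pr_size_gate(added: int, deleted: int, labels: set[str],
                 cap: int = 400) -> tuple[bool, str]:
    """Mechanical size gate with an escape hatch for mechanical refactors.

    The cap and label name are illustrative; in CI you would feed this
    the diff stats your Git host reports for the pull request.
    """
    churn = added + deleted
    if "mechanical-refactor" in labels:
        # Renames/formatting: auto-verified and reviewed differently.
        return True, f"{churn} lines allowed via mechanical-refactor label"
    if churn > cap:
        return False, f"{churn} changed lines exceeds the {cap}-line cap; split the PR"
    return True, f"{churn} changed lines within the {cap}-line cap"
```

Run it as a required status check so the limit is enforced by the pipeline, not by reviewer goodwill.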
Outcome-based verification. Shift from "I read it" to "I proved it works." Boris Cherny, the creator of Claude Code, puts it this way: the most important thing for great results is giving Claude a way to verify its own work. If it has that feedback loop, quality goes up dramatically.
Strong merge gates. CI green, security checks clean, type/lint clean, coverage met. Automate what machines are better at, reserve humans for product, architecture, and risk.
Guardrails Become The Product
If we stop line-reading everything, safety mechanisms stop being nice-to-haves and become the foundation.
AI-generated code has its own characteristic failure patterns: silently inserting default values instead of surfacing errors, drifting from its own instructions as the context window grows, defaulting to vague names that describe nothing about the business domain. These aren't the same bugs humans write, and your guardrails need to catch them specifically.
Testing becomes the primary evidence, but not the only evidence. High coverage on critical flows, integration/e2e gates where failures are expensive, regression tests for every real bug you ship. Your tests are the contract. But remember: tests are also code, and they're increasingly AI-generated too. Test quality itself needs review, especially for critical paths. Coverage metrics can be gamed or misleading; what matters is that tests prove the dangerous paths actually behave correctly.
Security checks become the default. Static Application Security Testing, dependency scanning, secret detection, injection checks, all running on every PR before a human looks at it.
Type checking, linting, and static analysis become blocking. No exceptions. No vibes. Doesn't pass? Doesn't ship.
Architecture Decision Records become infrastructure. ADRs document the WHY behind choices. Claude Code's CLAUDE.md is a lightweight version: constraints that live in the repo so the system doesn't reinvent itself every PR. The teams getting the most out of it treat CLAUDE.md as a living document: when Claude makes a mistake, tell it to update its own rules. Over time it becomes institutional memory, a compounding record of lessons learned.
Observability gets more important. When something breaks in code you didn't write and didn't read, debugging becomes forensics. Logging, tracing, metrics, and alerts are how you keep sanity.
Shared tooling enforces guardrails at scale. Custom slash commands, MCP servers, hooks: bake the rules into the tooling instead of relying on individuals to remember them. Every file edit can trigger auto-formatting. Every commit can run lint. Every PR can kick off your full test suite.
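As a sketch of baking a rule into tooling, a hooks entry in `.claude/settings.json` could run a formatter after every file edit. The event name, matcher, and command below are assumptions for a hypothetical Python repo; check the Claude Code hooks documentation for the exact schema your version expects:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "ruff format ." }
        ]
      }
    ]
  }
}
```

The formatting rule now fires on every edit whether or not the agent remembered it, which is exactly the property you want given instruction drift.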

Self-Healing Pipelines: Where This Could Be Headed
This part is speculative, and I want to be honest about the gap between where teams are today and this vision.
What if the pipeline itself detected an issue, auto-generated a regression test, produced a fix, ran it through all guardrails, and redeployed? Deploy → Monitor → Detect → Test → Fix → Redeploy. The human decides POLICY instead of writing the patch at 2am.
Teams are already partway there. Engineers on the Claude Code team point Claude at Docker logs and say "fix it." When CI fails: "Go fix the failing CI tests." Claude reads the output, traces failures to code changes, and produces fixes.
But let's be clear about the gap: "Claude fixes what I point it at" works for straightforward failures. Production incidents involving subtle state corruption, race conditions, or emergent behavior from multiple interacting services are a different beast. The realistic next step for most organizations is "AI proposes, proves, and stages; a human approves the production rollout." Full autonomy is a longer horizon, especially in regulated environments where compliance requires human sign-off.
The test suite becomes a living document of everything that's ever gone wrong. If every production incident automatically spawns a test, coverage grows organically from real-world failures. That's better signal than tests a human guessed might be important upfront.
But What About Junior Engineers?
The honest answer: the junior role as we knew it changes significantly. But "junior jobs are dead" is wrong; the apprenticeship just looks different.
The old model (get assigned a bug, write the code, a mid or senior reviews and teaches) breaks when AI writes the implementation. The floor got raised. Boris Cherny has observed that senior developers outperform juniors with AI agents because they have decades of pattern recognition. They know what good code looks like, what to ask for, and when the AI is wrong.
Here's the thing nobody talks about: seniors built that pattern recognition by spending years in the weeds. If you remove the weeds, you need to deliberately construct a replacement learning path. Just saying "learning shifts upward" isn't enough.
What that path actually looks like:
Writing and maintaining tests and contracts: this is where juniors learn what "correct behavior" means, what edge cases exist, and how systems actually break. It's still deep technical work.
Debugging and incident response: when something breaks in code the AI wrote, tracing the failure is a masterclass in how systems work. Observability, log reading, hypothesis testing: all of this builds the same intuition that debugging your own code used to build.
Turning vague tickets into crisp specs and ADRs: this is the skill that matters most in an AI-assisted world. Juniors who learn to think clearly about requirements, constraints, and edge cases before any code is written will level up fast.
Learning to spot boundary and risk issues in reviews: even if you're not line-reading everything, knowing where to look and what smells wrong is a learnable skill that compounds over time.
Mentorship flips accordingly: seniors review prompts, plans, and constraints instead of code. "Your implementation is wrong" becomes "your specification was incomplete."
Interviews need to change too. Grinding LeetCode matters less when AI can implement any algorithm you describe. System design thinking from day one matters more. The ability to evaluate, scope, and architect: that's what separates engineers now.
"But LLMs Aren't Compilers"
True. And it's the best counterargument, one I took seriously enough to weave throughout this piece rather than bury at the end.
The assembly analogy is useful, but it has real limits. Compilers are deterministic: same input, same output. LLMs are probabilistic. And it's worse than non-determinism: agents actively drift from their own instructions as the context window grows. Your guardrails can't just be suggestions in a CLAUDE.md. They need to be enforced by tooling that runs regardless of whether the agent remembered the rules.
And let's be honest: human-written code isn't deterministic either. Give the same ticket to five engineers and you get five implementations. We've always dealt with variability. The difference now is we can't pretend a human can manually validate everything when output is scaling 10x.
That's why everything in this piece comes back to the same foundation: strong automated gates, risk-tiered human review, and evidence over eyeballing.
TLDR: What I'd Actually Do Next Week

Cap PR size at around 400 lines; if a change is bigger, break it into smaller pieces. Enforce this mechanically, not on the honor system. Include an escape hatch for pure mechanical refactors that are auto-verified.
Create a risk-tier rubric: not every change needs the same scrutiny. Low-risk changes get automated checks only. High-risk changes (auth, payments, user data) get senior eyes and targeted diff reading.
Require a PR artifact for every change: a behavior-change summary, the risk tier and why, test evidence (what proves it works), and a rollback plan. This gives reviewers a structured entry point instead of "read the diff and figure it out."
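That artifact can be enforced with a pull request template; the section names below are illustrative and simply map to the four required fields:

```markdown
## Behavior change
<!-- What does this PR change, from the user's or caller's perspective? -->

## Risk tier
<!-- 1 / 2 / 3, and why. Tier 3: auth, payments, PII, system boundaries. -->

## Test evidence
<!-- What proves it works? Link CI runs, new tests, staging screenshots. -->

## Rollback plan
<!-- How do we undo this if it ships broken? -->
```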
Make automated checks mandatory: type checking, linting, security scans, and tests must pass before anyone can merge. No exceptions, no "just this once."
Require real tests for risky changes: not just hitting a coverage number, but tests that actually prove the dangerous behavior works correctly.
Set up AI review during development, not just after: use hooks to have a separate AI agent review code as it's being written, so issues get caught before a PR is even opened.
Document your rules where the AI can read them: CLAUDE.md files in every repo. When Claude makes a mistake, have it update its own rules so it doesn't repeat it.
Make "show me it works" part of review: staging environments and previews prove behavior better than reading diffs ever will.
Track your review queue: measure how long PRs wait for review. This is where the bottleneck shows up first, and you need to see it coming.
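Measuring the queue can start very small. A sketch, assuming you can export opened and first-review timestamps from your Git host's API:

```python
from datetime import datetime, timedelta

def review_wait_stats(prs: list[dict]) -> dict[str, timedelta]:
    """Summarize how long PRs waited between opening and first review.

    The input shape is illustrative; in practice you would pull the two
    timestamps per PR from your Git host and run this on a schedule.
    """
    waits = sorted(pr["first_review"] - pr["opened"] for pr in prs)
    n = len(waits)
    return {
        "median": waits[n // 2],
        "p90": waits[min(n - 1, int(n * 0.9))],  # crude percentile, fine for a dashboard
        "worst": waits[-1],
    }
```

Watch the p90 and worst numbers: the median can look healthy while a tail of PRs quietly rots in the queue.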
Raise the floor with automation and guardrails. Spend human attention where it's actually valuable.
The Bottom Line
The role is shifting from "write code, review code" to "design systems, set constraints, define policy, verify outcomes." That's not a downgrade. That's an upgrade โ if you earn it with infrastructure.
Stop reading every line of code as the default. But don't confuse that with "trust the machine blindly."
Trust the guardrails. Trust the tests. Trust the gates. Save human judgment for the things only humans can judge, and make sure human judgment still looks at the boundaries where the real damage happens.
I've been a software engineer for close to 5 years, recently using Claude Code + Cursor extensively in production workflows. These are observations from the trenches, not theoretical predictions. The shift is already happening โ prepare yourselves!
