AI Engineering · 2026-05-06 · 13 min read

Code Review Prompts for Claude: 12 That Actually Find Bugs

Most "Claude code review prompts" you'll find online are some variant of "Review this code and tell me what's wrong with it." That works for syntax errors and obvious smells, but it leaves an enormous amount of value on the table. Claude — particularly Sonnet 4.5 and Opus 4.7 — will review code as well as a competent senior engineer if you scope the request, give it the right context, and ask in a structure it can answer cleanly. It will produce shallow boilerplate if you don't.

This post walks through 12 prompts that consistently surface real issues in production code, organised by review category. They use the patterns Anthropic recommends in their prompt engineering guide: XML tags for structured input, scoped instructions, and explicit output format. Each is short enough to paste into Claude.ai or wire into the SDK in under a minute.

Why prompt structure matters more for code review than for chat

For a chat task ("explain promises in JavaScript"), the prompt structure barely matters — the model has the full request in its prefix and produces an open-ended response. For code review, three things change:

  • The input is large and structured. A diff plus surrounding context can run thousands of tokens. If you don't separate the diff from the instructions, the model treats the diff as part of the instructions and produces lower-quality output.
  • The output should be actionable, not narrative. A wall of prose telling you "this could potentially be improved" is worse than three numbered findings with file:line and a concrete suggestion.
  • You will pipe the output somewhere: a CI comment, a GitHub PR review, a Slack message. All of these need predictable formatting.

Anthropic's XML-tags guidance exists for exactly these cases: <diff>...</diff>, <context>...</context>, <rules>...</rules>. The model is trained to respect these boundaries. Use them.

The base template

Every prompt below assumes this scaffold. Drop it in once, then swap the <rules> block for the category-specific prompt:

You are reviewing a pull request. Focus only on the rules listed
below — do not comment on style, naming, or anything outside the
rule set.

<diff>
{paste git diff or full file here}
</diff>

<context>
Language: TypeScript / Node 22 / Express
Repo: API service for a B2B SaaS
{any other context the model needs}
</context>

<rules>
{specific review rules — see prompts below}
</rules>

Output format:
- One numbered finding per real issue.
- Each finding: file:line, severity (high/medium/low), one-sentence
  problem statement, one-sentence recommended fix.
- If you find nothing, output exactly: "No findings against the
  supplied rules."
- Do not pad. Do not summarise. Do not include disclaimers.

The "if you find nothing, output exactly X" line is the most important sentence. Without it, Claude — like most LLMs — will invent low-confidence findings to look helpful. With it, you get clean, trustworthy No findings responses you can wire into CI without false positives.

Security review prompts

1. Authentication and authorisation

<rules>
- Every route handler must check both authentication AND
  authorisation. Flag any handler that mutates data but only checks
  authentication.
- Flag any use of user-supplied IDs in WHERE clauses that doesn't
  also constrain by tenant_id / user_id.
- Flag any role check using string comparison without canonical
  casing.
- Flag any JWT verification that doesn't verify both signature AND
  expiration AND issuer.
- Flag any session ID generated with Math.random or any non-CSPRNG.
</rules>

The IDOR (Insecure Direct Object Reference) bullet — the WHERE clause one — is the single most common production vulnerability we've seen Claude catch in review. It looks innocent: SELECT * FROM invoices WHERE id = $1. The bug is that id came from req.params and any logged-in user can read any invoice. Claude flags this reliably when the rule is explicit.
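
A minimal sketch of the pattern, where db, requireAuth, and req.user.tenantId are hypothetical stand-ins for your own client, middleware, and session shape:

app.get("/invoices/:id", requireAuth, async (req, res) => {
  // Vulnerable: any authenticated user can read any invoice.
  // SELECT * FROM invoices WHERE id = $1

  // Fixed: constrain by the tenant derived from the session, never the request.
  const { rows } = await db.query(
    "SELECT * FROM invoices WHERE id = $1 AND tenant_id = $2",
    [req.params.id, req.user.tenantId]
  );
  if (rows.length === 0) return res.sendStatus(404);
  res.json(rows[0]);
});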

2. Input validation and injection

<rules>
- Flag any string concatenation building SQL, shell commands,
  HTML, or file paths from user input.
- Flag any use of innerHTML, dangerouslySetInnerHTML,
  document.write, eval, or new Function with user-supplied data.
- Flag any file system path that joins user input without
  path.resolve + a confined-prefix check.
- Flag any regex built from user input without escaping.
- Flag any deserialisation of user input (JSON.parse on unbounded
  input is fine; YAML.load, pickle.loads, or Object.assign of
  arbitrary keys is not).
</rules>
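
The path rule deserves a concrete shape. A minimal sketch of the path.resolve + confined-prefix check, where UPLOAD_ROOT stands in for your own base directory:

import path from "node:path";

const UPLOAD_ROOT = path.resolve("/var/app/uploads");

function safeJoin(userPath: string): string {
  const resolved = path.resolve(UPLOAD_ROOT, userPath);
  // Confined-prefix check: "../../etc/passwd" resolves outside the root.
  if (resolved !== UPLOAD_ROOT && !resolved.startsWith(UPLOAD_ROOT + path.sep)) {
    throw new Error("path escapes upload root");
  }
  return resolved;
}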

3. Secrets and configuration

<rules>
- Flag any string that looks like a real secret (sk_live_, pk_live_,
  AKIA, github_pat_, ghp_, AIza, eyJ-prefixed JWTs longer than 40
  chars).
- Flag any hardcoded URL, port, or timeout that should be in env.
- Flag any process.env access without a defined fallback or an
  explicit "throw if missing" check at startup.
- Flag any logging statement that prints request bodies, headers,
  or environment variables in full.
</rules>
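
The "throw if missing" check from the third rule is only a few lines; requireEnv is a hypothetical helper name:

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    // Fail at boot, not on the first request that needs the value.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

const DATABASE_URL = requireEnv("DATABASE_URL");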

Performance review prompts

4. Database query patterns

<rules>
- Flag any query inside a loop (N+1 pattern).
- Flag any SELECT * on a table with more than 10 columns when only
  specific columns are used downstream.
- Flag any UPDATE or DELETE without a WHERE clause or with a WHERE
  clause that doesn't hit an index.
- Flag any ORDER BY on an unindexed column with a LIMIT > 100.
- Flag any transaction that wraps an external HTTP call.
</rules>

The "transaction wrapping an HTTP call" rule catches a class of bug that's almost invisible in code review by humans: a Postgres transaction held open for the duration of a Stripe API call, blocking other writes to the same rows for hundreds of milliseconds. The PostgreSQL docs are explicit that long-held row-level locks degrade concurrency. Claude finds this reliably with the rule above.

5. JavaScript / Node hot paths

<rules>
- Flag any synchronous file I/O (readFileSync, writeFileSync) inside
  a request handler.
- Flag any await in a loop over independent work where Promise.all
  would work; flag any await inside a forEach callback (forEach
  never waits for it).
- Flag any JSON.parse of an HTTP response without a size limit upstream.
- Flag any regex with nested quantifiers (catastrophic backtracking
  risk — flag /(a+)+/, /(a|aa)+/, /(.*)*$/, etc).
- Flag any Buffer or string allocation > 1 MB without justification.
</rules>
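
The loop-await rule in practice, with a hypothetical processUser:

// Sequential: total latency is the sum of every call.
for (const id of ids) {
  await processUser(id);
}

// Parallel: total latency is roughly the slowest call.
await Promise.all(ids.map((id) => processUser(id)));

// Broken, not just slow: forEach discards the returned promise, so
// nothing upstream ever actually waits for this work to finish.
ids.forEach(async (id) => { await processUser(id); });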

6. React / frontend rendering

<rules>
- Flag any inline object/array literal passed as a prop that is not
  intentionally re-creating identity.
- Flag any useEffect with no dependency array.
- Flag any useMemo/useCallback whose callback reads values that are
  missing from its dependency array (stale closure).
- Flag any list rendered without a stable key (key={index} on a
  reorderable list).
- Flag any setState call inside a render body.
</rules>
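
The stale-closure rule, shown on a hypothetical search component:

import { useCallback, useState } from "react";

function SearchBox({ runSearch }: { runSearch: (q: string) => void }) {
  const [query, setQuery] = useState("");

  // Stale closure: the callback reads `query`, but `query` is missing
  // from the dependency array, so the memoised function keeps seeing "".
  const search = useCallback(() => runSearch(query), []); // should be [query]

  return (
    <div>
      <input value={query} onChange={(e) => setQuery(e.target.value)} />
      <button onClick={search}>Search</button>
    </div>
  );
}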

Correctness review prompts

7. Error handling

<rules>
- Flag any catch block that swallows the error (empty body, or only
  console.log without re-throw).
- Flag any async function called without await whose return value
  is discarded.
- Flag any Promise.all where a single rejection should not abort the
  rest (Promise.allSettled would be correct).
- Flag any try/finally that doesn't release the resource it acquired.
- Flag any throw of a string or plain object instead of an Error.
</rules>
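
The Promise.all rule separates fan-outs that should fail fast from ones that shouldn't. A sketch with a hypothetical sendWebhook:

// Wrong here: one dead endpoint rejects Promise.all and the remaining
// deliveries are abandoned.
await Promise.all(endpoints.map((url) => sendWebhook(url, payload)));

// Right: let every delivery run to completion, then inspect the failures.
const results = await Promise.allSettled(
  endpoints.map((url) => sendWebhook(url, payload))
);
const failed = results.filter((r) => r.status === "rejected");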

8. Concurrency and idempotency

<rules>
- Flag any handler that performs an external side effect (charge,
  email, webhook delivery) without an idempotency key check.
- Flag any read-then-write sequence on the same row without an
  optimistic-lock version check or SELECT FOR UPDATE.
- Flag any background job that doesn't bound its retries or doesn't
  check whether the work is already complete on retry.
- Flag any setTimeout/setInterval inside a request handler that
  outlives the request.
</rules>
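
A minimal sketch of the idempotency-key check, where db, chargeCard, and the charges table are hypothetical:

async function chargeOnce(key: string, amountCents: number) {
  // If this key was already processed (e.g. the client retried after a
  // timeout), return the original result instead of charging twice.
  const existing = await db.query(
    "SELECT result FROM charges WHERE idempotency_key = $1",
    [key]
  );
  if (existing.rows.length > 0) return existing.rows[0].result;

  const result = await chargeCard(amountCents);
  // A unique index on idempotency_key closes the remaining insert race.
  await db.query(
    "INSERT INTO charges (idempotency_key, result) VALUES ($1, $2)",
    [key, result]
  );
  return result;
}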

9. Edge cases the diff probably forgot

<rules>
- Empty input (empty string, empty array, empty object, null,
  undefined): does the change handle each?
- Input larger than expected (10,000-element array, 10 MB string):
  does the change have any unbounded operation?
- Concurrent invocation: if this code runs twice in parallel, what
  breaks?
- Network failure mid-operation: does the code recover or leave
  inconsistent state?
- Boundary numerics: 0, 1, -1, MAX_SAFE_INTEGER, NaN, Infinity, and
  for currency, the smallest representable unit (cents).
</rules>

Test review prompts

10. Test quality

<rules>
- Flag any test whose name does not describe the behaviour being
  asserted.
- Flag any test that asserts on implementation details (mock call
  counts on internal helpers) rather than observable behaviour.
- Flag any test that mocks the system under test.
- Flag any test that depends on Date.now, Math.random, or network
  without seeding/mocking — flaky.
- Flag any test using sleep/setTimeout to wait for an async result
  instead of awaiting a promise.
</rules>
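
The sleep rule is the most common flake source. A Jest-style sketch with a hypothetical createUser and outbox:

// Flaky: waits a fixed 500 ms and hopes the async work has finished.
test("sends a welcome email (flaky version)", async () => {
  createUser("a@example.com");
  await new Promise((resolve) => setTimeout(resolve, 500));
  expect(outbox.length).toBe(1);
});

// Deterministic: await the promise the code under test actually returns.
test("sends a welcome email", async () => {
  await createUser("a@example.com");
  expect(outbox.length).toBe(1);
});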

11. Coverage gap detection

<rules>
- For every conditional branch in the diff, is there a test that
  exercises both sides?
- For every error path (throw, return Err, return null on failure),
  is there a test that triggers it?
- For every external dependency (DB call, HTTP request, FS access),
  is there either a real integration test or a mock?
</rules>

Readability and maintenance

12. Long-term maintainability

<rules>
- Flag any function longer than 50 lines.
- Flag any function with more than 4 parameters that aren't a
  cohesive options object.
- Flag any nesting deeper than 3 levels (early returns or
  extraction usually fix this).
- Flag any name that abbreviates without a clear convention
  (mgr, ctrlr, hndlr, svc — vs req/res which are conventional).
- Flag any TODO/FIXME without a ticket reference.
- Flag any commented-out code that wasn't deleted.
</rules>

Wiring it into a real workflow

Claude Code (CLI)

If you use Anthropic's Claude Code CLI, the natural pattern is a slash command. Save the base template plus a category as a .md file in .claude/commands/ and invoke it as /security-review against your staged diff:

cat <<'EOF' > .claude/commands/security-review.md
Run a security review against the currently staged changes.
Use the rules in security-review-prompt.md verbatim. Output as
GitHub-ready Markdown with file:line references.
EOF

SDK (Node)

For programmatic CI use, the typical pattern is a GitHub Action that diffs the PR against its base branch and posts a review comment:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// REVIEW_BASE_TEMPLATE is the base template above; buildPrompt and
// postPrComment are your own helpers for assembling the prompt and
// posting the GitHub comment.
const review = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 2048,
  system: REVIEW_BASE_TEMPLATE,
  messages: [
    { role: "user", content: buildPrompt(diff, rules) }
  ]
});

// Responses can contain multiple content blocks; keep only the text ones.
const text = review.content
  .filter(c => c.type === "text")
  .map(c => c.text)
  .join("\n");

// The "output exactly" sentinel from the base template makes this check safe.
if (text.startsWith("No findings")) process.exit(0);
postPrComment(text);

Two production tips:

  • Cache the system prompt. Anthropic's prompt caching drops the cost of repeated rule sets by ~90%. The base template plus rules is a perfect cache target — it's identical across every PR.
  • Run categories in parallel, not sequentially. A single 12-rule mega-prompt produces worse output than four specialised prompts of three rules each; each pass puts the model in the right frame for one category. See the sketch below.
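
A sketch combining both tips. The rules arrays are hypothetical collections of the prompts above, and cache_control marks the shared base template for Anthropic's prompt caching:

const categories = [securityRules, perfRules, correctnessRules, testRules];

const reviews = await Promise.all(
  categories.map((rules) =>
    client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 2048,
      // Identical across every PR and every category: an ideal cache target.
      system: [
        {
          type: "text",
          text: REVIEW_BASE_TEMPLATE,
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: buildPrompt(diff, rules) }],
    })
  )
);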

What Claude is good at vs what it isn't

Anthropic's Claude Sonnet 4.5 model card reports 77.2% on SWE-bench Verified, meaning the model can resolve real GitHub issues end-to-end on a substantial fraction of tasks. Code review is a meaningfully easier task than autonomous bug-fixing (you're not asked to write the patch, just to spot the issue), so expect strong results when the prompt is scoped well.

What it's reliably good at:

  • Pattern-matched bug classes the rule set names explicitly.
  • Walking through a diff systematically and flagging things humans skim past.
  • Catching missing branches in conditional logic when asked.
  • Spotting concurrency bugs in code under 200 lines.

What it's still weak at — and where you should not over-trust it:

  • Cross-file invariants. If a bug only exists because of an assumption in another file you didn't include in the context, Claude won't catch it.
  • Domain-specific correctness. "This rounding produces a $0.01 discrepancy in a billing edge case" requires the model to know your billing model. Either provide it in <context> or expect to miss this class.
  • Architectural critique. Claude is a sharp line-level reviewer; it's a much weaker architect. Don't ask it whether the design is good — ask it whether the code matches the design.

You should buy the prompt pack if…

The 12 prompts above are usable as-is and you can copy them into your own system today. The $15 Code Review Prompts pack is the right purchase if:

  • You want 50 of these, not 12 — broken into security, performance, tests, readability, docs, and accessibility — already organised for drop-in use.
  • You'd rather not spend an hour iterating on the wording yourself; we've already tested each prompt against real diffs and tightened the rules until they stopped producing false positives.
  • You want both the Markdown and JSON formats — Markdown for direct paste into Claude.ai, JSON for programmatic use in your CI pipeline.

It's the wrong purchase if you only need the handful of categories you can build yourself in 20 minutes. The 12 here will get most teams 80% of the way there.

Pair it with the Cold Outreach prompt pack ($15) if you do any sales work alongside engineering, or browse the full shop.

One last meta-prompt

The most useful prompt of all isn't on the list. After the model produces its review, follow up with:

Re-read your own review above. For each finding, ask: "Could this
be a false positive given the context provided?" Remove any finding
that you cannot defend with a specific line of code. Output the
filtered list.

This single self-critique pass typically removes 1-2 weak findings per review and dramatically increases trust in the remaining ones. It's the difference between a code-review tool engineers actually run and one that gets ignored after the third false alarm.

Related reading: HTML templates vs React · Build a developer portfolio in one afternoon
