Multi-AI Code Review

Overview

multi-ai-code-review runs a code review across multiple AI models acting as specialized agents, each analyzing the code from a different perspective. The approach draws on 2024-2025 practices for AI-assisted code review.

Purpose: Multi-perspective code quality assessment using AI ensemble with human oversight

Pattern: Task-based (5 independent review dimensions + orchestration)

Key Principles (validated by tri-AI research):

Multi-Agent Architecture - Specialized agents for each review dimension
LLM-as-Judge Consensus - Flag issues only when 2+ models agree
Progressive Severity - Critical → High → Medium → Low prioritization
Human-in-Loop - AI suggests, human decides
Quality Gates - Block merges for critical unresolved issues
Actionable Feedback - Every comment has What/Where/Why/How

Quality Targets:

False Positive Rate: <15%
Fix Acceptance Rate: >40%
Review Turnaround: <5 minutes
Bug Catch Rate: >30% pre-production
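
To check the first two targets against real review data, the rates can be computed from simple counts. This is a minimal sketch: the `ReviewStats` fields and the function names are hypothetical, and it assumes a human has labelled which findings were false positives and which suggested fixes were accepted.

```python
from dataclasses import dataclass


@dataclass
class ReviewStats:
    findings_total: int    # findings raised by the review
    findings_false: int    # findings a human marked as false positives
    fixes_suggested: int   # fixes the review proposed
    fixes_accepted: int    # proposed fixes that were actually applied


def false_positive_rate(stats: ReviewStats) -> float:
    """Share of raised findings that turned out to be false positives."""
    if stats.findings_total == 0:
        return 0.0
    return stats.findings_false / stats.findings_total


def fix_acceptance_rate(stats: ReviewStats) -> float:
    """Share of suggested fixes that were accepted."""
    if stats.fixes_suggested == 0:
        return 0.0
    return stats.fixes_accepted / stats.fixes_suggested


def meets_targets(stats: ReviewStats) -> bool:
    """True when both rate targets from the list above are met."""
    return false_positive_rate(stats) < 0.15 and fix_acceptance_rate(stats) > 0.40


stats = ReviewStats(findings_total=40, findings_false=4,
                    fixes_suggested=30, fixes_accepted=15)
print(false_positive_rate(stats), fix_acceptance_rate(stats), meets_targets(stats))
```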

When to Use

Use multi-ai-code-review when:

Reviewing pull requests (any size)
Auditing code quality before release
Establishing consistent code review standards
Security auditing code changes
Performance profiling changes
Technical debt assessment
Onboarding reviews (mentorship mode)

When NOT to Use:

Trivial changes (typos, comments only)
Automated dependency updates (use Dependabot labels)
Generated code (migrations, scaffolds)

Prerequisites

Required

Code to review (diff, file, or directory)
Claude available (required); Gemini and Codex are optional and only needed for the cross-model consensus operations

Integration

GitHub Actions (optional, for CI/CD)
Pre-commit hooks (optional, for local checks)

Review Dimensions

5-Dimensional Analysis

| Dimension | Agent | Focus | Weight |
|-----------|-------|-------|--------|
| Security | Security Specialist | OWASP Top 10, secrets, injection | 25% |
| Performance | Performance Engineer | Complexity, memory, latency | 20% |
| Maintainability | Architect | Patterns, modularity, DRY | 25% |
| Correctness | QA Engineer | Logic, edge cases, tests | 20% |
| Style | Nitpicker | Naming, formatting, conventions | 10% |

Severity Levels

| Level | Action | Examples |
|-------|--------|----------|
| Critical | Block merge | SQL injection, exposed secrets, data loss |
| High | Require fix | Race conditions, missing auth, memory leaks |
| Medium | Suggest fix | Code duplication, missing tests, complexity |
| Low | Optional | Style issues, naming, minor refactors |

Operations

Operation 1: Quick Security Scan

Time: 2-5 minutes
Automation: 80%
Purpose: Fast security-focused review

Process:

Scan for Critical Issues:

Review this code for security vulnerabilities:
- SQL injection
- XSS vulnerabilities
- Hardcoded secrets/API keys
- Authentication bypasses
- Authorization flaws
- Input validation gaps
- Insecure dependencies

Code:
[PASTE CODE OR DIFF]

For each issue found, provide:
- Severity (Critical/High/Medium)
- Location (file:line)
- Description (what's wrong)
- Fix (specific code change)
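
Model output does not always include every requested field, so it can help to check each finding before it goes into a report. The sketch below is one possible check, not part of the skill itself: the `Finding` class and the `problems` helper are hypothetical names, and the allowed severities are taken from the prompt above.

```python
from dataclasses import dataclass

# Severities the scan prompt asks for.
ALLOWED_SEVERITIES = {"Critical", "High", "Medium"}


@dataclass
class Finding:
    severity: str
    location: str      # expected form "file:line", e.g. "auth.py:45"
    description: str
    fix: str


def problems(finding: Finding) -> list[str]:
    """Return reasons the finding is incomplete; an empty list means usable."""
    issues = []
    if finding.severity not in ALLOWED_SEVERITIES:
        issues.append(f"unknown severity: {finding.severity!r}")
    path, sep, line = finding.location.rpartition(":")
    if not (sep and path and line.isdigit()):
        issues.append(f"location is not file:line: {finding.location!r}")
    if not finding.description.strip():
        issues.append("missing description")
    if not finding.fix.strip():
        issues.append("missing fix")
    return issues


good = Finding("Critical", "auth.py:45",
               "User input directly in query", "Use parameterized queries")
print(problems(good))
```

Findings that come back with a non-empty problem list can be sent back to the model for completion or dropped before the Gemini validation step.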

Validate with Gemini (optional):

gemini -p "Verify these security findings. Are any false positives?
[PASTE CLAUDE FINDINGS]

Code context:
[PASTE RELEVANT CODE]"

Output: Security report with consensus findings

Operation 2: Comprehensive PR Review

Time: 10-30 minutes
Automation: 60%
Purpose: Full multi-dimensional review

Process:

Step 1: Gather Context

# Get PR diff
git diff main...HEAD > /tmp/pr_diff.txt

# Identify affected areas
grep -E '^(\+\+\+|---) ' /tmp/pr_diff.txt | head -20
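
When the list of affected files is needed inside a script rather than on the terminal, the same information can be pulled out of the diff text. This is a minimal sketch that assumes git's default unified diff format with `a/` and `b/` path prefixes; `changed_files` is a hypothetical helper name.

```python
def changed_files(diff_text: str) -> list[str]:
    """Return the paths that still exist after the change, in diff order."""
    files: list[str] = []
    previous = ""
    for line in diff_text.splitlines():
        # A file header is a "--- old" line followed directly by a "+++ new" line.
        if line.startswith("+++ ") and previous.startswith("--- "):
            path = line[4:].strip()
            if path != "/dev/null":              # /dev/null marks a deleted file
                if path.startswith("b/"):
                    path = path[2:]
                if path not in files:
                    files.append(path)
        previous = line
    return files


sample = """\
--- a/auth.py
+++ b/auth.py
@@ -1 +1 @@
-old
+new
--- a/legacy.py
+++ /dev/null
--- /dev/null
+++ b/api/routes.py
"""
print(changed_files(sample))
```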

Step 2: Run Parallel Agent Reviews

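Use the Task tool to launch the agents in parallel. The prompt below covers three of the five dimensions; add Performance and Style agents in the same form if all five dimension scores are required: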

Launch 3 parallel review agents:

Agent 1 (Security):
"Review this diff for security issues. Focus on:
- OWASP Top 10 vulnerabilities
- Authentication/authorization
- Input validation
- Secrets exposure
Diff: [DIFF]"

Agent 2 (Maintainability):
"Review this diff for maintainability. Focus on:
- Design patterns used correctly
- Code duplication (DRY)
- Modularity and cohesion
- Documentation quality
Diff: [DIFF]"

Agent 3 (Correctness):
"Review this diff for correctness. Focus on:
- Logic errors
- Edge cases not handled
- Test coverage gaps
- Error handling
Diff: [DIFF]"
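
Outside the Task tool, the same fan-out can be approximated with a thread pool. The sketch below only illustrates the pattern: `run_agent` is a hypothetical placeholder that returns a dummy string, and a real version would call whichever model client or CLI is available.

```python
from concurrent.futures import ThreadPoolExecutor

# Topic wording and focus lists copied from the three agent prompts above.
AGENTS = {
    "security": ("security issues", [
        "OWASP Top 10 vulnerabilities", "Authentication/authorization",
        "Input validation", "Secrets exposure"]),
    "maintainability": ("maintainability", [
        "Design patterns used correctly", "Code duplication (DRY)",
        "Modularity and cohesion", "Documentation quality"]),
    "correctness": ("correctness", [
        "Logic errors", "Edge cases not handled",
        "Test coverage gaps", "Error handling"]),
}


def build_prompt(dimension: str, diff: str) -> str:
    """Assemble the review prompt for one dimension."""
    topic, focus = AGENTS[dimension]
    bullets = "\n".join(f"- {item}" for item in focus)
    return f"Review this diff for {topic}. Focus on:\n{bullets}\nDiff: {diff}"


def run_agent(prompt: str) -> str:
    # Hypothetical stand-in for a model call; replace with a real client.
    return f"(stub) received a prompt of {len(prompt)} characters"


def review_in_parallel(diff: str, runner=run_agent) -> dict[str, str]:
    """Run one reviewer per dimension concurrently and collect the outputs."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {dim: pool.submit(runner, build_prompt(dim, diff)) for dim in AGENTS}
        return {dim: future.result() for dim, future in futures.items()}


print(review_in_parallel("+ password = 'hunter2'"))
```

Threads fit here because each reviewer spends its time waiting on a network or subprocess call rather than on computation.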

Step 3: Orchestrate & Deduplicate

Synthesize findings from all agents:
[PASTE ALL AGENT OUTPUTS]

Tasks:
1. Remove duplicate findings
2. Rank by severity (Critical > High > Medium > Low)
3. Group by file
4. Generate summary table
5. Create final report with consensus issues only
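
If the agents return structured findings, tasks 1 to 3 can also be done deterministically instead of by prompt. The following is a sketch under that assumption: each finding is a dict with hypothetical keys `file`, `line`, `severity`, and `title`, and two findings count as duplicates when file, line, and case-insensitive title match.

```python
from collections import defaultdict

SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def synthesize(findings: list[dict]) -> dict[str, list[dict]]:
    """Deduplicate findings, rank them by severity, and group them by file."""
    best: dict[tuple, dict] = {}
    for finding in findings:
        key = (finding["file"], finding["line"], finding["title"].lower())
        kept = best.get(key)
        # On a duplicate, keep the copy with the more severe rating.
        if kept is None or SEVERITY_ORDER[finding["severity"]] < SEVERITY_ORDER[kept["severity"]]:
            best[key] = finding

    ranked = sorted(
        best.values(),
        key=lambda f: (SEVERITY_ORDER[f["severity"]], f["file"], f["line"]),
    )

    by_file: dict[str, list[dict]] = defaultdict(list)
    for finding in ranked:
        by_file[finding["file"]].append(finding)
    return dict(by_file)


report = synthesize([
    {"file": "api.py", "line": 10, "severity": "Medium", "title": "Missing tests"},
    {"file": "auth.py", "line": 45, "severity": "High", "title": "SQL injection"},
    {"file": "auth.py", "line": 45, "severity": "Critical", "title": "SQL Injection"},
])
for path, items in report.items():
    print(path, [(f["line"], f["severity"]) for f in items])
```

Exact-key matching is a simplification; different agents often describe the same issue in different words, so a real pipeline would likely need looser matching.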

Step 4: Generate Report

Output format:

## PR Review Summary

| File | Risk | Issues | Critical | High | Medium |
|------|------|--------|----------|------|--------|
| auth.py | High | 3 | 1 | 2 | 0 |
| api.py | Medium | 2 | 0 | 1 | 1 |

### Critical Issues (Block Merge)
1. **[auth.py:45]** SQL Injection vulnerability
   - Why: User input directly in query
   - Fix: Use parameterized queries

### High Issues (Require Fix)
...

### Consensus Score: 73/100
- Security: 65/100
- Performance: 80/100
- Maintainability: 70/100
- Correctness: 75/100
- Style: 85/100

Operation 3: LLM-as-Judge Tribunal

Time: 5-15 minutes
Automation: 70%
Purpose: High-confidence findings through consensus

Process:

Run Code Through Multiple Models:

Claude Analysis:

Analyze this code for issues. Rate severity 1-10 for each:
[CODE]

Gemini Analysis (via CLI):

gemini -p "Analyze this code for issues. Rate severity 1-10 for each:
[CODE]"

Codex Analysis (via CLI):

codex "Analyze this code for issues. Rate severity 1-10 for each:
[CODE]"

Calculate Consensus:

Given these analyses from 3 AI models:

Claude: [FINDINGS]
Gemini: [FINDINGS]
Codex: [FINDINGS]

Identify issues where at least 2 models agree:
1. List consensus findings
2. Average severity scores
3. Note any disagreements
4. Final verdict for each issue

Output: High-confidence issue list (each issue confirmed by at least 2 of the 3 models)
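
The consensus step can also be computed directly once each model's findings are in structured form. This is a rough sketch, assuming every finding is a dict with hypothetical keys `file`, `line`, `issue`, and a 1-10 `severity`, and that models describe the same issue with the same `issue` label, which real output will not always do.

```python
from collections import defaultdict
from statistics import mean


def consensus(reports: dict[str, list[dict]], min_models: int = 2) -> list[dict]:
    """Keep issues reported by at least `min_models` models; average their severity."""
    votes: dict[tuple, dict[str, int]] = defaultdict(dict)
    for model, findings in reports.items():
        for finding in findings:
            key = (finding["file"], finding["line"], finding["issue"])
            votes[key][model] = finding["severity"]

    agreed = []
    for (file, line, issue), by_model in votes.items():
        if len(by_model) >= min_models:
            agreed.append({
                "file": file,
                "line": line,
                "issue": issue,
                "models": sorted(by_model),
                "severity": round(mean(by_model.values()), 1),
            })
    # Most severe issues first.
    return sorted(agreed, key=lambda item: -item["severity"])


reports = {
    "claude": [{"file": "auth.py", "line": 67, "issue": "plaintext-password", "severity": 9},
               {"file": "auth.py", "line": 12, "issue": "long-function", "severity": 4}],
    "gemini": [{"file": "auth.py", "line": 67, "issue": "plaintext-password", "severity": 8}],
    "codex":  [{"file": "auth.py", "line": 67, "issue": "plaintext-password", "severity": 10}],
}
print(consensus(reports))
```

Issues reported by a single model are dropped here; they could instead be kept in a separate list for the "note any disagreements" step.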

Operation 4: Mentorship Review

Time: 15-30 minutes
Automation: 40%
Purpose: Educational code review for learning

Process:

Review this code in mentorship mode. For a developer learning [LANGUAGE/FRAMEWORK]:

Code: [CODE]

For each finding:
1. **What's the issue** (be encouraging, not critical)
2. **Why it matters** (explain the underlying concept)
3. **How to improve** (show before/after with explanation)
4. **Learn more** (link to relevant documentation)

Also highlight:
- What was done well
- Good patterns to continue using
- Growth opportunities

Tone: Supportive and educational, never condescending.

Operation 5: Pre-Release Audit

Time: 30-60 minutes
Automation: 50%
Purpose: Comprehensive review before production

Process:

Full Codebase Scan:

# Identify all changes since last release
git diff v1.0.0...HEAD --stat
git log v1.0.0..HEAD --oneline

Security Deep Dive:

Run all security checks
Verify no new vulnerabilities
Check dependency updates
Audit secrets management

Performance Review:

Identify potential bottlenecks
Review database queries
Check for N+1 problems
Validate caching strategies

Test Coverage:

Verify test coverage targets
Check critical path coverage
Validate edge case tests

Generate Release Report:

## Pre-Release Audit: v1.1.0

### Security Clearance: PASS ✓
- No critical vulnerabilities
- All high issues resolved
- Secrets audit: Clean

### Performance Assessment: PASS ✓
- No new N+1 queries
- Response time within SLA
- Memory usage stable

### Test Coverage: 82% (target: 80%)
- Critical paths: 95%
- Edge cases: 78%

### Release Recommendation: APPROVED

Multi-AI Coordination

Agent Assignment Strategy

| Task | Primary | Verification | Speed |
|------|---------|--------------|-------|
| Security scan | Claude | Gemini | Fast |
| Architecture review | Claude | Codex | Medium |
| Logic validation | Codex | Claude | Medium |
| Style checking | Gemini | Claude | Fast |
| Performance analysis | Claude | Codex | Medium |

Coordination Commands

Launch Multi-Agent Review:

# Using Task tool for parallel execution
# Each agent reviews independently, orchestrator synthesizes

Gemini Quick Check:

gemini -p "Quick security scan of this code: [CODE]"

Codex Deep Analysis:

codex "Analyze this code architecture and suggest improvements: [CODE]"

CI/CD Integration

GitHub Actions Workflow

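# .github/workflows/ai-review.yml
name: Multi-AI Code Review
on: [pull_request]

# The summary step posts a PR comment, which needs write access to pull requests.
permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so the base branch is available for the diff

      - name: Get PR Diff
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          # Diff against the PR's actual base branch rather than assuming main
          git diff "origin/${BASE_REF}...HEAD" > pr_diff.txt

      - name: Claude Review
        uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          # Input names differ between versions of this action; confirm `model`
          # and `review_level` against the action's README before relying on them.
          model: "claude-sonnet-4-5-20250929"
          review_level: "detailed"

      - name: Post Summary
        uses: actions/github-script@v7
        with:
          script: |
            // REVIEW_SUMMARY is not set by any step above; export it from the
            // review step (for example via $GITHUB_ENV) before this step runs.
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## AI Review Summary\n${process.env.REVIEW_SUMMARY}`
            })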

Quality Gate Configuration

# Block merge for critical issues
quality_gates:
  critical_issues: 0      # Must be zero
  high_issues: 3          # Max allowed
  coverage_minimum: 80    # Percent
  score_minimum: 70       # Out of 100
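
How a CI step might evaluate these thresholds is sketched below. It is an illustration only: the `gate_failures` function is hypothetical, and the thresholds are copied into a dict rather than parsed from the YAML so that the example has no third-party dependency.

```python
# Same thresholds as the quality_gates block above.
GATES = {
    "critical_issues": 0,
    "high_issues": 3,
    "coverage_minimum": 80,
    "score_minimum": 70,
}


def gate_failures(critical: int, high: int, coverage: float, score: float,
                  gates: dict = GATES) -> list[str]:
    """Return a message per violated gate; an empty list means the merge may proceed."""
    failures = []
    if critical > gates["critical_issues"]:
        failures.append(f"{critical} critical issue(s), maximum is {gates['critical_issues']}")
    if high > gates["high_issues"]:
        failures.append(f"{high} high issue(s), maximum is {gates['high_issues']}")
    if coverage < gates["coverage_minimum"]:
        failures.append(f"coverage {coverage}% is below {gates['coverage_minimum']}%")
    if score < gates["score_minimum"]:
        failures.append(f"score {score} is below {gates['score_minimum']}")
    return failures


print(gate_failures(critical=0, high=2, coverage=82, score=73))
print(gate_failures(critical=1, high=0, coverage=75, score=90))
```

A CI job could exit with a non-zero status whenever the returned list is non-empty, which is what actually blocks the merge when the check is marked as required.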

Quality Scoring

Scoring Formula

Overall = (Security × 0.25) + (Performance × 0.20) +
          (Maintainability × 0.25) + (Correctness × 0.20) +
          (Style × 0.10)
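
As a worked instance of the formula, the sketch below computes the overall score from five dimension scores. The weights come from the formula above, expressed as integer percentages to avoid floating-point drift; the function name and the sample scores are made up for illustration.

```python
# Weights from the scoring formula, as integer percentages (they sum to 100).
WEIGHTS = {
    "security": 25,
    "performance": 20,
    "maintainability": 25,
    "correctness": 20,
    "style": 10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of the five dimension scores, each on a 0-100 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items()) / 100


example = {"security": 90, "performance": 80, "maintainability": 70,
           "correctness": 60, "style": 50}
# (90*25 + 80*20 + 70*25 + 60*20 + 50*10) / 100 = 7300 / 100
print(overall_score(example))  # 73.0
```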

Grade Mapping

| Score | Grade | Status |
|-------|-------|--------|
| ≥90 | A | Excellent - Ship it |
| 80-89 | B | Good - Minor fixes |
| 70-79 | C | Acceptable - Address issues |
| 60-69 | D | Needs work - Significant fixes |
| <60 | F | Failing - Major revision needed |
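
The grade lookup can be sketched as a walk over the bands from highest to lowest. One assumption is made beyond the table: fractional scores such as 89.5 fall into the band whose lower bound they meet, so 89.5 is treated as a B. The `grade` function name is hypothetical.

```python
# (lower bound, grade, status), highest band first.
GRADE_BANDS = [
    (90, "A", "Excellent - Ship it"),
    (80, "B", "Good - Minor fixes"),
    (70, "C", "Acceptable - Address issues"),
    (60, "D", "Needs work - Significant fixes"),
    (0, "F", "Failing - Major revision needed"),
]


def grade(score: float) -> tuple[str, str]:
    """Map a 0-100 score to its letter grade and status text."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, letter, status in GRADE_BANDS:
        if score >= lower_bound:
            return letter, status
    raise AssertionError("unreachable: the last band starts at 0")


print(grade(73))
print(grade(59))
```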

Anti-Patterns to Detect

Hardcoded Secrets - API keys, passwords in code
SQL Injection - Unparameterized queries
XSS Vulnerabilities - Unsanitized output
Race Conditions - Unprotected shared state
Memory Leaks - Unclosed resources
N+1 Queries - Loop database calls
Dead Code - Unreachable branches
God Objects - Classes doing too much
Copy-Paste Code - Duplicated logic
Missing Error Handling - Unhandled exceptions
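
To make one of these concrete, the sketch below contrasts an unparameterized query with a parameterized one, using Python's built-in `sqlite3` module and an in-memory database. The table, the function names, and the payload are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "user")])


def find_user_unsafe(name: str) -> list[tuple]:
    # Anti-pattern: user input is spliced into the SQL text.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list[tuple]:
    # Fix: the value is passed as a bound parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # 2: the injected OR clause matches every row
print(len(find_user_safe(payload)))    # 0: no user has that literal name
```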

Example Review Session

User: Review this PR for my authentication module

Claude: I'll perform a comprehensive multi-dimensional review.

[Launches parallel agents for security, maintainability, correctness]

## PR Review: Authentication Module

### Critical Issues (1)
1. **[auth.py:67]** Password stored in plaintext
   - Severity: Critical
   - Consensus: 3/3 models agree
   - Fix: Use bcrypt hashing
   ```python
   # Before: the plaintext password is stored as-is
   user.password = request.password

   # After: store a salted bcrypt hash instead (requires the bcrypt package)
   import bcrypt
   user.password = bcrypt.hashpw(request.password.encode(), bcrypt.gensalt())
   ```

### High Issues (2)
1. **[auth.py:45]** No rate limiting on login endpoint
2. **[auth.py:89]** JWT secret hardcoded

### Quality Score: 59/100 (Grade: F)
- Security: 35/100 (Critical issues)
- Performance: 70/100
- Maintainability: 65/100
- Correctness: 60/100
- Style: 80/100

### Recommendation: BLOCK MERGE

Resolve critical security issues before merging.


---

## Related Skills

- **multi-ai-testing**: Generate tests for reviewed code
- **multi-ai-verification**: Validate fixes
- **multi-ai-implementation**: Implement suggested fixes
- **codex-review**: Codex-specific review patterns
- **review-multi**: Skill-specific reviews

---

## References

- `references/security-checklist.md` - OWASP Top 10 checklist
- `references/performance-patterns.md` - Performance anti-patterns
- `references/ci-cd-integration.md` - Full CI/CD setup guide

