skills/aidev-studying-ai-coding/SKILL.md
Analyze AI coding agent activity on GitHub repositories using the AIDev methodology. Identify agentic PRs, measure agent adoption metrics, evaluate PR quality, assess review dynamics, and benchmark human-AI collaboration patterns. Use when: 'analyze AI agent PRs in this repo', 'measure AI coding adoption', 'evaluate agentic PR quality', 'compare AI agent contributions', 'audit AI-generated code patterns', 'assess review burden of AI PRs'.
Install this skill globally with one command: npx skillsauth add ndpvt-web/arxiv-claude-skills aidev-studying-ai-coding
Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically analyze how AI coding agents (Codex, Devin, GitHub Copilot, Cursor, Claude Code) contribute to software repositories. Using the classification framework and metrics from the AIDev study of 932,791 agentic pull requests across 116,211 repositories, Claude can identify agent-authored PRs, measure adoption patterns, evaluate code quality signals, assess review dynamics, and surface failure risks — giving teams actionable insight into their human-AI collaboration effectiveness.
The AIDev methodology identifies agentic PRs — pull requests authored or substantially produced by AI coding agents — by matching known agent signatures in PR metadata: author accounts, commit trailers, bot tags, and PR body patterns specific to each agent (e.g., Devin's devin-ai-integration accounts, Copilot's copilot trailer tags, Claude Code's Co-Authored-By trailers). This signature-based detection is the foundation: without reliable identification, no downstream analysis is meaningful.
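The signature matching described above can be sketched as a small lookup over PR metadata. This is a minimal illustration, not the study's actual rule set: the regular expressions, the `AGENT_SIGNATURES` table, and the `detect_agent` helper are all assumptions derived from the signatures named in the text.

```python
import re

# Illustrative patterns only: assumptions based on the signatures named in
# the text, not the AIDev study's actual detection rules.
AGENT_SIGNATURES = {
    "Devin": {"author": r"^devin-ai-integration"},
    "Codex": {"trailer": r"openai-codex"},
    "GitHub Copilot": {"author": r"^(github-)?copilot"},
    "Cursor": {"author": r"cursor"},
    "Claude Code": {
        "trailer": r"^co-authored-by:.*\bclaude\b",
        "body": r"claude code",
    },
}


def detect_agent(author, body, commit_messages):
    """Return (agent, matched_signal_kinds) for the best match, or (None, [])."""
    fields = {
        "author": author or "",
        "body": body or "",
        "trailer": "\n".join(commit_messages),
    }
    best = (None, [])
    for agent, patterns in AGENT_SIGNATURES.items():
        hits = [
            kind
            for kind, pattern in patterns.items()
            if re.search(pattern, fields[kind], re.IGNORECASE | re.MULTILINE)
        ]
        # Prefer the agent with the most independent signals matched.
        if len(hits) > len(best[1]):
            best = (agent, hits)
    return best
```

Counting matched signals gives a rough confidence measure: a PR matched on both a commit trailer and a body marker is a stronger call than one matched on a single field.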
Once agentic PRs are identified, the framework applies multi-dimensional characterization across five domains: (1) adoption demographics and practices, (2) code patch characteristics including additions/deletions ratios and file-type distributions, (3) testing behavior measured by test-to-code churn ratios, (4) review dynamics including comment types, resolution rates, and time-to-merge, and (5) failure patterns such as reversion frequency, CI failures, and security risks. PRs are classified by purpose using Conventional Commits categories (feat, fix, refactor, test, docs, chore, etc.), enabling structured comparison across agents and projects.
The critical insight is that raw merge rate is insufficient for evaluating AI agent effectiveness. A holistic assessment requires examining the full lifecycle: Does the PR include tests? Does it pass CI? How many review rounds does it need? Is it reverted later? Does it introduce security issues? Teams that only track "did it merge" miss the review burden, maintenance cost, and risk dimensions that determine actual productivity impact.
Collect PR metadata from the target repository with the GitHub CLI (gh pr list --state all --json number,title,author,labels,body,commits,reviews,state,createdAt,mergedAt,statusCheckRollup). Without --state all, gh pr list returns only open PRs. Retrieve at minimum: author login, PR title, body, commit messages, review comments, CI status, and merge state.
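As a rough sketch, the collection step could be wrapped in Python like this. The helper names `build_pr_list_command` and `fetch_prs`, the default limit, and the exact field list are assumptions for illustration. `fetch_prs` requires an installed and authenticated gh CLI.

```python
import json
import subprocess

# Fields requested from gh; adjust to the analysis you need (assumed set).
PR_FIELDS = [
    "number", "title", "author", "labels", "body", "commits",
    "reviews", "state", "createdAt", "mergedAt", "statusCheckRollup",
]


def build_pr_list_command(repo, limit=1000):
    """Build the gh invocation that lists PRs in every state as JSON."""
    return [
        "gh", "pr", "list",
        "--repo", repo,
        "--state", "all",
        "--limit", str(limit),
        "--json", ",".join(PR_FIELDS),
    ]


def fetch_prs(repo, limit=1000):
    """Run gh and parse its JSON output into a list of dicts."""
    result = subprocess.run(
        build_pr_list_command(repo, limit),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Separating command construction from execution keeps the part that needs network access and credentials isolated, so the rest of the pipeline can be tested on stored JSON.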
Identify agentic PRs by matching agent signatures against known patterns:
- Codex: codex, commit trailers with openai-codex
- Devin: devin-ai-integration/* or PR bodies referencing Devin sessions
- GitHub Copilot: copilot, author copilot or github-copilot
- Cursor: cursor, author patterns with cursor
- Claude Code: Co-Authored-By trailers mentioning Claude, PR bodies referencing Claude Code sessions
Classify each agentic PR by purpose using Conventional Commits parsing on the PR title and commit messages. Map to categories: feat (new feature), fix (bug fix), refactor, test, docs, chore, ci, style, perf. For PRs without conventional prefixes, infer the category from the diff: mostly additions suggests feat, mostly deletions suggests refactor or cleanup, and changes confined to test files indicate test.
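The purpose classification can be sketched as a prefix parse with the diff-based fallback described above. The `classify_purpose` function, its parameters, and the simple "path contains test" check are hypothetical choices for illustration.

```python
import re

CONVENTIONAL_TYPES = {
    "feat", "fix", "refactor", "test", "docs", "chore", "ci", "style", "perf",
}
# Matches prefixes such as "fix: ", "feat(api): " and "feat(api)!: ".
PREFIX = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:\s")


def classify_purpose(title, additions=0, deletions=0, files=()):
    """Classify a PR by Conventional Commits prefix, else by diff shape."""
    match = PREFIX.match(title.strip().lower())
    if match and match.group("type") in CONVENTIONAL_TYPES:
        return match.group("type")
    # Fallback heuristics (assumed thresholds: a simple majority).
    if files and all("test" in path.lower() for path in files):
        return "test"
    if additions > deletions:
        return "feat"
    if deletions > additions:
        return "refactor"
    return "unknown"
```

The fallback is deliberately coarse. Results inferred from diff shape are best reported separately from those parsed from an explicit prefix.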
Compute code patch metrics for each PR: total files changed, lines added, lines deleted, add/delete ratio, file types touched (source vs. config vs. test vs. docs), and whether the PR modifies only generated or boilerplate files versus core logic.
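A minimal sketch of these patch metrics follows. It assumes each changed file is a dict with `path`, `additions`, and `deletions` keys. The `categorize_file` rules and extension lists are illustrative and would need tuning per repository.

```python
from collections import Counter
from pathlib import PurePosixPath

DOC_SUFFIXES = {".md", ".rst", ".txt"}
CONFIG_SUFFIXES = {".json", ".yml", ".yaml", ".toml", ".ini", ".cfg"}
TEST_DIRS = {"test", "tests", "__tests__"}


def categorize_file(path):
    """Bucket a path as test, docs, config, or source (heuristic rules)."""
    p = PurePosixPath(path.lower())
    name = p.name
    if (TEST_DIRS & set(p.parts) or name.startswith("test_")
            or ".test." in name or ".spec." in name):
        return "test"
    if p.suffix in DOC_SUFFIXES or "docs" in p.parts:
        return "docs"
    if p.suffix in CONFIG_SUFFIXES:
        return "config"
    return "source"


def patch_metrics(files):
    """Summarize size and file-type mix of one PR's changed files."""
    added = sum(f["additions"] for f in files)
    deleted = sum(f["deletions"] for f in files)
    return {
        "files_changed": len(files),
        "lines_added": added,
        "lines_deleted": deleted,
        # None when nothing was deleted, to avoid dividing by zero.
        "add_delete_ratio": added / deleted if deleted else None,
        "file_kinds": dict(Counter(categorize_file(f["path"]) for f in files)),
    }
```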
Evaluate testing behavior by checking: Does the PR include changes to test files? What is the test-to-source churn ratio (lines changed in test files / lines changed in source files)? Does the PR add new test files or only modify existing ones? Flag PRs that change source code without any corresponding test changes.
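The churn ratio can be computed directly from per-file line counts. In this sketch, the `is_test_file` heuristic, the `SOURCE_EXTENSIONS` tuple, and the `churn_ratio` name are assumptions. Files that are neither tests nor recognized source (for example docs) are ignored.

```python
SOURCE_EXTENSIONS = (".py", ".ts", ".js", ".go", ".java", ".rs")


def is_test_file(path):
    """Heuristic test-file check based on directory and file naming."""
    lowered = "/" + path.lower()
    name = lowered.rsplit("/", 1)[-1]
    return (
        "/tests/" in lowered or "/test/" in lowered
        or name.startswith("test_") or ".test." in name or ".spec." in name
    )


def churn_ratio(files):
    """Compute test-to-source churn and flag untested source changes."""
    test_lines = source_lines = 0
    for f in files:
        churn = f["additions"] + f["deletions"]
        if is_test_file(f["path"]):
            test_lines += churn
        elif f["path"].lower().endswith(SOURCE_EXTENSIONS):
            source_lines += churn
    return {
        "test_lines": test_lines,
        "source_lines": source_lines,
        # None when no source lines changed, to avoid dividing by zero.
        "ratio": test_lines / source_lines if source_lines else None,
        "untested_source_change": source_lines > 0 and test_lines == 0,
    }
```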
Analyze review dynamics by extracting: number of review rounds before merge, types of review comments (correctness, style, security, testing, performance), whether the agent responded to review feedback with follow-up commits, time from PR creation to first review, and time from first review to merge.
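The timing parts of this step can be sketched as below. The code assumes PR records shaped like gh JSON output, with `createdAt`, `mergedAt`, and a `reviews` list carrying `submittedAt` and `state`. Counting one review round per CHANGES_REQUESTED review plus a final round is an assumed definition, since the text does not define a round.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2025-08-01T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def hours_between(start, end):
    return (end - start).total_seconds() / 3600


def review_dynamics(pr):
    """Derive review-round and latency metrics for a single PR."""
    created = parse_ts(pr["createdAt"])
    merged = parse_ts(pr["mergedAt"]) if pr.get("mergedAt") else None
    reviews = [r for r in pr.get("reviews", []) if r.get("submittedAt")]
    first = min((parse_ts(r["submittedAt"]) for r in reviews), default=None)
    change_requests = sum(r.get("state") == "CHANGES_REQUESTED" for r in reviews)
    return {
        # Assumed definition: each change request opens a new round.
        "review_rounds": change_requests + 1 if reviews else 0,
        "hours_to_first_review": hours_between(created, first) if first else None,
        "hours_first_review_to_merge": (
            hours_between(first, merged) if first and merged else None
        ),
    }
```

Classifying comment types (correctness, style, security, and so on) needs manual labeling or a text classifier and is left out of this sketch.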
Detect failure signals by checking: CI pipeline pass/fail status, whether the PR was later reverted (search for revert commits referencing the PR), whether hotfix PRs followed shortly after merge, and whether the diff introduces patterns associated with security risks (hardcoded secrets, insecure API usage, disabled security checks).
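Two of these checks, revert references and risky diff patterns, lend themselves to a simple text scan. The regular expressions below are illustrative assumptions, not a complete or authoritative rule set. A dedicated secret scanner would be more reliable in practice.

```python
import re

# Matches text such as "Reverts myorg/myrepo#4521" or "revert #12".
REVERT_REF = re.compile(r"\breverts?\s+(?:[\w.-]+/[\w.-]+)?#(\d+)", re.IGNORECASE)

# Illustrative risk patterns (assumptions; expect false positives).
RISK_PATTERNS = {
    "hardcoded_secret": re.compile(
        r"(?i)(password|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
    "disabled_tls_verification": re.compile(r"verify\s*=\s*False"),
}


def reverted_pr_numbers(texts):
    """Collect PR numbers referenced by revert PR bodies or commit messages."""
    return {int(n) for text in texts for n in REVERT_REF.findall(text)}


def risky_added_lines(diff_text):
    """Scan only added lines of a unified diff for risk patterns."""
    findings = []
    for line in diff_text.splitlines():
        # Skip removed/context lines and the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in RISK_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line[1:].strip()))
    return findings
```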
Aggregate into a report with per-agent and per-category breakdowns: merge rate, median time-to-merge, median review rounds, test coverage rate, CI pass rate, reversion rate, and review burden score (comments per PR). Compare against the repository's human-authored PR baselines for the same period.
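A minimal aggregation sketch follows. It assumes each PR has already been reduced to a flat record, and the key names (`agent`, `merged`, `hours_to_merge`, `has_tests`, `ci_passed`, `reverted`, `comments`) are hypothetical. Human-authored PRs carry `agent=None` and form the baseline group.

```python
from collections import defaultdict
from statistics import median


def summarize(prs):
    """Compute headline metrics for one non-empty group of PR records."""
    n = len(prs)
    merged = [p for p in prs if p["merged"]]
    return {
        "count": n,
        "merge_rate": len(merged) / n,
        "median_hours_to_merge": (
            median(p["hours_to_merge"] for p in merged) if merged else None
        ),
        "test_inclusion_rate": sum(p["has_tests"] for p in prs) / n,
        "ci_pass_rate": sum(p["ci_passed"] for p in prs) / n,
        # Reversion is only defined for PRs that were merged.
        "reversion_rate": (
            sum(p["reverted"] for p in merged) / len(merged) if merged else None
        ),
        "comments_per_pr": sum(p["comments"] for p in prs) / n,
    }


def report(prs):
    """Group records by agent, with human PRs as the comparison baseline."""
    groups = defaultdict(list)
    for p in prs:
        groups[p["agent"] or "human"].append(p)
    return {name: summarize(items) for name, items in groups.items()}
```

The same `summarize` function can be reused for per-category breakdowns by grouping on the purpose label instead of the agent.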
Surface actionable recommendations based on the analysis: identify which agent-task combinations perform best, flag categories with high review burden or reversion rates, recommend where human oversight should be increased, and suggest process improvements (e.g., requiring test changes for agent feat PRs).
Export structured data as JSON or CSV for integration with dashboards, further analysis, or time-series tracking of AI adoption trends.
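Export needs only the standard library. The sketch below serializes a list of flat row dicts. The `to_json` and `to_csv` helper names and the alphabetical column order are assumptions for illustration.

```python
import csv
import io
import json


def to_json(rows):
    """Serialize rows as pretty-printed JSON with stable key order."""
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows):
    """Serialize rows as CSV; columns are the sorted union of all keys."""
    if not rows:
        return ""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()
```

Keeping column order stable makes successive exports easy to diff and append to a time series.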
Example 1: Repository AI Adoption Audit
User: "Analyze AI coding agent usage in our repository over the last 6 months"
Approach:
gh pr list
Output:
## AI Agent PR Analysis: myorg/myrepo (Aug 2025 - Jan 2026)
### Adoption Summary
- Total PRs: 1,247
- Agentic PRs: 183 (14.7%)
- Agents detected: Claude Code (98), GitHub Copilot (52), Cursor (33)
### Monthly Trend
| Month | Total PRs | Agentic PRs | % Agentic |
|----------|-----------|-------------|-----------|
| Aug 2025 | 189 | 12 | 6.3% |
| Sep 2025 | 211 | 19 | 9.0% |
| Oct 2025 | 198 | 27 | 13.6% |
| Nov 2025 | 223 | 34 | 15.2% |
| Dec 2025 | 201 | 41 | 20.4% |
| Jan 2026 | 225 | 50 | 22.2% |
### By Purpose (Conventional Commits)
- fix: 72 (39.3%) — most common agent task
- feat: 48 (26.2%)
- refactor: 31 (16.9%)
- docs: 18 (9.8%)
- chore/ci: 14 (7.7%)
### Key Findings
- Agent adoption more than tripled over 6 months (6.3% → 22.2%)
- Claude Code is the dominant agent (53.6% of agentic PRs)
- Bug fixes are the most common agent task type
- 3 developers account for 61% of all agentic PRs
Example 2: Agentic PR Quality Assessment
User: "Evaluate the quality of AI-generated PRs compared to human PRs in this repo"
Approach:
Output:
## PR Quality Comparison: Agentic vs. Human
| Metric | Agentic (n=183) | Human (n=1064) |
|-------------------------------|-----------------|----------------|
| Merge rate | 78.1% | 84.3% |
| CI pass on first push | 62.8% | 71.2% |
| Median review rounds | 2.1 | 1.4 |
| PRs including test changes | 34.4% | 58.7% |
| Median lines changed | 87 | 143 |
| Reversion within 7 days | 4.9% | 2.1% |
| Median time-to-merge (hours) | 18.3 | 22.7 |
### Review Comment Analysis (curated subset, n=89 agentic PRs)
- Correctness issues raised: 41.6% of agentic PRs
- Style/formatting issues: 23.6%
- Missing test coverage flagged: 33.7%
- Security concerns raised: 5.6%
### Recommendations
1. REQUIRE test changes for agentic feat/fix PRs — 65.6% ship without tests
2. ADD pre-merge CI gate — 37.2% fail CI on first push
3. INCREASE review scrutiny for refactor PRs — highest reversion rate (8.1%)
4. MONITOR security-flagged PRs — small count but high severity
Example 3: Identifying Agent Signatures in a PR
User: "Is this PR authored by an AI agent? Check PR #4521"
Approach:
Output:
## Agent Detection: PR #4521
**Result: Agentic PR detected — Claude Code (HIGH confidence)**
Signals matched:
[x] Commit trailer: "Co-Authored-By: Claude <noreply@anthropic.com>"
[x] PR body contains: "Generated with Claude Code"
[ ] Author account: standard user account (john-dev)
[ ] No Devin/Codex/Copilot/Cursor markers found
Classification: fix (from commit prefix "fix: resolve null pointer in auth middleware")
Files changed: 3 (src/auth/middleware.ts, src/auth/middleware.test.ts, CHANGELOG.md)
Test included: Yes (middleware.test.ts modified)
CI status: Passing
… fix PRs against human feat PRs produces misleading results. Always compare like-for-like.
… gh api --paginate and store intermediate JSON to avoid re-fetching.
… gh CLI is authenticated with appropriate scopes. If reviews or comments return 404, the token may lack repo scope.
Paper: Li, H., Zhang, H., & Hassan, A.E. (2026). AIDev: Studying AI Coding Agents on GitHub. arXiv:2602.09185v1. https://arxiv.org/abs/2602.09185v1
Dataset: Available on Hugging Face and Zenodo (DOI: 10.5281/zenodo.16899501).
Key takeaway: See Section 3 (dataset construction) for the agent identification methodology, Section 4 (research opportunities) for the five-domain analysis framework, and the 14 data tables for replicable metric computation on your own repositories.