skills/43-wentorai-research-plugins/skills/domains/cs/software-engineering-research/SKILL.md
Guide to software engineering research topics and methodologies
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research software-engineering-research
Navigate the landscape of software engineering research, including key subfields, methodologies, datasets, benchmarks, and top venues.
| Subfield | Key Topics | Major Venues |
|----------|-----------|-------------|
| Software Testing | Test generation, fuzzing, mutation testing, flaky tests | ISSTA, ICST, ASE |
| Program Analysis | Static analysis, abstract interpretation, symbolic execution | PLDI, POPL, OOPSLA |
| Software Maintenance | Code refactoring, technical debt, code smells, evolution | ICSME, MSR, SANER |
| SE for AI/ML | ML pipeline testing, data quality, model debugging | ICSE-SEIP, FSE |
| AI for SE | Code generation, bug detection, program repair | ICSE, FSE, ASE |
| Distributed Systems | Consensus, fault tolerance, scalability, microservices | SOSP, OSDI, EuroSys |
| Cybersecurity | Vulnerability detection, malware analysis, privacy | IEEE S&P, CCS, USENIX Security |
| HCI in SE | Developer tools, IDE usability, code comprehension | CHI, CSCW, VL/HCC |
| Empirical SE | Mining repositories, developer surveys, controlled experiments | ESEM, MSR, TOSEM |
Testing a specific hypothesis with treatment and control groups:
Example: Does AI code completion improve developer productivity?
Design:
- Participants: 60 professional developers
- Treatment: IDE with AI code completion enabled
- Control: IDE with AI code completion disabled
- Task: Complete 5 programming tasks of varying difficulty
- Metrics: Task completion time, code correctness, lines of code
- Analysis: Mixed-effects linear model with participant as random effect
Threats to validity:
- Internal: Learning effect (counterbalance task order)
- External: Lab setting may not reflect real development
- Construct: "Productivity" operationalized as speed + correctness
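The counterbalancing mentioned under internal validity can be sketched with a cyclic Latin square, so that every task appears in every position equally often across participants. This is a minimal illustration, not a prescribed procedure: the task labels `T1` to `T5`, the participant IDs, and the helper names `latin_square_orders` and `assign_orders` are hypothetical.

```python
def latin_square_orders(tasks):
    """Build a cyclic Latin square: row r is the task list rotated by r.

    Each task appears exactly once in every position across the rows.
    """
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participant_ids, tasks):
    """Assign task orders to participants round-robin over the square's rows."""
    orders = latin_square_orders(tasks)
    return {
        pid: orders[i % len(orders)]
        for i, pid in enumerate(participant_ids)
    }


# Hypothetical labels matching the design above: 5 tasks, 60 participants.
tasks = ["T1", "T2", "T3", "T4", "T5"]
participants = [f"P{i:02d}" for i in range(1, 61)]

assignment = assign_orders(participants, tasks)
print(assignment["P01"])  # ['T1', 'T2', 'T3', 'T4', 'T5']
print(assignment["P02"])  # ['T2', 'T3', 'T4', 'T5', 'T1']
```

With 60 participants and 5 orders, each order is used 12 times. A cyclic square balances task position only; it does not balance which task immediately precedes which, so carryover effects may still need separate treatment.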
Analyzing data from version control, issue trackers, code review systems:
# Example: Analyze commit patterns using PyDriller
# pip install pydriller pandas
from datetime import datetime

import pandas as pd
from pydriller import Repository

repo_url = "https://github.com/apache/kafka"
commit_data = []

# Traverse all commits made during 2023 (a remote URL is cloned to a temp dir first)
for commit in Repository(repo_url, since=datetime(2023, 1, 1),
                         to=datetime(2023, 12, 31)).traverse_commits():
    commit_data.append({
        "hash": commit.hash[:8],
        "author": commit.author.name,
        "date": commit.committer_date,
        "files_changed": commit.files,  # number of files modified in the commit
        "insertions": commit.insertions,
        "deletions": commit.deletions,
        "message": commit.msg[:100]
    })

df = pd.DataFrame(commit_data)
print(f"Total commits in 2023: {len(df)}")
print(f"Unique contributors: {df['author'].nunique()}")
print(f"Avg files per commit: {df['files_changed'].mean():.1f}")
In-depth investigation of a phenomenon in its real-world context:
Case Study Protocol (based on Yin, 2018):
1. Research questions: How do teams adopt microservices?
2. Unit of analysis: Development teams at 3 companies
3. Data sources:
   - Semi-structured interviews (8-12 per company)
   - Architecture documentation review
   - Commit history and deployment logs
   - Meeting observations
4. Analysis: Thematic analysis with cross-case comparison
5. Validity: Triangulation across data sources, member checking
| Benchmark | Task | Languages | Size |
|-----------|------|-----------|------|
| HumanEval | Code generation from docstrings | Python | 164 problems |
| MBPP | Code generation from descriptions | Python | 974 problems |
| SWE-bench | Real-world GitHub issue resolution | Python | 2,294 instances |
| CodeXGLUE | Multiple code tasks | 6 languages | Varies by task |
| BigCloneBench | Clone detection | Java | 6M clone pairs |
| Defects4J | Bug localization and repair | Java | 835 real bugs |
| Dataset | Content | Use Cases |
|---------|---------|-----------|
| GHTorrent | GitHub event data (commits, issues, PRs) | MSR studies |
| Software Heritage | Universal source code archive | Code evolution, provenance |
| Stack Overflow Data Dump | Q&A posts, tags, votes | Developer knowledge, NLP |
| CVE Database | Vulnerability records | Security research |
| Chrome/Firefox Bug Trackers | Bug reports, patches | Bug triage, severity prediction |
# Example: Using tree-sitter for AST-level code analysis
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PYTHON_LANGUAGE = Language(tspython.language())
parser = Parser(PYTHON_LANGUAGE)

source_code = b"""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

tree = parser.parse(source_code)
root = tree.root_node

def count_nodes(node, node_type):
    """Count AST nodes of a given type."""
    count = 1 if node.type == node_type else 0
    for child in node.children:
        count += count_nodes(child, node_type)
    return count

print(f"Function definitions: {count_nodes(root, 'function_definition')}")
print(f"If statements: {count_nodes(root, 'if_statement')}")
print(f"Return statements: {count_nodes(root, 'return_statement')}")
print(f"Function calls: {count_nodes(root, 'call')}")
# Common software metrics
metrics = {
    "Lines of Code (LOC)": "Total lines (including blanks and comments)",
    "Cyclomatic Complexity": "Number of independent paths (McCabe, 1976)",
    "Halstead Volume": "Based on operators and operands count",
    "Maintainability Index": "Composite of LOC, CC, and Halstead",
    "Coupling Between Objects": "Number of other classes referenced",
    "Depth of Inheritance": "Levels in class hierarchy",
    "Code Churn": "Lines added + modified + deleted per period",
    "Comment Density": "Ratio of comment lines to total lines"
}

# Calculate cyclomatic complexity using radon
# pip install radon
import subprocess

result = subprocess.run(
    ["radon", "cc", "my_module.py", "-s", "-j"],
    capture_output=True, text=True
)
print(result.stdout)
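To make the cyclomatic complexity definition above concrete, here is a minimal sketch that approximates it with Python's standard `ast` module: start at 1 and add one for each decision point. The set of node types counted is an assumption of this sketch, and the helper names are hypothetical. It is not radon's algorithm, and dedicated tools may count some constructs differently.

```python
import ast

# Node types treated as decision points in this sketch (an assumption;
# real tools may include or exclude other constructs).
BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func_node):
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(func_node):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths
            complexity += len(node.values) - 1
    return complexity


def complexity_by_function(source):
    """Map each function name in `source` to its approximate complexity."""
    tree = ast.parse(source)
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


# Hypothetical sample input for illustration
SOURCE = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def classify(x):
    if x < 0 and x != -1:
        return "negative"
    for _ in range(3):
        if x == 0:
            return "zero"
    return "positive"
'''

print(complexity_by_function(SOURCE))  # {'fibonacci': 2, 'classify': 5}
```

Because `ast.walk` descends into nested definitions, a nested function's branches are also counted toward its enclosing function in this sketch.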
| Venue | Type | Acceptance Rate | Focus |
|-------|------|-----------------|-------|
| ICSE | Conference | ~22% | Broad SE |
| FSE/ESEC | Conference | ~24% | Broad SE |
| ASE | Conference | ~22% | Automated SE |
| ISSTA | Conference | ~25% | Software testing |
| MSR | Conference | ~30% | Mining repositories |
| TOSEM | Journal | -- | Broad SE (ACM) |
| TSE | Journal | -- | Broad SE (IEEE) |
| EMSE | Journal | -- | Empirical SE (Springer) |
| Venue | Type | Focus |
|-------|------|-------|
| SOSP/OSDI | Conference | Operating systems, distributed systems |
| EuroSys | Conference | Systems (Europe) |
| NSDI | Conference | Networked systems design |
| IEEE S&P (Oakland) | Conference | Security and privacy |
| USENIX Security | Conference | Security |
| CCS | Conference | Computer and communications security |
| NDSS | Conference | Network and distributed systems security |
| Tool | Purpose | URL |
|------|---------|-----|
| PyDriller | Git repository mining (Python) | github.com/ishepard/pydriller |
| Radon | Python code metrics | github.com/rubik/radon |
| SonarQube | Multi-language static analysis | sonarqube.org |
| Understand | Code analysis and metrics | scitools.com |
| Joern | Code analysis platform (CPG) | joern.io |
| CodeQL | Semantic code analysis | codeql.github.com |
| tree-sitter | Incremental parsing library | tree-sitter.github.io |