Name: software-engineering-research
Author: wentorai

Software Engineering Research Guide

Navigate the landscape of software engineering research, including key subfields, methodologies, datasets, benchmarks, and top venues.

SE Research Subfields

| Subfield | Key Topics | Major Venues |
|----------|-----------|-------------|
| Software Testing | Test generation, fuzzing, mutation testing, flaky tests | ISSTA, ICST, ASE |
| Program Analysis | Static analysis, abstract interpretation, symbolic execution | PLDI, POPL, OOPSLA |
| Software Maintenance | Code refactoring, technical debt, code smells, evolution | ICSME, MSR, SANER |
| SE for AI/ML | ML pipeline testing, data quality, model debugging | ICSE-SEIP, FSE |
| AI for SE | Code generation, bug detection, program repair | ICSE, FSE, ASE |
| Distributed Systems | Consensus, fault tolerance, scalability, microservices | SOSP, OSDI, EuroSys |
| Cybersecurity | Vulnerability detection, malware analysis, privacy | IEEE S&P, CCS, USENIX Security |
| HCI in SE | Developer tools, IDE usability, code comprehension | CHI, CSCW, VL/HCC |
| Empirical SE | Mining repositories, developer surveys, controlled experiments | ESEM, MSR, TOSEM |

Research Methodologies in SE

Controlled Experiments

Testing a specific hypothesis with treatment and control groups:

Example: Does AI code completion improve developer productivity?

Design:
- Participants: 60 professional developers
- Treatment: IDE with AI code completion enabled
- Control: IDE with AI code completion disabled
- Task: Complete 5 programming tasks of varying difficulty
- Metrics: Task completion time, code correctness, lines of code
- Analysis: Mixed-effects linear model with participant as random effect
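
As a sketch of what the analysis line could look like when written out, one common form is a random-intercept model. The symbols below are assumptions introduced for illustration and are not part of the design above: y_ij is the outcome (for example, completion time) of participant i on task j, treat_i is 1 if participant i had AI completion enabled, and u_i is the participant-level random intercept.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{treat}_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under these assumptions, \beta_1 is the estimated treatment effect. A fixed effect per task could be added to account for varying task difficulty.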

Threats to validity:
- Internal: Learning effect (counterbalance task order)
- External: Lab setting may not reflect real development
- Construct: "Productivity" operationalized as speed + correctness
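
The counterbalancing mentioned under internal validity can be sketched with a cyclic Latin square, in which every task appears in every position equally often across participants. This is a minimal illustration using only the standard library; the task labels, participant IDs, and function names are hypothetical. A cyclic square balances task position but not carryover between specific task pairs.

```python
def latin_square_orders(tasks):
    """Return len(tasks) task orders forming a cyclic Latin square."""
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participant_ids, tasks):
    """Map each participant to a task order, cycling through the square's rows."""
    orders = latin_square_orders(tasks)
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}


tasks = ["T1", "T2", "T3", "T4", "T5"]               # hypothetical task labels
participants = [f"P{i:02d}" for i in range(1, 61)]   # 60 participants, as in the design
schedule = assign_orders(participants, tasks)

print(schedule["P01"])  # ['T1', 'T2', 'T3', 'T4', 'T5']
print(schedule["P02"])  # ['T2', 'T3', 'T4', 'T5', 'T1']
```

With 60 participants and 5 tasks, each of the 5 orders is used 12 times. In a between-group design the assignment would typically be done separately within the treatment and control groups.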

Mining Software Repositories (MSR)

Analyzing data from version control, issue trackers, code review systems:

# Example: Analyze commit patterns using PyDriller
# pip install pydriller pandas
from datetime import datetime

import pandas as pd
from pydriller import Repository

repo_url = "https://github.com/apache/kafka"

# Collect one record per commit made during 2023.
# Given a remote URL, PyDriller clones the repository to a temporary directory first.
commit_data = []
for commit in Repository(repo_url, since=datetime(2023, 1, 1),
                         to=datetime(2023, 12, 31, 23, 59, 59)).traverse_commits():
    commit_data.append({
        "hash": commit.hash[:8],
        "author": commit.author.name,
        "date": commit.committer_date,
        "files_changed": commit.files,  # number of files modified in the commit
        "insertions": commit.insertions,
        "deletions": commit.deletions,
        "message": commit.msg[:100]
    })

# Summarize the collected commits
df = pd.DataFrame(commit_data)
print(f"Total commits in 2023: {len(df)}")
print(f"Unique contributors: {df['author'].nunique()}")
print(f"Avg files per commit: {df['files_changed'].mean():.1f}")
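
When installing a mining library is not an option, per-commit insertions and deletions can also be recovered from plain `git log` output. The sketch below assumes the log was produced with `git log --numstat --pretty=format:"commit %H"`; the function name `parse_numstat` and the hashes and paths in `sample_log` are made up for illustration.

```python
def parse_numstat(log_text):
    """Return {commit_hash: (insertions, deletions)} from numstat-style log text."""
    churn = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
            churn[current] = (0, 0)
        elif line.strip() and current is not None:
            # numstat lines are "<added>\t<deleted>\t<path>"
            added, deleted, _path = line.split("\t", 2)
            if added == "-":
                # binary files report "-" instead of line counts
                continue
            ins, dels = churn[current]
            churn[current] = (ins + int(added), dels + int(deleted))
    return churn


# Hypothetical log excerpt, not taken from a real repository
sample_log = (
    "commit aaaa1111\n"
    "10\t2\tsrc/app.py\n"
    "-\t-\tdocs/logo.png\n"
    "\n"
    "commit bbbb2222\n"
    "3\t7\tsrc/app.py\n"
    "1\t0\tREADME.md\n"
)

for commit_hash, (ins, dels) in parse_numstat(sample_log).items():
    print(commit_hash, "insertions:", ins, "deletions:", dels)
```

Summing insertions and deletions per commit gives a simple churn figure that can then be grouped by author or by time period.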

Case Studies

In-depth investigation of a phenomenon in its real-world context:

Case Study Protocol (based on Yin, 2018):
1. Research questions: How do teams adopt microservices?
2. Unit of analysis: Development teams at 3 companies
3. Data sources:
   - Semi-structured interviews (8-12 per company)
   - Architecture documentation review
   - Commit history and deployment logs
   - Meeting observations
4. Analysis: Thematic analysis with cross-case comparison
5. Validity: Triangulation across data sources, member checking

Key Datasets and Benchmarks

Code Understanding and Generation

| Benchmark | Task | Languages | Size |
|-----------|------|-----------|------|
| HumanEval | Code generation from docstrings | Python | 164 problems |
| MBPP | Code generation from descriptions | Python | 974 problems |
| SWE-bench | Real-world GitHub issue resolution | Python | 2,294 instances |
| CodeXGLUE | Multiple code tasks | 6 languages | Varies by task |
| BigCloneBench | Clone detection | Java | 6M clone pairs |
| Defects4J | Bug localization and repair | Java | 835 real bugs |

Software Engineering Process

| Dataset | Content | Use Cases |
|---------|---------|-----------|
| GHTorrent | GitHub event data (commits, issues, PRs) | MSR studies |
| Software Heritage | Universal source code archive | Code evolution, provenance |
| Stack Overflow Data Dump | Q&A posts, tags, votes | Developer knowledge, NLP |
| CVE Database | Vulnerability records | Security research |
| Chrome/Firefox Bug Trackers | Bug reports, patches | Bug triage, severity prediction |

Static Analysis Tools for Research

# Example: Using tree-sitter for AST-level code analysis
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PYTHON_LANGUAGE = Language(tspython.language())
parser = Parser(PYTHON_LANGUAGE)

source_code = b"""
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

tree = parser.parse(source_code)
root = tree.root_node

def count_nodes(node, node_type):
    """Count AST nodes of a given type."""
    count = 1 if node.type == node_type else 0
    for child in node.children:
        count += count_nodes(child, node_type)
    return count

print(f"Function definitions: {count_nodes(root, 'function_definition')}")
print(f"If statements: {count_nodes(root, 'if_statement')}")
print(f"Return statements: {count_nodes(root, 'return_statement')}")
print(f"Function calls: {count_nodes(root, 'call')}")

Code Metrics

# Common software metrics
metrics = {
    "Lines of Code (LOC)": "Total lines (including blanks and comments)",
    "Cyclomatic Complexity": "Number of independent paths (McCabe, 1976)",
    "Halstead Volume": "Based on operators and operands count",
    "Maintainability Index": "Composite of LOC, CC, and Halstead",
    "Coupling Between Objects": "Number of other classes referenced",
    "Depth of Inheritance": "Levels in class hierarchy",
    "Code Churn": "Lines added + modified + deleted per period",
    "Comment Density": "Ratio of comment lines to total lines"
}

# Calculate cyclomatic complexity using radon
# pip install radon
import subprocess
result = subprocess.run(
    ["radon", "cc", "my_module.py", "-s", "-j"],
    capture_output=True, text=True
)
print(result.stdout)
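
To make the cyclomatic complexity definition above concrete, the following sketch approximates it with the standard-library `ast` module as 1 plus the number of decision points in a function. It is a simplified illustration, not radon's algorithm: the set of node types counted and the function names are assumptions made here, and radon's exact counting rules differ in some details.

```python
import ast

# Node types treated as decision points in this simplified count
DECISION_NODES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While,
                  ast.ExceptHandler, ast.comprehension)


def cyclomatic_complexity(func_node):
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(func_node):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two branches
            complexity += len(node.values) - 1
    return complexity


def complexity_by_function(source):
    """Map each function name in `source` to its approximate complexity.

    Nested functions are also counted toward their enclosing function.
    """
    tree = ast.parse(source)
    return {node.name: cyclomatic_complexity(node)
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}


source = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
'''

print(complexity_by_function(source))  # {'fibonacci': 2}
```

The single `if` gives `fibonacci` two independent paths, hence a complexity of 2.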

Top Venues and Impact

Tier-1 SE Venues

| Venue | Type | Acceptance Rate | Focus |
|-------|------|-----------------|-------|
| ICSE | Conference | ~22% | Broad SE |
| FSE/ESEC | Conference | ~24% | Broad SE |
| ASE | Conference | ~22% | Automated SE |
| ISSTA | Conference | ~25% | Software testing |
| MSR | Conference | ~30% | Mining repositories |
| TOSEM | Journal | -- | Broad SE (ACM) |
| TSE | Journal | -- | Broad SE (IEEE) |
| EMSE | Journal | -- | Empirical SE (Springer) |

Systems and Security Venues

| Venue | Type | Focus |
|-------|------|-------|
| SOSP/OSDI | Conference | Operating systems, distributed systems |
| EuroSys | Conference | Systems (Europe) |
| NSDI | Conference | Networked systems design |
| IEEE S&P (Oakland) | Conference | Security and privacy |
| USENIX Security | Conference | Security |
| CCS | Conference | Computer and communications security |
| NDSS | Conference | Network and distributed systems security |

Research Tools Ecosystem

| Tool | Purpose | URL |
|------|---------|-----|
| PyDriller | Git repository mining (Python) | github.com/ishepard/pydriller |
| Radon | Python code metrics | github.com/rubik/radon |
| SonarQube | Multi-language static analysis | sonarqube.org |
| Understand | Code analysis and metrics | scitools.com |
| Joern | Code analysis platform (CPG) | joern.io |
| CodeQL | Semantic code analysis | codeql.github.com |
| tree-sitter | Incremental parsing library | tree-sitter.github.io |
