aops-tools/skills/analyst/SKILL.md
Support academic research data analysis with technology-agnostic principles — research-data immutability, a versioned/tested/reproducible transformation layer, statistical methodology, and self-documenting research. Use this skill for any computational research project with an empirical data pipeline. The skill enforces academicOps best practices for reproducible, transparent research with a collaborative single-step workflow. Tech-specific how-to (dbt, Streamlit, Python plotting/stats) lives in the aops-extras package.
npx skillsauth add nicsuzor/academicops analyst

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Taxonomy note: This skill provides tech-agnostic domain principles (HOW) for research data analysis. Technology-specific how-to (dbt, Streamlit, Python plotting/stats) lives in the aops-extras package skills. See [[aops-core/skills/remember/references/TAXONOMY.md]] for the skill/workflow distinction.
Support academic research data analysis through technology-agnostic principles: reproducible data pipelines, automated testing, self-documenting code, and fail-fast validation. The principles here hold regardless of which transformation engine or dashboard tool you use. When you have settled on specific tooling, pair this skill with the relevant aops-extras skill (dbt, streamlit, python-viz) for the concrete commands.
Core principle: Take ONE action at a time (generate a chart, update database, create a test), then yield to the user for feedback before proceeding.
Source datasets, ground truth labels, experimental records, and research configurations are SACRED. NEVER modify, reformat, or "fix" them. If infrastructure doesn't support a format: HALT and report. Violations are scholarly misconduct.
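One way to notice an accidental modification of source data is a checksum manifest. The following is a minimal sketch, not part of the skill itself; the function names and the idea of storing the manifest are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_dir: Path) -> dict[str, str]:
    """Map each file under source_dir (relative POSIX path) to its digest."""
    return {
        p.relative_to(source_dir).as_posix(): sha256_of(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(source_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files modified, added, or removed since the manifest was built."""
    current = build_manifest(source_dir)
    names = set(current) | set(manifest)
    return sorted(n for n in names if current.get(n) != manifest.get(n))
```

A non-empty result from `changed_files` is a reason to halt and report, not something to repair automatically.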
Data directory separation: Local data files (data/) and build output directories (output/, _book/, etc.) MUST NOT overlap. Build tools clean their output directories — any data stored there will be destroyed. See [[instructions/research-documentation.md#data-directory-separation-critical]] for the full convention.
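The separation rule can be checked mechanically before a build runs. This sketch assumes the directory names are passed in by the caller; `overlapping_dirs` is a hypothetical helper, not an existing academicOps function.

```python
from pathlib import Path


def overlapping_dirs(data_dir: str, output_dirs: list[str]) -> list[str]:
    """Return the output directories that contain, or sit inside, data_dir."""
    data = Path(data_dir).resolve()
    clashes = []
    for name in output_dirs:
        out = Path(name).resolve()
        # Overlap means: same directory, output above data, or output below data.
        if out == data or out in data.parents or data in out.parents:
            clashes.append(name)
    return clashes
```

For example, `overlapping_dirs("data", ["output", "_book"])` returns an empty list, while a data directory nested under `output/` is flagged.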
ALL data transformation happens in a versioned, tested, reproducible transformation layer. The presentation layer ONLY displays pre-computed data. Period.
This is non-negotiable for academic integrity, reproducibility, and auditability. It is a property of the architecture, not of any particular tool. For example, the transformation layer might be a dbt project, a SQL pipeline, or scripted notebooks under version control; the presentation layer might be a Streamlit dashboard, a static report, or a notebook viewer. See the aops-extras dbt and streamlit skills for those concrete implementations.
| Layer | Allowed | Prohibited |
| --- | --- | --- |
| Transformation | ALL transformations, joins, aggregations, filtering, business logic | - |
| Presentation | Display, formatting, interactive filtering of PRE-COMPUTED data | Any operation that transforms, joins, aggregates, or applies logic |
- Need a new metric? → Add it to the transformation layer with tests
- Need to filter data? → Pre-compute the filtered view in the transformation layer OR filter on EXISTING columns in the presentation layer (no new calculations)
- Need to join tables? → Do the join in the transformation layer
- Need aggregations? → Compute them in the transformation layer
The presentation layer may:

- Display pre-computed data (`SELECT * FROM precomputed_table`)
- Filter on existing columns (`WHERE column = :user_selection`)

The presentation layer must NEVER:

- Aggregate (`SUM(...) GROUP BY ...` = transformation)
- Join (`a.*, b.* FROM a JOIN b` = transformation)
- Apply logic (`CASE WHEN ... END` = transformation)

STOP. Move the transformation into the transformation layer instead.
This takes more time. That's the point. Transformations deserve scrutiny.
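The boundary between the two layers can be approximated with a keyword check on presentation-layer SQL. This is a rough heuristic sketch under stated assumptions: the rule set and the name `transformation_smells` are hypothetical, and a regex scan is no substitute for review.

```python
import re

# Hypothetical rule set: SQL constructs that signal a transformation.
PROHIBITED = {
    "aggregation": re.compile(
        r"\bGROUP\s+BY\b|\b(?:SUM|AVG|COUNT|MIN|MAX)\s*\(", re.IGNORECASE
    ),
    "join": re.compile(r"\bJOIN\b", re.IGNORECASE),
    "conditional logic": re.compile(r"\bCASE\s+WHEN\b", re.IGNORECASE),
}


def transformation_smells(query: str) -> list[str]:
    """Return the kinds of transformation a presentation-layer query contains."""
    return [label for label, pattern in PROHIBITED.items() if pattern.search(query)]
```

A plain read with a filter on an existing column yields an empty list; any non-empty result suggests the query belongs in the transformation layer.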
[[references/context-discovery.md]], [[references/quick-reference-commands.md]]
Start with [[references/statistical-analysis.md]] (complete guide). Also: [[references/test_selection_guide.md]], [[references/assumptions_and_diagnostics.md]], [[references/effect_sizes_and_power.md]], [[references/bayesian_statistics.md]], [[references/reporting_standards.md]].
The concrete how-to for particular tools lives in the aops-extras package, so it can be swapped for official/community-consensus skills:
- dbt: transformation-layer implementation (models, tests, marts).
- streamlit: presentation-layer implementation (display-only dashboards).
- python-viz: Python plotting & statistical-modelling libraries (matplotlib, seaborn, statsmodels). Use the python-dev skill for code standards.

Invoke this skill when:
- (… `streamlit` skill for that engine)
- (… `dbt` skill for that engine)

Key indicators in project structure:
- `dbt/models/` directory (staging, intermediate, marts)
- `streamlit/` directory or dashboard `.py` files
- `data/warehouse.db` or similar analytical database

START
│
├─ Is this a new analysis task?
│  ├─ YES → Go to: Context Discovery
│  └─ NO → Is context already loaded?
│     ├─ YES → Go to: Task Execution
│     └─ NO → Go to: Context Discovery
│
Context Discovery (REQUIRED FIRST STEP)
│
├─ Read project context files:
│  ├─ README.md (current directory + all parents to project root)
│  ├─ data/README.md (if exists)
│  └─ data/projects/[project-name].md (if exists)
│
├─ Identify project conventions:
│  ├─ Research questions
│  ├─ Data sources and access patterns
│  ├─ Existing transformation-layer models (list them)
│  ├─ Testing strategy
│  └─ Project-specific rules
│
└─ Proceed to: Task Execution
│
Task Execution
│
├─ What type of task?
│  ├─ Data access → Go to: Data Access Workflow
│  ├─ Visualization → Go to: Visualization Workflow
│  ├─ Transformation model → Go to: Transformation Model Workflow
│  ├─ Testing → Go to: Testing Workflow
│  └─ Exploration → Go to: Exploratory Analysis
│
└─ After completing ONE step:
   ├─ Report results to user
   ├─ Explain what was done
   └─ STOP and wait for user feedback
CRITICAL FIRST STEP: Before any analysis work, automatically discover and read project context.
Project README files

- `README.md` (… `papers/automod/`, `projects/buttermilk/`)

Data README

- `data/README.md` in the project

Project overview

- `data/projects/[project-name].md` corresponding to current project

From these files, identify:
Example context discovery:
# List existing transformation-layer models (engine-specific; e.g. dbt)
ls -1 dbt/models/staging/*.sql dbt/models/marts/*.sql
# Check for presentation-layer apps (engine-specific; e.g. Streamlit)
ls -1 streamlit/*.py
# Understand project structure
cat README.md
cat data/README.md
The example commands above assume a dbt + Streamlit stack. For the concrete per-engine discovery commands, see the aops-extras dbt and streamlit skills.
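The README walk described above (current directory plus all parents up to the project root) can also be sketched in Python. The helper name `context_files` is hypothetical, and the sketch assumes the starting directory lies inside the project root.

```python
from pathlib import Path


def context_files(start: Path, project_root: Path) -> list[Path]:
    """README.md files from start up to project_root, then data/README.md."""
    start, project_root = start.resolve(), project_root.resolve()
    found = []
    # Walk upward: the starting directory first, then each parent in turn.
    for folder in [start, *start.parents]:
        readme = folder / "README.md"
        if readme.is_file():
            found.append(readme)
        if folder == project_root:
            break  # Do not climb above the project root.
    data_readme = project_root / "data" / "README.md"
    if data_readme.is_file():
        found.append(data_readme)
    return found
```

The result is ordered from most specific to most general, which is usually the order in which project-specific rules should take precedence.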
After context discovery, summarize findings to user:
"I've reviewed the project context. This is a <research topic> project investigating <questions>. The transformation layer has <N> staging models and <M> mart models. I see existing work on <areas>. What would you like me to help with?"
🚨 CRITICAL RULE: ALL data access MUST go through the modelled transformation layer. NEVER query raw upstream sources directly.
🚨 REMINDER: If you need to transform data, that transformation MUST live in the transformation layer with tests. See "Transformation Layer vs Presentation Layer" above.
Need data for analysis?
│
├─ Does required data exist in the modelled (mart) layer?
│  ├─ YES → Read it (e.g. `SELECT * FROM mart_name`)
│  │  └─ Done! Use this data in analysis.
│  │
│  └─ NO → Does it exist in staging models?
│     ├─ YES → Should this become a new mart?
│     │  ├─ YES → Go to: Transformation Model Workflow (create mart)
│     │  └─ NO → Use staging model for exploratory work
│     │
│     └─ NO → Data doesn't exist in the transformation layer yet
│        └─ Ask user: "Should I create a model for [data source]?"
│           ├─ YES → Go to: Transformation Model Workflow (create staging model)
│           └─ NO → Stop. Cannot proceed without a modelled source.
❌ NEVER do this:
# Direct BigQuery query against raw source - PROHIBITED
df = client.query("SELECT * FROM bigquery.raw.cases").to_dataframe()
# Direct database query against raw schema - PROHIBITED
df = pd.read_sql("SELECT * FROM raw_schema.table", engine)
# Direct API call for analysis data - PROHIBITED
response = requests.get("https://api.example.com/data")
✅ ALWAYS do this:
# Query through the modelled layer - CORRECT
import duckdb
conn = duckdb.connect("data/warehouse.db")
df = conn.execute("SELECT * FROM fct_case_decisions").df() # fct_* = a tested mart
See: the aops-extras dbt skill for the dbt implementation of this policy.
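The access policy can be enforced in code with a small guard that only reads relations from the modelled layer. This is a sketch under assumptions: it uses the standard-library `sqlite3` module as a stand-in for the project warehouse, and `read_modelled` is a hypothetical helper that relies on the `stg_*`, `fct_*` and `dim_*` naming used in this document.

```python
import sqlite3

MART_PREFIXES = ("fct_", "dim_")  # analysis-ready marts
STAGING_PREFIX = "stg_"           # allowed only for exploratory work


def read_modelled(conn: sqlite3.Connection, table: str, allow_staging: bool = False):
    """Return all rows of a modelled relation; refuse anything else."""
    allowed = MART_PREFIXES + ((STAGING_PREFIX,) if allow_staging else ())
    if not table.startswith(allowed):
        raise ValueError(f"{table!r} is not in the modelled layer")
    # Confirm the relation exists before placing its name in the SQL text.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type IN ('table', 'view') AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        raise LookupError(f"{table!r} does not exist; create a model for it first")
    return conn.execute(f'SELECT * FROM "{table}"').fetchall()
```

Staging models are rejected by default so that exploratory reads have to be requested explicitly with `allow_staging=True`.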
Create or modify transformation-layer models following academicOps layered architecture. The layering below is engine-neutral; the aops-extras dbt skill gives the dbt-specific commands and file layout.
- Staging (`stg_*`) - Clean and standardize raw data (no business logic)
- Intermediate (`int_*`) - Business logic transformations (can be ephemeral)
- Marts (`fct_*`, `dim_*`) - Analysis-ready datasets (materialized)

ALWAYS check for duplicate models before creating new ones.
See: the aops-extras dbt skill for complete workflow details and comprehensive patterns.
Create presentation-layer visualizations following the single-step collaborative pattern.
🚨 REMINDER: The presentation layer is DISPLAY ONLY. No transformations. See "Transformation Layer vs Presentation Layer" above.
For the detailed engine-specific workflow (structure, single-step patterns, examples), see the aops-extras streamlit skill.
Load data → STOP → Create chart → STOP → Add interactivity → STOP. One change at a time. See the aops-extras streamlit skill for engine-specific tips (e.g. Streamlit hot-reload).
Add tests to validate data quality at every pipeline stage.
Use appropriate test type for the validation:
| Test Type | Use For | Example |
| --- | --- | --- |
| Schema tests | Column-level checks | not_null, unique, accepted_values |
| Singular tests | Multi-column logic | Date range validation, cross-table consistency |
| Package tests | Common patterns | Recency checks, multi-column uniqueness |
| Diagnostic models | Quality monitoring | Aggregated metrics for manual review |
Step 1: Identify what to test
Review the model and ask:
STOP. Discuss with user which tests to add.
Step 2: Add schema tests (after user agrees on test plan)
The examples below use dbt's schema.yml syntax to illustrate the principle — column-level tests declared alongside the model. See the aops-extras dbt skill for the full engine-specific testing reference; any transformation engine should provide an equivalent declarative test layer.
# dbt/schema.yml (dbt example)
models:
  - name: stg_cases
    columns:
      - name: case_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["pending", "reviewed", "published"]
STOP. Show to user.
Step 3: Run tests (after user approves test definitions)
dbt test --select stg_cases
STOP. Report results. If failures, discuss with user before fixing.
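For a project without dbt, the same three column-level checks can be approximated directly in SQL. The sketch below uses the standard-library `sqlite3` module as a stand-in engine; the function names are hypothetical and only mirror the intent of the dbt tests named above, they are not dbt's implementation.

```python
import sqlite3


def not_null_failures(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count rows where the column is NULL."""
    sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(sql).fetchone()[0]


def unique_failures(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count distinct non-NULL values that occur more than once."""
    sql = (
        f'SELECT COUNT(*) FROM (SELECT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL GROUP BY "{column}" HAVING COUNT(*) > 1)'
    )
    return conn.execute(sql).fetchone()[0]


def accepted_values_failures(
    conn: sqlite3.Connection, table: str, column: str, values: list[str]
) -> int:
    """Count non-NULL rows whose value is outside the accepted set."""
    placeholders = ", ".join("?" for _ in values)
    sql = (
        f'SELECT COUNT(*) FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL AND "{column}" NOT IN ({placeholders})'
    )
    return conn.execute(sql, list(values)).fetchone()[0]
```

Each function returns a failure count, so zero means the check passes. Table and column names are interpolated into the SQL text, so they should come from project configuration, never from user input.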
Step 4: Add singular test if needed (complex validation)
-- tests/assert_decision_dates_logical.sql
-- Returns the rows that violate the rule; the test passes when no rows come back.
select
    case_id,
    submission_date,
    decision_date
from {{ ref('stg_cases') }}
where decision_date < submission_date
STOP. Show test SQL to user.
Step 5: Run singular test
dbt test --select assert_decision_dates_logical
STOP. Report results.
Use severity: warn for known issues or aspirational standards:
tests:
  - not_null:
      severity: warn # Don't fail build, just warn
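The singular-test idea (a query that selects violating rows) and the severity setting combine naturally. The following sketch shows the logic in engine-neutral form, again with `sqlite3` standing in for the warehouse; `run_singular_test` and its return values are assumptions made for illustration.

```python
import sqlite3


def run_singular_test(conn: sqlite3.Connection, sql: str, severity: str = "error") -> str:
    """Run a query that selects violating rows and report the outcome.

    Zero rows means "pass". Otherwise the result is "warn" when severity
    is "warn", and "fail" for any other severity.
    """
    violations = conn.execute(sql).fetchall()
    if not violations:
        return "pass"
    return "warn" if severity == "warn" else "fail"


# The date-order rule from the singular test above, written as plain SQL.
DECISION_DATES_LOGICAL = (
    "SELECT case_id, submission_date, decision_date "
    "FROM stg_cases WHERE decision_date < submission_date"
)
```

A build script would stop on "fail" and continue on "warn" after logging the violating rows for discussion with the user.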
When testing LLM pipelines or templated content, validate substantive content not just error patterns:
- (… `.*?` doesn't cross newlines)

See: the aops-extras dbt skill for complete engine-specific testing patterns.
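The newline caveat is easy to reproduce in Python, where `.` does not match a newline unless `re.DOTALL` is set. In this sketch the `BEGIN_SUMMARY` / `END_SUMMARY` delimiters and the three-word threshold are hypothetical; substitute whatever your templates actually emit.

```python
import re

# Hypothetical delimiters around a templated section of LLM output.
PATTERN = r"BEGIN_SUMMARY(.*?)END_SUMMARY"

output = "BEGIN_SUMMARY\nThe ruling was upheld on appeal.\nEND_SUMMARY"

# Without DOTALL, "." cannot cross the newline, so the lazy match finds nothing.
single_line = re.search(PATTERN, output)

# With DOTALL, "." also matches newlines and the body is captured.
multi_line = re.search(PATTERN, output, re.DOTALL)


def has_substantive_content(text: str, min_words: int = 3) -> bool:
    """Check that the delimited section exists and holds at least min_words words."""
    match = re.search(PATTERN, text, re.DOTALL)
    return match is not None and len(match.group(1).split()) >= min_words
```

A check like `has_substantive_content` tests for the presence of real content, which is stricter than only scanning for known error strings.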
When investigating data quality issues (missing values, unexpected patterns, join coverage), create REUSABLE investigation scripts in the analyses/ directory. Never use throwaway one-liners for data investigation.
For complete workflow, script templates, and when to create investigation scripts, see [[instructions/data-investigation.md]]
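As an illustration of what a reusable investigation script might contain, the sketch below counts missing values per column. It is not the template from the linked instructions: the file name `analyses/null_coverage.py` is hypothetical and `sqlite3` stands in for the project warehouse.

```python
"""analyses/null_coverage.py (hypothetical name): missing values per column."""
import sqlite3


def null_counts(conn: sqlite3.Connection, table: str) -> dict[str, int]:
    """Return {column name: number of NULL rows} for every column in table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    counts = {}
    for column in columns:
        sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
        counts[column] = conn.execute(sql).fetchone()[0]
    return counts
```

Keeping the function in a committed script means the same check can be rerun after each pipeline change and its output compared over time.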
When exploring data patterns and relationships, follow a collaborative discovery process. Take one analytical step at a time, yielding to the user after each finding.
For complete exploration workflow and anti-patterns, see [[instructions/exploratory-analysis.md]]
NOTE: For data quality issues (missing values, unexpected nulls), use the Data Investigation Workflow instead.
Self-documenting work: Do NOT create separate analysis reports or random documentation files.
🚨 CRITICAL: Research projects must follow STRICT documentation structure. See [[instructions/research-documentation.md]] for complete requirements.
Research projects MUST maintain:
- (… `dbt/schema.yml`)
- (… `dbt/schema.yml`)

❌ Create `analysis_report.md` or any random markdown files
❌ Create `findings_summary.docx`
❌ Proliferate documentation files without defined structure
❌ Leave documentation stale when code changes

✅ Follow strict structure defined in [[instructions/research-documentation.md]]
✅ Update documentation in SAME commit as code changes
✅ One source of truth for each piece of information
One step at a time:
Never:
Always:
See [[references/quick-reference-commands.md]] for common data-pipeline and DuckDB commands. For engine-specific commands, see the aops-extras dbt and streamlit skills.