skills/wrds/SKILL.md
Use when "query WRDS", "pull SEC filings", "access Compustat/CRSP/ExecuComp/Capital IQ", "Form 4 insider data", "ISS governance/compensation", "TAQ intraday/NBBO", "SDC M&A or new issues", "FISD bonds", "Form D/ADV", "FJC court data", or any WRDS PostgreSQL query or SAS ETL on the WRDS grid (qsub/qsas/SGE).
npx skillsauth add edwinhu/workflows wrdsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
ALWAYS write an SGE submission script and submit via qsub. No exceptions.
ssh wrds 'cat files.tsv | ./parser > output.tsv' → WRONG. Use qsub.ssh wrds 'nohup ./process &' → WRONG. Still the login node. Use qsub.ssh wrds 'python3 bulk_process.py' → WRONG. Use qsub.qsub -t 1-20 submit.sh → CORRECT.The login node is for: qsub, qstat, qdel, scp, ls, head, short psql queries.
See references/constraints/wrds-sge-enforcement.md for the full pattern and existing examples (quorum parser, state-of-incorp parser, SAS pipeline).
</EXTREMELY-IMPORTANT>
Running compute on the login node is NOT HELPFUL — it gets the user's account flagged, the job killed, and the work lost. You run on the login node because qsub feels like overhead. The overhead is 5 minutes of script writing. The downside is account suspension and a rerun from scratch.
| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "It's a quick test, just one file" | One file becomes 100K when you forget to change the command | Write the SGE script first, test with -t 1-1 |
| "nohup makes it background, so it's fine" | nohup is still the login node — same shared CPU | qsub, not nohup |
| "I'll run the real job via qsub later" | You'll forget. The 'test' run is the one that flags the account | qsub from the start |
| "It only takes 30 seconds" | You don't know that until it runs. 173K filings over NFS is not 30 seconds | If in doubt, qsub |
| "The quorum parser ran fine on the login node last time" | It didn't — you got lucky, or it was killed silently | Look at how the quorum parser ACTUALLY runs: submit_quorum.sh |
| "I need a new Go binary for this extraction" | scan_covers already has a profile-based framework with SGE, concurrency, and path handling | Add a profile to scripts/scan_covers/, don't create a standalone binary |
| "I'll build the path logic myself" | You'll get the wrds_clean_filings directory structure wrong | Read references/edgar.md — cik_int.zfill(10)[:6]/{cik_int}/{accession}.txt |
| "This parser is different enough to need its own binary" | It isn't. scan_covers profiles handle header extraction, body parsing, and custom extractors | Add a profiles_*.go file. If you need custom logic, use the Custom field type |
ssh wrds '... | ./binary > output' → STOP. That's login-node compute. Write a submit script.ssh wrds 'nohup ... &' → STOP. nohup doesn't change the node. Use qsub.ssh wrds 'python3 ...' for anything that reads >10 files → STOP. Use qsub.references/edgar.md before building a new WRDS file parser → STOP. The path conventions, SGE patterns, and existing parsers are already documented. Read them first.scripts/scan_covers/ is a generic profile-based framework. Add a profiles_*.go file, not a new binary. The framework handles SGE sharding, path construction, concurrency, and form-type filtering.scripts/scan_covers/ → STOP. This framework exists precisely so you don't reinvent extraction infrastructure. Every standalone parser is technical debt that should have been a profile.scripts/scan_covers/ — generic profile-based Go framework with SGE, concurrency, path handlingprofiles_*.go file — not a standalone binary. The Profile struct supports pattern-based fields AND custom extractors (set FullBody: true for body-text searches like prospectus 485 filings — see profiles_proxy_advisors.go)references/edgar.md — path conventions, existing profiles, SGE submission patternsBuilding a standalone parser when scan_covers exists is NOT HELPFUL — it reinvents infrastructure that already handles SGE sharding, NFS concurrency, path construction, form-type filtering, and error handling. You built a 300-line standalone Go binary, ran it on the login node, got the path convention wrong, and spent 5 iterations fixing it. Adding a 60-line profile to scan_covers would have worked on the first try.
Every standalone EDGAR parser is technical debt. The scan_covers framework exists to eliminate this class of mistake.
</EXTREMELY-IMPORTANT>
WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at wrds-pgdata.wharton.upenn.edu:9737.
Before executing ANY WRDS query, you MUST:
This is not negotiable. Skipping sample inspection is NOT HELPFUL — the user builds analysis on data with undetected quality problems.
| Excuse | Reality | Do Instead | |--------|---------|------------| | "I'll add filters later" | You'll forget and pull bad data | Add filters NOW, before execution | | "User didn't specify filters" | Standard filters are ALWAYS required | Apply Critical Filters section defaults | | "Just a quick test query" | Test queries with bad filters teach bad patterns | Use production filters even for tests | | "I'll let the user filter in pandas" | Pulling millions of unnecessary rows wastes time/memory | Filter at database level FIRST | | "The query worked, so it's correct" | Query success ≠ data quality | INSPECT sample for invalid records | | "I can use f-strings for simple queries" | SQL injection risk + wrong type handling | ALWAYS use parameterized queries |
Before EVERY query execution:
For Compustat queries (comp.funda, comp.fundq):
indfmt = 'INDL'datafmt = 'STD'popsrc = 'D'consol = 'C'For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):
sharetype == 'NS'securitytype == 'EQTY'securitysubtype == 'COM'usincflg == 'Y'issuertype.isin(['ACOR', 'CORP'])For Form 4 queries (tr_insiders.table1):
For ALL queries:
.head() or .sample() BEFORE claiming successWriting SAS code that forces full table scans when indexes exist is NOT HELPFUL — the user's job runs 100x slower than necessary and may timeout. </EXTREMELY-IMPORTANT>
Before EVERY SAS program execution:
For merges/joins:
PROC SORT + DATA merge)defineKey/defineData/defineDone pattern correctlyh.output() uses double quotes for macro resolution (not single quotes)call missing() initializes hash data variables for non-matchesFor WHERE clauses (CRITICAL):
year(date), month(date), datepart(dt) wrapping indexed columnsBETWEEN "01jan&year."d AND "31dec&year."d range patternupcase(), substr() on indexed columnsyear() = X AND quarter() = Y)For batch processing:
#$ -t start-end) not sequential loop-sysparm (not -set or %sysget)#$ -l m_mem_free=4G minimum)For PROC SQL:
calculated keyword used for computed column references in HAVINGFor macros:
&year. not &year)options mprint mlogic symbolgen used during development| Excuse | Reality | Do Instead | |--------|---------|------------| | "Sort-merge is simpler to write" | Hash is 10x faster for lookup joins and requires no sorting | Write the hash — it's 5 extra lines | | "year(date) is readable" | Readable but prevents index usage — full table scan on millions of rows | Use BETWEEN with date literals | | "I'll parallelize later" | Later never comes and the job runs 18x slower sequentially | Write the SGE array job NOW | | "Single quotes work fine in hash" | Single quotes block macro resolution — your output dataset name is wrong | ALWAYS double quotes in h.output() | | "PROC SQL is easier than hash" | PROC SQL still sorts for joins — hash avoids all sorting | Hash for lookups, SQL only for complex aggregations | | "The job only takes a few minutes per year" | 18 years × 3 minutes = 54 minutes sequential vs 3 minutes parallel | SGE array for ANY multi-year job | | "%sysget works for getting the year" | Unreliable in SGE context — may return blank silently | Use -sysparm + &sysparm. |
where year(date) = anything → STOP. Use BETWEEN with date literals.proc sort; data; merge for a lookup join → STOP. Use hash object.%do year = start %to end loop → STOP. Use SGE array job.h.output(dataset: '...') → STOP. Use double quotes.-set or %sysget for SGE task parameters → STOP. Use -sysparm.See references/sas-etl.md for complete patterns:
| Dataset | Schema | Key Tables |
|---------|--------|------------|
| Compustat | comp | company, funda, fundq, secd |
| ExecuComp | comp_execucomp | anncomp |
| CRSP | crsp | dsf, msf, stocknames, ccmxpf_linkhist |
| CRSP v2 | crsp | dsf_v2, msf_v2, stocknames_v2 |
| Form 4 Insiders | tr_insiders | table1, header, company |
| ISS Incentive Lab | iss_incentive_lab | comppeer, sumcomp, participantfy |
| Capital IQ | ciq | wrds_compensation |
| IBES | tr_ibes | det_epsus, statsum_epsus |
| Form D / Reg D | wrdssec | wrds_vc_formd (parsed, 2000–2020); index: wrdssec_all.forms (all CIKs) or wrds_forms (filer only) — default to forms, see references/wrds-forms-tables.md |
| SEC EDGAR | wrdssec_all | forms (raw index, all CIKs per filing — default), wrds_forms (filer-only view), wciklink_cusip |
| SEC Search | wrds_sec_search | filing_view, registrant |
| EDGAR | edgar | filings, filing_docs |
| Fama-French | ff | factors_monthly, factors_daily |
| LSEG/Datastream | tr_ds | ds2constmth, ds2indexlist |
| FJC (Federal Judicial Center) | fjc | civil, criminal, bankruptcy, appeals |
| FJC Linking | fjc_linking | wrds_civil_link, wrds_criminal_link |
| SDC New Issues (IPO/SEO/Debt) | tr_sdc_ni | wrds_ni_details — equity + debt offerings |
| SDC Mergers & Acquisitions | tr_sdc_ma | wrds_ma_details — M&A transactions |
| TAQ Legacy | taq | mast_YYYY, wrds_iid_YYYY — second-level (1993–2006) |
| TAQ Millisecond | taqmsec | mastm_YYYY, wrds_iid_YYYY, ctm_YYYYMM, complete_nbbo_YYYYMMDD |
| Thomson S12 (Mutual Fund Holdings) | tfn (SAS) / tr_mutualfunds (PG) | s12 — 13F/N-CSR fund holdings |
| Thomson S34 (13-F Institutional) | tfn (SAS) / tr_13f (PG) | s34 — 13-F institutional holdings |
| FISD / Mergent (Bonds) | fisd_fisd | fisd_mergedissue, fisd_mergedissuer |
| PitchBook | pitchbk_companies_deals, pitchbk_investors_funds_lps, pitchbk_fund_returns | deal, company, fund, wrds_fund_returns — dealsize in USD millions |
Initialize PostgreSQL connection to WRDS:
import psycopg2
conn = psycopg2.connect(
host='wrds-pgdata.wharton.upenn.edu',
port=9737,
database='wrds',
sslmode='require'
# Credentials from ~/.pgpass
)
Configure authentication via ~/.pgpass with chmod 600:
wrds-pgdata.wharton.upenn.edu:9737:wrds:USERNAME:PASSWORD
Connect via SSH tunnel:
ssh wrds
This uses ~/.ssh/wrds_rsa for authentication.
Always include for clean fundamental data:
WHERE indfmt = 'INDL'
AND datafmt = 'STD'
AND popsrc = 'D'
AND consol = 'C'
Equivalent to legacy shrcd IN (10, 11):
df = df.loc[
(df.sharetype == 'NS') &
(df.securitytype == 'EQTY') &
(df.securitysubtype == 'COM') &
(df.usincflg == 'Y') &
(df.issuertype.isin(['ACOR', 'CORP']))
]
WHERE acqdisp = 'D' -- Dispositions
AND trancode IN ('S', 'D', 'G', 'F') -- Sales, Dispositions, Gifts, Tax
Always use parameterized queries (never string formatting):
Use scalar parameter binding for single values:
cursor.execute("""
SELECT gvkey, conm FROM comp.company WHERE gvkey = %s
""", (gvkey,))
Use ANY() for list parameters:
cursor.execute("""
SELECT * FROM comp.funda WHERE gvkey = ANY(%s)
""", (gvkey_list,))
Detailed query patterns and table documentation:
references/compustat.md - Compustat tables, ExecuComp, financial variablesreferences/crsp.md - CRSP stock data, CCM linking, v2 formatreferences/insider-form4.md - Thomson Reuters Form 4, rolecodes, insider typesreferences/iss-compensation.md - ISS Incentive Lab, peer companies, compensationreferences/formd.md - Form D / Reg D: denormalization gotcha, dedup pattern, exemption codes, post-2020 gap, SEC TSV downloadreferences/edgar.md - SEC EDGAR filings, URL construction, DCN vs accession numbersreferences/form-d.md - SEC Form D private placement filings, industry categories, exemption typesreferences/connection.md - Connection pooling, caching, error handlingreferences/taq.md - TAQ: master files, IID, raw tick processing (NBBO, VWAP, closing auctions), CRSP–TAQ merge, era transition (legacy vs millisecond)references/sas-etl.md - SAS hash objects, index-friendly WHERE, SGE array jobs, PROC SQL optimizationreferences/postgres-vs-sas.md - Decision guide: when to use PostgreSQL vs SAS for WRDS ETL (benchmarks, constraints, hybrid pattern)references/fjc.md - FJC Integrated Database: civil/criminal case data, NOS codes, securities litigation queries, firm linkingreferences/sdc-issuances.md - SDC New Issues: IPOs, SEOs, 144A equity, debt offerings — schema discovery, cleaning filters, CRSP/Compustat linkingreferences/fisd-bonds.md - FISD/Mergent: corporate bond issuances, IG vs HY, 144A vs registered, rating classification, TRACE linkingreferences/sdc-ma.md - SDC M&A: deal counts, PE/LBO vs strategic buyer, deal status codes, public vs private targetreferences/fund-formation.md - Fund formation: Form D (pooled investment funds), EDGAR N-2 (closed-end fund IPOs), Form ADV (RIA registrations)references/pitchbook.md - PitchBook: schema architecture, dealsize/fundsize in USD millions, dealdate outliers, CIK crosswalk, fund performance (wrds_fund_returns), PE/VC/fund formation patternsreferences/proxy-advisors.md - Proxy-advisor customer identification: 485BPOS/485APOS body scan for ISS/Glass Lewis/Egan-Jones name variants; CRSP MFDB lift to mgmt_cd × year; validates against chongshu published CSVWorking code from real projects:
examples/form4_disposals.py - Insider trading analysis (from SVB project)examples/wrds_connector.py - Connection pooling patternexamples/formd_regd.ipynb - Form D / Reg D: dedup validation, SEC TSV download, exemption trend chartsexamples/sdc_issuances_eda.ipynb - SDC New Issues: annual IPO/SEO/debt counts, 144A share, IG vs HY breakdownexamples/sdc_ma_eda.ipynb - SDC M&A: annual deal counts, PE/LBO vs strategic, public vs private target trendsexamples/fund_formation_eda.ipynb - Fund formation: Form D 3C.1/3C.7 counts, EDGAR N-2 closed-end fund IPOs, Form ADV RIA registrationsexamples/pitchbook_eda.ipynb - PitchBook: PE deal activity, VC rounds by stage, fund formation by vintage, IRR/TVPI by strategyexamples/voting_ownership_pipeline/ - Self-contained hybrid SAS+Python pipeline: ISS votes, 13-F inst. ownership, MF holdings via MFLINKS, merged panel. Canonical example of PostgreSQL vs SAS decision-making on WRDS. See README.md for architecture and usage.scripts/test_connection.py - Validate WRDS connectivityWRDS-provided samples at ~/resources/wrds-code-samples/:
ResearchApps/CCM2025.ipynb - Modern CRSP-Compustat mergeResearchApps/ff3_crspCIZ.ipynb - Fama-French factor constructioncomp/sas/execcomp_ceo_screen.sas - ExecuComp patternsWhen querying historical data, leverage current date context for dynamic range calculations.
Current date is automatically available via datetime.now(). Apply this to:
Implement dynamic date ranges in queries:
from datetime import datetime, timedelta
# Query last 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)
query = """
SELECT * FROM comp.funda
WHERE datadate BETWEEN %s AND %s
"""
df = pd.read_sql(query, conn, params=(start_date, end_date))
Always incorporate current date awareness in date-dependent queries to ensure results remain fresh across time.
testing
Internal skill for literature review and source materialization. Called after brainstorm, before setup. NOT user-facing.
documentation
This skill should be used when the user asks to 'write a paper', 'start a writing project', 'draft an article', 'write about', 'brainstorm writing topics', 'gather sources for a paper', 'what should I write about', or needs the writing workflow entry point for any writing task.
testing
Validate draft sections cover all PRECIS claims before review.
testing
Internal skill for creating PRECIS.md, OUTLINE.md, and ACTIVE_WORKFLOW.md. Called after brainstorm sources are gathered.