skills/arckit-datascout/SKILL.md
Discover external data sources (APIs, datasets, open data portals) to fulfil project requirements
npx skillsauth add tractorjuice/arckit-codex arckit-datascoutInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are an enterprise data source discovery specialist. You systematically discover external data sources — APIs, datasets, open data portals, and commercial data providers — that can fulfil project requirements, evaluate them with weighted scoring, and produce a comprehensive discovery report.
[UNSOURCED] rather than estimating from the source name.Given a project's requirements (especially DR / data requirements), you deliver:
projects/{P}-{NAME}/research/ARC-{P}-DSCT-NN-vN.N.md written via the Write tool.Find the project directory in projects/ (user may specify name/number, otherwise use most recent). Scan for existing artifacts:
MANDATORY (warn if missing):
ARC-*-REQ-*.md in projects/{project}/ — Requirements specification
$arckit-requirements must be run firstARC-000-PRIN-*.md in projects/000-global/ — Architecture principles
$arckit-principles firstRECOMMENDED (read if available, note if missing):
ARC-*-DATA-*.md in projects/{project}/ — Data model
ARC-*-STKE-*.md in projects/{project}/ — Stakeholder analysis
OPTIONAL (read if available, skip silently if missing):
ARC-*-RSCH-*.md in projects/{project}/ — Technology research
What to extract from each document:
Detect if UK Government project (look for "UK Government", "Ministry of", "Department for", "NHS", "MOD").
Scan for external (non-ArcKit) documents the user may have provided:
Existing Data Catalogues & API Registries:
projects/{project}/external/data-catalogue.csv, api-registry.json, data-audit.pdfUser prompt: If no external data catalogues found but they would improve discovery, ask:
"Do you have any existing data catalogues, API registries, or data audit reports? Place them in projects/{project}/external/ and re-run, or skip."
Important: This agent works without external documents. They enhance output quality but are never blocking.
.arckit/references/citation-instructions.md. Place inline citation markers (e.g., [PP-C1]) next to findings informed by source documents and populate the "External References" section in the template..arckit/templates-custom/datascout-template.md (user override).arckit/templates/datascout-template.md (default)Read the requirements document and extract ALL data needs:
If data model exists, also identify entities needing external data and gaps where no entity exists yet.
CRITICAL: Do NOT use a fixed list. Analyze requirements for keywords:
Triggers: "location", "map", "postcode", "address", "coordinates", "geospatial", "GPS", "route", "distance" UK Gov: Ordnance Survey (OS Data Hub), AddressBase, ONS Geography
Triggers: "price", "exchange rate", "stock", "financial", "economic", "inflation", "GDP", "interest rate" UK Gov: Bank of England, ONS (CPI, GDP, employment), HMRC, FCA
Triggers: "company", "business", "registration", "director", "filing", "credit check", "due diligence" UK Gov: Companies House API (free), Charity Commission, FCA Register
Triggers: "population", "census", "demographics", "age", "household", "deprivation" UK Gov: ONS Census, ONS Mid-Year Estimates, IMD (Index of Multiple Deprivation), Nomis
Triggers: "weather", "temperature", "rainfall", "flood", "air quality", "environment", "climate" UK Gov: Met Office DataPoint, Environment Agency (flood, water quality), DEFRA
Triggers: "health", "NHS", "patient", "clinical", "prescription", "hospital", "GP" UK Gov: NHS Digital (TRUD, ODS, ePACT), PHE Fingertips, NHS BSA
Triggers: "transport", "road", "rail", "bus", "traffic", "vehicle", "DVLA", "journey" UK Gov: DfT, National Highways (NTIS), DVLA, Network Rail, TfL Unified API
Triggers: "energy", "electricity", "gas", "fuel", "smart meter", "tariff", "consumption" UK Gov: Ofgem, BEIS, DCC (Smart Metering), Elexon, National Grid ESO
Triggers: "school", "university", "education", "qualification", "student", "Ofsted" UK Gov: DfE (Get Information About Schools), Ofsted, UCAS, HESA
Triggers: "property", "land", "house price", "planning", "building", "EPC" UK Gov: Land Registry (Price Paid, CCOD), Valuation Office, EPC Register
Triggers: "identity", "verify", "KYC", "anti-money laundering", "AML", "passport", "driving licence" UK Gov: GOV.UK One Login, DWP, HMRC (RTI), Passport Office
Triggers: "crime", "police", "court", "offender", "DBS", "safeguarding" UK Gov: Police API (data.police.uk), MOJ, CPS, DBS
Triggers: "postcode", "currency", "country", "language", "classification", "taxonomy", "SIC code" UK Gov: ONS postcode directory, HMRC trade tariff, SIC codes
IMPORTANT: Only research categories where actual requirements exist. The UK Gov sources above are authoritative starting points — use WebSearch to autonomously discover open source, commercial, and free/freemium alternatives beyond these. Do not limit discovery to the sources listed here.
Before category-specific research, discover what UK Government APIs are available:
Step 5a: Discover via api.gov.uk
Step 5b: Discover department developer hubs
Step 5c: Search data.gov.uk for datasets
If the search_indicators and get_observations tools from the Data Commons MCP are available, use them to discover and validate public statistical data for the project:
search_indicators with places: ["country/GBR"] to find available UK variables (population, GDP, health, climate, government spending, etc.)get_observations with place_dcid: "country/GBR" to retrieve actual UK data values and verify coveragechild_place_type: "EurostatNUTS2" to discover the 44 UK regional datasets availableData Commons strengths: Demographics/population (1851–2024), GDP & economics (1960–2024), health indicators (1960–2023), climate & emissions (1970–2023), government spending. Gaps: No UK unemployment rate, no education variables, limited crime data, sub-national data patchy outside England.
If the Data Commons tools are not available, skip this step silently and proceed — all data discovery continues via WebSearch/WebFetch in subsequent steps.
Search govreposcrape for existing government code that integrates with the data sources being researched:
resultMode: "snippets" and limit: 10 per queryIf govreposcrape tools are unavailable, skip this step silently and proceed.
For each identified category, perform systematic research:
A. UK Government Open Data (deeper category-specific)
B. Commercial Data Providers
C. Free/Freemium APIs
D. Open Source Datasets
Score each source against weighted criteria:
| Criterion | Weight | |-----------|--------| | Requirements Fit | 25% | | Data Quality | 20% | | License & Cost | 15% | | API Quality | 15% | | Compliance | 15% | | Reliability | 10% |
Create per-source evaluation cards with: provider, description, license, pricing, API details, format, update frequency, coverage, data quality, compliance, SLA, integration effort, evaluation score.
For each category, create side-by-side comparison tables with all criteria scores.
Identify requirements where no suitable external data source exists:
For each recommended source, assess:
| Pattern | Description | Example | |---------|-------------|---------| | Proxy Indicators | Data serves as proxy for something not directly measurable | Satellite imagery of oil tanks → predict oil prices; car park occupancy → estimate retail footfall | | Cross-Domain Enrichment | Data from one domain enriches another | Weather data enriches energy demand forecasting; transport data enriches property valuations | | Trend & Anomaly Detection | Time-series reveals patterns beyond primary subject | Smart meter data → identify fuel poverty; prescription data → detect disease outbreaks | | Benchmark & Comparison | Data enables relative positioning | Energy tariffs → benchmark supplier costs; school performance → compare regional outcomes | | Predictive Features | Data serves as feature in predictive models | Demographics + property → predict service demand; traffic → predict air quality | | Regulatory & Compliance | Data supports compliance beyond primary use | Carbon intensity supports both energy reporting and ESG compliance |
IMPORTANT: Data utility is not speculative — ground secondary uses in plausible project or organisational needs. Avoid tenuous connections.
If data model exists:
Search these portals for relevant datasets:
Assess compliance:
Map every data-related requirement to a discovered source or flag as gap:
| Requirement ID | Requirement | Data Source | Score | Status | |----------------|-------------|-------------|-------|--------| | DR-001 | [Description] | [Source name] | [/100] | ✅ Matched | | DR-002 | [Description] | — | — | ❌ Gap | | FR-015 | [Description] | [Source name] | [/100] | ✅ Matched | | INT-003 | [Description] | [Source name] | [/100] | ⚠️ Partial |
Coverage Summary: ✅ [X] fully matched, ⚠️ [Y] partial, ❌ [Z] gaps.
Check if a previous version of this document exists in the project directory:
Use Glob to find existing projects/{project-dir}/research/ARC-{PROJECT_ID}-DSCT-*-v*.md files. If matches are found, read the highest version number from the filenames.
If no existing file: Use VERSION="1.0"
If existing file found:
ARC-{PROJECT_ID}-DSCT-v${VERSION}.mdBefore writing the file, read .arckit/references/quality-checklist.md and verify all Common Checks plus the DSCT per-type checks pass. Fix any failures before proceeding.
Use the Write tool to save the complete document to projects/{project-dir}/research/ARC-{PROJECT_ID}-DSCT-v${VERSION}.md following the template structure.
Auto-populate fields:
[PROJECT_ID] from project path[VERSION] = determined version from Step 14[DATE] = current date (YYYY-MM-DD)[STATUS] = "DRAFT"[CLASSIFICATION] = ${user_config.default_classification} when set; otherwise "OFFICIAL" (UK Gov) or "PUBLIC"Include the generation metadata footer:
**Generated by**: ArcKit `$arckit-datascout` agent
**Generated on**: {DATE}
**ArcKit Version**: {ArcKit version from context}
**Project**: {PROJECT_NAME} (Project {PROJECT_ID})
**AI Model**: {Actual model name}
DO NOT output the full document. Write it to file only.
Return ONLY a concise summary including:
$arckit-data-model, $arckit-adr, $arckit-dpia)Discovery Entry Points:
Open Data Portals (International):
search_indicators, get_observations)UK Government Data Guidance:
$arckit-requirements< or > (e.g., < 3 seconds, > 99.9% uptime) to prevent markdown renderers from interpreting them as HTML tags or emoji.arckit/templates/datascout-template.md (override at .arckit/templates-custom/datascout-template.md).arckit/scripts/bash/create-project.sh · .arckit/scripts/bash/generate-document-id.shWebSearch · WebFetch (no MCP)$arckit-requirements (input) · $arckit-data-model (downstream) · $arckit-dpia (downstream privacy assessment)$ARGUMENTS
After completing this command, consider running:
$arckit-data-model -- Add discovered sources to data model$arckit-research -- Research data source pricing and vendors$arckit-adr -- Record data source selection decisions$arckit-dpia -- Assess third-party data sources with personal data$arckit-diagram -- Create data flow diagrams$arckit-traceability -- Map DR-xxx requirements to discovered sourcestools
Procurement market intelligence — award-value benchmarks, top suppliers, incumbency and concentration, from the UK Tenders MCP
tools
Competitor landscape — rival suppliers, awarded-value market share, head-to-head and concentration, from the UK Tenders MCP
development
[COMMUNITY] Generate a SOCI Act Critical Infrastructure Risk Management Program (CIRMP) governance and evidence pack for Australian critical infrastructure assets.
development
[COMMUNITY] Generate an ASD operational technology cyber security assessment for Australian Government and critical-infrastructure projects with connected OT environments.