skills/niche-signal-discovery/SKILL.md
Discover niche first-party signals that differentiate Closed Won vs Closed Lost accounts for ICP analysis. Use when the user provides won/lost customer domain lists and wants differential signals (website content, job listings, tech stack, maturity markers) to build account scoring models and prospecting criteria. Triggers: ICP analysis, niche signals, won vs lost analysis, differential signals, signal discovery, ICP signal report, account scoring signals, lead scoring, first-party signals, buyer signals. Before reading this file, first read deepline-gtm to understand the Deepline CLI tool and how to use it. Then read this file for guidance on the task.
npx skillsauth add getaero-io/gtm-eng-skills niche-signal-discoveryInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Discover differential signals between Closed Won and Closed Lost accounts by extracting multi-page website content and job listings, then computing Laplace-smoothed lift scores to identify what distinguishes buyers from non-buyers.
deepline enrich. No separate API keys for exa/crustdata/apollo etc.Use deepline enrich for all enrichment, deepline tools execute for one-offs, deepline playground for inspection. Reruns are idempotent. Refer to deepline-gtm for command patterns and provider playbooks.
0. Discover target company (what they sell, who they sell to)
0.5. Discover ecosystem (competitors, tech stack, buyer personas)
1. Prepare input CSV (deduplicate within won/lost groups)
1.0.5 Build "do not re-contact" index from user's existing list (scripts/dedupe_utils.py)
1.5. Generate vertical-specific configs (keywords, tools, job roles)
2. Multi-page website + job extraction (deepline enrich)
3. Quality gate — verify file completeness + coverage (>80%)
3.5. Review configs against enriched data
4. Differential analysis (scripts/analyze_signals.py)
5. Generate report — every top signal must include cited evidence
6. Signal interpretation review
7. Top 10 net-new prospects [REQUIRED] + contacts/emails [optional, costs credits]
Step 7 is required. A signal report without 10 actionable companies forces the reader to do their own prospecting pass — exactly the expensive thing they wanted to skip. Contacts/emails are optional only because they cost extra credits; always offer them.
Highest → lowest confidence:
When website signals fail: For B2B back-office tools (AR, billing, compliance), buyers don't publish their pain on marketing pages. Prioritize jobs + tech stack + firmographics for these verticals.
CRM fields populated by AE activity — catalyst note count, OCR-derived counts (number_of_champions_c, number_of_decision_makers_c), MEDDPICC picklists, any "did the AE do X on this opp" field — correlate with win-rate as engagement artifacts, not causal signals. They get filled in after the AE decides an opp is worth working. Never use them as scoring inputs. On one real run, catalyst notes showed "109x lift" — almost made the TL;DR before we caught the direction of causality.
Rule of thumb: every scoring input must be observable BEFORE the AE touches the account. Read references/scoring-pitfalls.md for the full list and the "safer alternative read" for loss-reason data.
Do this FIRST. The entire pipeline (exa query, keywords, tech stack, job roles) adapts based on this discovery; skipping it produces generic/irrelevant signals.
deeplineagent: "Research {{company-domain}}. Summarize what the company sells, who they sell to, what makes them different, and any example customers."
Document: (1) product category, (2) target buyer persona, (3) key differentiation, (4) example customers.
Three parallel deeplineagent queries:
"{product category} software alternatives competitors" → 3-5 names"{buyer persona} software stack" → 10-15 tools by category"{buyer persona} job titles" → 10-15 title variationsThese feed Step 1.5 config generation.
domain,status
customer1.com,won
non-customer1.com,lost
Deduplicate within the input. If a domain appears in BOTH won and lost (same company, multiple deals), Deepline only fetches job listings once — silently undercounting won_with_jobs. Remove ALL rows for cross-group domains:
from collections import Counter
counts = Counter(r['domain'] for r in rows)
duplicate_domains = {d for d, c in counts.items() if c > 1}
# Drop every row in duplicate_domains, not just one copy.
Before any prospects ship in Step 7, dedupe candidates against whatever "already known" list the user provides — customers, CRM export, past outbound, a previous run's output. Always ask explicitly; if the user has no list, note it as a caveat in the final report rather than silently skipping.
Order: apex domain first, fuzzy company name as fallback. Use the shipped helper — it handles public-suffix multi-label TLDs (co.uk, co.jp, com.au) and corporate-suffix stripping:
python3 scripts/dedupe_utils.py --selftest # one-time sanity check
python3 scripts/dedupe_utils.py \
--existing customers.csv --candidates prospects_raw.csv \
--out-actionable prospects_actionable.csv --out-matched already_known.csv
Don't silently drop CRM matches — categorize them: Net-new / Account-only / Re-engage / Active-open / Current-customer.
Read references/dedupe.md for the failure modes (raw-string match missing amsynergy.nikon.com → nikon.com cost 24 of 50 prospects in one run), category definitions, and library usage.
Create three JSON files in output/{{company}}/:
{{company}}-keywords.json # product category, pain language, competitor names, maturity terms
{{company}}-tools.json # niche SaaS tools by category
{{company}}-job-roles.json # buyer persona job titles
Read references/keyword-catalog.md for the JSON schema, generation patterns, and multi-vertical examples (creative ops, AR automation, sales engagement, developer tools).
Validation: Do the configs match the target's vertical and buyer persona? If not, refine based on Step 0/0.5 findings.
Never scrape just the homepage. Use Serper to discover relevant pages, Firecrawl to extract content.
Step 2a - Discover pages with Serper (0.02 credits/company):
deepline enrich \
--input output/{{company}}-icp-input.csv \
--output output/{{company}}-discovered.csv \
--with '{"alias":"pages","tool":"serper_google_search","payload":{"query":"site:{{domain}} product OR features OR integrations OR customers OR security OR pricing OR careers OR about"}}' \
--json
Adapt the query by vertical: add compliance OR audit for back-office, documentation OR api for developer tools, portfolio OR workflow for creative tools.
Step 2b - Scrape top 5 pages with Firecrawl (0.05 credits/company):
Extract URLs from Serper results, then scrape each:
deepline enrich \
--input output/{{company}}-urls.csv \
--output output/{{company}}-scraped.csv \
--with '{"alias":"content","tool":"firecrawl_scrape","payload":{"url":"{{url}}"}}' --json
Aggregate scraped pages back into one row per domain, formatted as {"data":{"results":[{url, title, text}]}} for the analysis script.
Step 2c - Job listings with Crustdata (0.40 credits/company):
deepline enrich \
--input output/{{company}}-aggregated.csv \
--output output/{{company}}-enriched.csv \
--with '{"alias":"jobs","tool":"crustdata_job_listings","payload":{"companyDomains":["{{domain}}"]}}' --json
Total cost: ~0.47 credits/company. Get user approval first. Example: "60 companies x 0.47 = ~28 credits."
deepline enrich returns to terminal before OS buffers fully flush. Running the analysis script immediately can read a partially-written file and produce won_with_jobs: 0 even when data is fine. Always verify:
INPUT_ROWS=$(wc -l < output/{{company}}-icp-input.csv)
OUTPUT_ROWS=$(wc -l < output/{{company}}-enriched.csv)
echo "Input: $INPUT_ROWS, Output: $OUTPUT_ROWS" # should match
Then spot-check that won rows have job data, that website coverage is >80%, and that average content depth is 6-8 pages / 12-20K chars per company.
Read references/quality-gate.md for the full verification script, the buffer-flush retry pattern, and the "auto-extracted domain validation" check that has caught up to 53% false-positive rates in CRM-exported customer lists.
deepline playground output/{{company}}-enriched.csv
Red flags:
Fix and regenerate configs if needed.
python3 scripts/analyze_signals.py \
--input output/{{company}}-enriched.csv \
--keywords output/{{company}}-keywords.json \
--tools output/{{company}}-tools.json \
--job-roles output/{{company}}-job-roles.json \
--output output/{{company}}-analysis.json
The script computes substring-match presence, Laplace-smoothed lift, source breakdown (website/jobs/both), tech-stack mentions, job-role prevalence, anti-fit signals, and per-keyword evidence quotes (±40 chars with URLs) — the evidence array is what Step 5 renders.
Read references/report-template.md for the full report structure (Quick Reference Dashboard at the top, then detail sections), the signal-strength visual scale, Apollo URL format, and all quality rules. Critical rules in brief:
15% (6), not just 15%); sample sizes in headers (Won (n=37))*(insufficient evidence)*.Read references/signal-interpretation.md before writing interpretation columns. Key rules:
10 companies are required for every run; contacts + emails are optional (additional Deepline credits). Always offer contact discovery; only run it if the user approves the spend.
# Companies only — no extra credits beyond Step 2 enrichment:
python3 scripts/find_contacts.py --input prospects_actionable.csv --output top10.csv --top 10 --no-contacts
# Companies + contacts + emails — asks for credit approval.
# --roles is REQUIRED in --contacts mode and must be the buyer-persona job
# titles surfaced in YOUR Step 0/0.5 (not a stale list from a different vertical):
python3 scripts/find_contacts.py --input prospects_actionable.csv --output top10.csv --top 10 \
--contacts --roles "<persona job titles from Step 0.5>"
When --contacts is on, the orchestrator runs a 3-phase chain via Deepline:
company_to_contact_by_role_waterfall (free, mature companies)exa_search_people fallback for any company Phase 1 missed — mandatory. On the run that motivated this, Phase 1 returned 0 contacts on all 10 top prospects (small/non-US industrial); Exa found 15 real contacts at 6 of those 10 in the same pass.name_and_domain_to_email_waterfall with linkedin_url supplied and apex-domain validation — providers return stale addresses (@orbitalatk.com for someone now at X-Bow, personal Gmails, wrong-company false positives). Mismatched apex → publish "(email not found)", keep the raw value in raw_email for auditing.Read references/step-7-prospects.md for the required vs. optional output fields, the prospect-card skeleton, the Phase 2 Exa guardrails (title parsing + company-match filter), and the "10 is a ceiling, not a floor" guidance.
After enrichment, each row has:
website column → JSON: {"data":{"results":[{text, url, title}]}} (aggregated from Firecrawl scrapes)jobs column → JSON: {"result":{"listings":[{title, description, url}]}} (Crustdata format - note result not data, title not job_title)scripts/analyze_signals.py auto-detects __dl_full_result__ columns; override with --website-col N --jobs-col N for other column names.
amsynergy.nikon.com ≠ nikon.com for naive comparison. Always use extract_apex(). 24 of 50 "net-new" prospects in one real run were already in the CRM as parent-domain entries the raw-string dedupe missed.Read references/pitfalls.md for the full 18-item list including substring false positives, vendor-vs-buyer signal context, back-office-tool interpretation, and shipping-without-prospects.
Read references/proven-signals.md for typical lift ranges across verticals (analyst validation 4.5-6.5x, hiring signals 3.8-5.5x, compliance infra 2.1-6.5x, etc.), high-confidence anti-fit patterns (consumer signals 0.2x, retention/churn 0.2-0.4x), and a starter 0-100 scoring model with three tiers (Core Fit / Buying Intent / Infrastructure Readiness).
references/keyword-catalog.md — JSON schema + multi-vertical examples for Step 1.5 config generationreferences/dedupe.md — Step 1.0.5 dedupe failure modes, categorization rules, library usagereferences/quality-gate.md — Step 3 verification scripts, buffer-flush retry pattern, auto-extracted-domain validationreferences/report-template.md — Step 5 full report structure, signal-strength scale, Apollo URL format, all quality rulesreferences/signal-interpretation.md — Step 6 buyer-vs-seller-vs-competitor rulesreferences/step-7-prospects.md — Step 7 prospect-card skeleton, Exa guardrails, Phase 3 apex validationreferences/scoring-pitfalls.md — Confirmation-biased CRM fields to exclude from scoringreferences/pitfalls.md — Full 18-item pitfalls listreferences/proven-signals.md — Typical lift ranges + scoring model guidancescripts/analyze_signals.py — Step 4 differential analysis. Auto-detects columns.scripts/dedupe_utils.py — Step 1.0.5 + Step 7 email validation. extract_apex(), norm_name(), match_against_existing(). Stdlib only. --selftest flag for one-time install verification.scripts/find_contacts.py — Step 7 orchestrator. --contacts / --no-contacts toggle, 3-phase Deepline chain.{"result":{"listings":[]}} wrapper. Verified E2E on 15 companies.--contacts/--no-contacts), references/scoring-pitfalls.md warning about confirmation-biased CRM fields, mandatory citation rule. Shipped scripts/dedupe_utils.py + scripts/find_contacts.py. Aggressively trimmed inline detail to references — moved Step 3 quality gate, Step 5 quality rules, Common Pitfalls (items 7-15), and Proven Signal Patterns into references/. SKILL.md went from 650 to ~250 lines via progressive disclosure.devops
Use this skill when answering business analytics, RevOps, GTM metric, pipeline, revenue, funnel, customer, or warehouse questions with Deepline. Triggers on phrases like 'query Snowflake', 'analyze pipeline', 'total ACV', 'break down by quarter', 'use the semantic layer', 'run a semantic query', or any use of snowflake_get_semantic_layer / snowflake_run_semantic_query. Skip prospecting, enrichment, contact finding, outbound, or personalization workflows; use deepline-gtm for those.
tools
Use for Deepline SDK/CLI V2 GTM work: route, build, run, debug, and export durable plays; find companies or contacts; enrich email, phone, LinkedIn, or custom signals; migrate Clay-like tables; configure cron/webhook/API plays; control Deepline spend; and recover from play/tool shape failures. Triggers on deepline CLI work, CSV enrichment, prospecting, waterfalls, outbound lists, provider routing, play authoring, SDK V2 syntax, staleAfterSeconds, datasets, and eval-style GTM tasks.
development
Use to prospect, enrich, qualify, and activate outbound — especially when users mention Deepline, CSV processing, lead/account/contact research, waterfall enrichment, email or LinkedIn lookup, personalization, scoring, or campaign activation. Route CSV-heavy and provider-driven requests here, then rely on linked sub-docs and provider playbooks for execution details. Providers: adyntel, ai_ark, apify, apollo, attio, bettercontact, bloomberry, builtwith, cloudflare, contactout, crustdata, crustdata-v2, customer_db, dataforseo, datagma, deepline_native, deeplineagent, discolike, dropleads, exa, findymail, firecrawl, forager, fullenrich, generic_http, google_ads_audiences, heyreach, hubspot, hunter, icypeas, instantly, ipqs, leadmagic, lemlist, limadata, linkedin_ads_audiences, linkedin_scraper, lusha, meta_audiences, openmart, openwebninja, parallel, peopledatalabs, predictleads, prospeo, rocketreach, salesforce, serper, slack, smartlead, snowflake, theirstack, trestle, upcell, wiza, zerobounce.
testing
Run a quick Deepline demo recipe to show the user how Deepline works.