skills/job-scan/SKILL.md
Discover new job postings from ATS platforms — scans Greenhouse, Ashby, Lever APIs, filters by title, deduplicates, and feeds the evaluation pipeline.
npx skillsauth add khetansarvesh/ai_skills_repo job-scan
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Discover new job postings by scanning ATS platforms (Greenhouse, Ashby, Lever) directly via their public APIs. Filters by title keywords, deduplicates against Notion, and adds new offers to Notion with status "Scanned" for evaluation.
- (.env at repo root)
- Playwright (npx playwright install chromium), required for liveness verification

Navigate directly to each company's careers_url with a browser:
When to use: For companies with careers_url configured. Best for companies you check regularly.
The scout_specials.py script uses this level. Hits public JSON APIs directly:
| ATS | API Endpoint | Parser |
| -------------- | ----------------------------------------------------------- | ------------------------------------- |
| Greenhouse | https://boards-api.greenhouse.io/v1/boards/{company}/jobs | jobs[] → title, absolute_url |
| Ashby | https://api.ashbyhq.com/posting-api/job-board/{company} | jobs[] → title, jobUrl |
| Lever | https://api.lever.co/v0/postings/{company} | Root array [] → text, hostedUrl |
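The parser column above can be sketched as one normalizing function. This is a minimal illustration, not the contents of api_parsers.py: the function name `parse_jobs` and the output keys `title` and `url` are assumptions; only the response field names come from the table.

```python
from typing import Any


def parse_jobs(ats: str, payload: Any) -> list:
    """Normalize an ATS API response into {"title", "url"} records (hypothetical helper)."""
    if ats == "greenhouse":
        return [{"title": j["title"], "url": j["absolute_url"]} for j in payload.get("jobs", [])]
    if ats == "ashby":
        return [{"title": j["title"], "url": j["jobUrl"]} for j in payload.get("jobs", [])]
    if ats == "lever":
        # Lever returns a bare JSON array at the root rather than an object.
        return [{"title": j["text"], "url": j["hostedUrl"]} for j in payload]
    raise ValueError(f"unsupported ATS: {ats}")
```

Keeping every board behind one output shape means later stages never need to know which ATS a candidate came from.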
Auto-detection: The script detects the ATS provider from the careers_url pattern:
- jobs.ashbyhq.com/{slug} → Ashby API
- jobs.lever.co/{slug} → Lever API
- api field containing "greenhouse" → Greenhouse API

Concurrency: 10 parallel workers (HTTP fetch pool, not Playwright).
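A rough sketch of the detection rules, assuming the company record carries a `careers_url` plus an optional `api` string. The function name `resolve_api` is hypothetical, and treating the `api` field as the ready-to-use Greenhouse endpoint is an assumption; the real logic lives in api_resolver.py and may differ.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_api(careers_url: str, api: str = "") -> Optional[str]:
    """Map a company's careers_url (or api field) to an ATS endpoint, else None."""
    parsed = urlparse(careers_url)
    slug = parsed.path.strip("/").split("/")[0]  # first path segment
    if parsed.netloc == "jobs.ashbyhq.com" and slug:
        return f"https://api.ashbyhq.com/posting-api/job-board/{slug}"
    if parsed.netloc == "jobs.lever.co" and slug:
        return f"https://api.lever.co/v0/postings/{slug}"
    if "greenhouse" in api.lower():
        # Assumption: the api field already holds the full Greenhouse endpoint.
        return api
    return None  # no detectable API: the company is reported as skipped
```

A `None` result corresponds to the companies that fall through to the WebSearch fallback.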
Use WebSearch with site: filters to discover companies NOT yet in tracked_companies:
- site:jobs.ashbyhq.com "AI Engineer" OR "ML Engineer"
- site:job-boards.greenhouse.io "Research Scientist" OR "Applied AI"

CRITICAL: WebSearch results may be stale (Google caches for weeks). Before adding to Notion, verify liveness with Playwright or check-liveness.mjs.
- scout_specials.py: fast, reliable, zero-token (run this most often)

Title filter keywords are loaded from the Notion Preferences page and applied centrally in dedup_liveness_upload.py (not during collection). View current keywords:
python3 scripts/notion/page_preferences.py --title-filter --pretty
Rule: At least 1 positive keyword must match AND no negative keyword may match (case-insensitive).
Matching logic:
- \bAI\b matches "AI Engineer" but NOT "Chair" or "MAID"

Edit keywords directly in the Notion Preferences page under the "Positive Job Title Filters" and "Negative Job Title Filters" headers.
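The rule and the word-boundary matching can be illustrated with a short sketch. The function names are hypothetical and the keyword lists are passed in directly here; in the pipeline they come from the Notion Preferences page.

```python
import re


def compile_keywords(keywords):
    """Wrap each keyword in word boundaries and match case-insensitively."""
    return [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords]


def passes_title_filter(title, positive, negative):
    """True when at least one positive keyword matches and no negative keyword does."""
    has_positive = any(p.search(title) for p in compile_keywords(positive))
    has_negative = any(n.search(title) for n in compile_keywords(negative))
    return has_positive and not has_negative
```

With `positive=["AI"]`, the title "AI Engineer" passes while "Chair" and "MAID" do not, because `\b` requires a non-word character (or the string edge) on both sides of the keyword.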
Before adding any offer, check against the Notion applications DB:
This prevents re-scanning the same offer even if it appears across multiple levels.
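As an illustration only, a URL-based duplicate check might look like the sketch below. The normalization choices (lowercase scheme and host, drop the fragment and trailing slash, keep the query string) and both function names are assumptions; the actual comparison is done against the Notion applications DB by db_applications.py.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a job URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )


def drop_known(candidates, known_urls):
    """Keep candidates whose URL is neither already known nor repeated within this scan."""
    seen = {normalize_url(u) for u in known_urls}
    fresh = []
    for candidate in candidates:
        key = normalize_url(candidate["url"])
        if key not in seen:
            seen.add(key)
            fresh.append(candidate)
    return fresh
```

The query string is kept because some boards carry the job ID there, so stripping it could merge distinct postings.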
Runs on ALL candidates in the pipeline (API + WebSearch + user input). The dedup_liveness_upload.py script checks every URL after dedup, regardless of source.
Using the script:
node skills/job-scan/scripts/liveness_helpers/check-liveness.mjs URL1 URL2 URL3
# or from a file:
node skills/job-scan/scripts/liveness_helpers/check-liveness.mjs --file urls.txt
Classification logic (liveness_helpers/liveness-core.mjs):
| Signal | Result |
| --------------------------------------------------------------------------------------- | ------------- |
| HTTP 404 or 410 | Expired |
| URL contains ?error=true (Greenhouse redirect) | Expired |
| Page text matches expired patterns ("job no longer available", "position filled", etc.) | Expired |
| Visible Apply/Submit button in main content (not nav/footer) | Active |
| Page has content but no Apply button | Uncertain |
| Content < 300 chars (nav/footer only) | Expired |
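The table can be read as a decision function. The following is a sketch in Python, not a port of liveness-core.mjs: the function name, the two sample patterns, and the order in which the signals are checked are assumptions, and the inputs (status, text, Apply-button visibility) are taken as already extracted from the page.

```python
import re

# Two sample patterns from the table; the real list is longer and multilingual.
EXPIRED_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"job no longer available", r"position filled")
]


def classify(status, url, page_text, has_visible_apply):
    """Return "expired", "active", or "uncertain" for one fetched page."""
    if status in (404, 410):
        return "expired"
    if "?error=true" in url:  # Greenhouse redirect for removed postings
        return "expired"
    if any(p.search(page_text) for p in EXPIRED_PATTERNS):
        return "expired"
    if has_visible_apply:
        return "active"
    if len(page_text.strip()) < 300:  # nav/footer only
        return "expired"
    return "uncertain"
```

Checking the expired signals first means a page that still renders an Apply button but says the position is filled is treated as expired.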
Expired patterns detected (multilingual):
Visibility check: Only counts Apply buttons that are:
- Not inside <nav>, <header>, or <footer>
- Not marked aria-hidden="true"
- Not hidden with display: none or visibility: hidden

This is the standard pipeline that runs every time /job-scan is invoked. ALL steps are mandatory: do NOT skip WebSearch or liveness.
Important: Title filtering is centralized in dedup_liveness_upload.py, NOT in the collection steps. All sources just dump raw candidates into candidate_store.json. Filtering happens once at the end.
Step 1: Run API scan, writing to candidate store. NEVER use --hours 0 — it disables the time filter entirely and floods the pipeline with stale jobs (some months old). The default 24h filter is correct for daily scans. For weekly scans, use --hours 168. For broader catch-up scans, use --hours 720 (30 days) at most:
python3 skills/job-scan/scripts/scout_specials.py
Step 1.5 (MANDATORY): WebSearch fallback for companies without ATS APIs:
scout_specials.py prints companies it skipped (no detectable API) with their careers URLs. ALWAYS run WebSearch for these skipped companies — they often include top employers (OpenAI, Meta, Apple, etc.) that don't have public JSON APIs.
How to build the query: Use site:{careers_page_url} followed by positive keywords from the Notion Preferences page (the same keywords used by the title filter). Pick the most relevant 3-5 keywords for each query.
Example for Google DeepMind (https://deepmind.google/about/careers/):
site:deepmind.google "Research Engineer" OR "ML Engineer" OR "Research Scientist" OR "AI Engineer"
Example for OpenAI (https://openai.com/careers):
site:openai.com/careers "Research Engineer" OR "Applied AI" OR "Machine Learning" OR "LLM"
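A small helper can build these queries consistently. This is a sketch with hypothetical names: `limit` reflects the "3-5 keywords" guidance, and `host_only` is an assumed option covering the DeepMind example, which narrows the site: filter to the host rather than the full careers path.

```python
from urllib.parse import urlparse


def build_site_query(careers_url, keywords, limit=4, host_only=False):
    """Build a site:-restricted WebSearch query from a careers URL and keywords."""
    parsed = urlparse(careers_url)
    target = parsed.netloc if host_only else parsed.netloc + parsed.path.rstrip("/")
    # Quote each keyword so multi-word titles are matched as phrases.
    terms = " OR ".join(f'"{k}"' for k in keywords[:limit])
    return f"site:{target} {terms}"
```

For example, `build_site_query("https://openai.com/careers", ["Research Engineer", "Applied AI"])` yields `site:openai.com/careers "Research Engineer" OR "Applied AI"`.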
For each skipped company:
- Search: site:{careers_url} + positive keywords from Notion Preferences
- If a result URL points to a specific job (e.g., /jobs/12345, or known ATS pattern) → add directly to candidate store
- If a result URL is a landing page (e.g., /careers, /search?, /job-category/, or has no job-specific identifier) → run the Adaptive Career Page Crawl protocol below
- Do NOT have background agents write to candidate_store.json directly. Background agents lack file write permissions. Instead, return the array of candidates in the agent's result, and the main orchestrator (you) will append them to candidate_store.json after the agent completes.

Process skipped companies sequentially (one browser tab at a time). Chrome DevTools MCP cannot run multiple tabs in parallel for crawling.
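The routing decision between "add directly" and "crawl the landing page" could be approximated as below. The function name and the exact patterns are assumptions chosen to match the examples in this section (a numeric ID after /jobs/, a long numeric ID, or a UUID-style segment); they are not taken from the scripts.

```python
import re
from urllib.parse import urlparse

# Assumed job-specific identifiers: /jobs/<digits>, any 5+ digit ID, or a UUID prefix.
JOB_ID = re.compile(r"/jobs?/\d+|/\d{5,}|[0-9a-f]{8}-[0-9a-f]{4}-", re.IGNORECASE)
# Assumed landing-page markers taken from the list above.
LANDING = re.compile(r"/(job-category|search)(/|$)", re.IGNORECASE)


def is_specific_job_url(url):
    """True when the URL looks like one posting rather than a landing or category page."""
    path = urlparse(url).path
    if LANDING.search(path):
        return False
    return bool(JOB_ID.search(path))
```

URLs for which this returns False would go through the crawl protocol instead of being added to the candidate store.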
When a URL is a landing/category page, follow this protocol to extract individual job URLs using the site's own filters.
Step A — Open & Orient:
- Call mcp__chrome-devtools__new_page with the landing page URL
- Call mcp__chrome-devtools__take_snapshot to get the accessibility tree
- Look for searchbox, combobox, textbox with labels like "search", "keyword", "location", "country", "team", "category"

Step B — Apply Filters (Type A only):
Location filter (priority 1):
- Dropdown: click to open it, take_snapshot to see options, click "United States" / "US" / "USA" / "North America" (whichever available)
- Text field: fill with "United States", press Enter or select from autocomplete

Keyword filter (priority 2):
- fill with the top 2-3 positive keywords from Notion Preferences joined by space (e.g., "ML Engineer Research Scientist"). Do NOT use OR syntax; type natural terms.

Trigger search: Press Enter or click a "Search" / "Apply" / "Filter" button if visible. Wait 2-3 seconds for results.
Verify results: Call take_snapshot again.
Step C — Extract Job Links (Type A filtered + Type B):
Run this JavaScript via mcp__chrome-devtools__evaluate_script:
(() => {
  const links = Array.from(document.querySelectorAll('a[href]'));
  const seen = new Set();
  return links.filter(a => {
    const href = a.href;
    const path = new URL(href, location.origin).pathname;
    // Keep links with a job-like path segment, a long numeric ID, or a UUID prefix
    return /\/(jobs?|details|position|opening|apply|posting|role)\b/i.test(path)
      || /\/\d{5,}/.test(path)
      || /[0-9a-f]{8}-[0-9a-f]{4}-/.test(path);
  }).map(a => ({
    url: a.href,
    // Collapse runs of whitespace in the link text and cap its length
    title: a.textContent.trim().replace(/\s+/g, ' ').substring(0, 200)
  })).filter(j => j.title.length > 3 && !seen.has(j.url) && seen.add(j.url));
})()
- If evaluate_script fails or returns empty: fall back to reading the snapshot for links with job-like patterns
- Each candidate should be: {"company": "<company_name>", "role": "<title>", "url": "<url>", "source": "career_crawl"}

Step D — Handle Pagination (max 3 pages):
Step E — Handle Failures (Type C + errors):
- crawl failed: page did not load for {company}. Move to next company.
- crawl failed: login/CAPTCHA required for {company}. Move to next company.
- crawl failed: no job links detected on {url}. Move to next company.

After processing all skipped companies, print a summary:
Career page crawl: X companies attempted, Y succeeded (Z total jobs), W failed
Company: TechCorp (skipped, careers_url: https://careers.techcorp.com/engineering)
1. WebSearch: site:careers.techcorp.com "ML Engineer" OR "Research Scientist"
→ Returns https://careers.techcorp.com/engineering?team=ml (landing page)
2. Open page → take_snapshot
→ Finds: searchbox (uid: s42), location dropdown (uid: s78), 150 job listings
→ Classified as Type A (filter-capable)
3. Apply filters:
→ Click location dropdown (s78) → select "United States"
→ Fill search box (s42) with "ML Engineer"
→ Press Enter → wait 3s → take_snapshot
→ Now shows 12 results (down from 150)
4. Extract links:
→ evaluate_script returns 12 job objects with specific URLs
→ e.g., {url: "careers.techcorp.com/jobs/12345/senior-ml-engineer", title: "Senior ML Engineer"}
5. No page 2 → done
6. Append 12 candidates to candidate_store.json with source: "career_crawl"
Step 1.75 (MANDATORY): Jobright.ai scrape
Jobright.ai is a job aggregator with AI-powered matching. If the user has a Jobright tab open with filters applied, scrape it for additional job discoveries.
Call mcp__chrome-devtools__list_pages and look for a page URL containing jobright.ai
- If not found: print "Jobright: No tab found, skipping" and move to Step 2
- If found: call mcp__chrome-devtools__select_page to switch to it

Auto-refresh the page to get the latest listings. Call mcp__chrome-devtools__navigate_page with the current Jobright URL (e.g., https://jobright.ai/jobs/recommend). Then call mcp__chrome-devtools__wait_for with text ["Recommended", "APPLY"] and timeout 10000ms to ensure the page has fully loaded.
All extraction functions are in skills/job-scan/scripts/jobright_helpers/extract-jobright.mjs. Read the file and inject functions via evaluate_script.
Scroll to load all jobs. Call evaluate_script with the scrollAndCount() function from extract-jobright.mjs. Returns the total number of job cards loaded.
Extract all job cards. Call evaluate_script with the extractJobs() function. Returns array of {title, company, jobrightUrl, location}.
Resolve real ATS URLs. Call evaluate_script with resolveUrls(jobrightUrls) passing the array of Jobright URLs. This fetches each info page in-browser (where auth cookies exist) and regex-matches real ATS URLs (Greenhouse, Ashby, Lever, Workday, Personio, etc.).
For any greenhouse-embed:{token} results, call resolveGhEmbed(slug, token) to resolve via the Greenhouse boards API. Derive slug from company name: lowercase, remove spaces/special chars (e.g., "Anduril Industries" → "andurilindustries").
Merge and save. Combine extracted jobs with resolved URLs. Each job should have: title, company, url (real ATS URL or Jobright fallback), location. Save as skills/job-scan/jobright_raw.json.
Run the processing script to normalize and append to candidate store:
python3 skills/job-scan/scripts/resolve_jobright.py
Print the summary from the script output.
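The slug rule used for resolveGhEmbed in the steps above (lowercase, remove spaces and special characters) is simple enough to show directly. This is a Python illustration of that rule with a hypothetical function name; the helper itself lives in extract-jobright.mjs and may handle edge cases differently.

```python
import re


def company_slug(name):
    """Lowercase the company name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())
```

Note that a derived slug is only a guess at the company's Greenhouse board token; a company whose board uses a different token will not resolve this way.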
Step 2 (MANDATORY): WebSearch discovery for broad queries:
Load search queries from Notion by running:
python3 scripts/notion/page_preferences.py --search-queries --pretty
For each query:
- Do NOT have background agents write to candidate_store.json directly. Background agents lack file write permissions. Instead, return the array of candidates in the agent's result, and the main orchestrator (you) will append them to candidate_store.json after the agent completes.

Each candidate should be: {"company": "...", "role": "...", "url": "...", "source": "web_search"}
Use a background agent to run all broad discovery queries in parallel for speed. After the agent completes, parse its returned JSON and append to candidate_store.json yourself.
IMPORTANT — Background Agent File Write Pattern: Background agents cannot write files (permissions are denied automatically). Always instruct background agents to:
- Return the array of candidates in their result
- Leave appending to candidate_store.json to the main orchestrator

Never instruct a background agent to write/edit/append to candidate_store.json directly.
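On the orchestrator side, the append step might look like the following sketch. It assumes candidate_store.json holds a flat JSON array and that the agent returns its candidates as a JSON array string; the function name is hypothetical.

```python
import json
from pathlib import Path


def append_candidates(store_path, agent_result):
    """Append candidates returned by a background agent to the staging file."""
    new_candidates = json.loads(agent_result)  # agent output: JSON array of candidates
    path = Path(store_path)
    existing = json.loads(path.read_text()) if path.exists() else []
    existing.extend(new_candidates)
    path.write_text(json.dumps(existing, indent=2))
    return len(new_candidates)
```

Because only the orchestrator touches the file, parallel agents cannot clobber each other's writes.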
Step 2.5 (MANDATORY): Add user-input jobs from Notion Preferences:
Load manually added job URLs:
python3 scripts/notion/page_preferences.py --user-input-jobs --pretty
For each URL listed, append to skills/job-scan/candidate_store.json with:
{"company": "(extract from URL or page title)", "role": "(extract from page title)", "url": "...", "source": "user_input"}
This allows you to paste job URLs directly into the Notion Preferences page under "User Input Jobs" and have them flow through the same pipeline.
Step 3 (MANDATORY): Title filter, dedup, liveness check, and upload:
python3 skills/job-scan/scripts/dedup_liveness_upload.py skills/job-scan/candidate_store.json
This script runs four steps in sequence: title filter, dedup, liveness check, and upload to Notion.
NEVER use --skip-liveness — dead URLs will get uploaded to Notion.
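The ordering of the stages can be sketched as a funnel. This is not the code of dedup_liveness_upload.py: the function name and the four injected callables are hypothetical stand-ins for the Notion lookups and the Playwright checker.

```python
def run_pipeline(candidates, title_ok, is_known, is_expired, upload):
    """Apply title filter, dedup, and liveness in order, then upload the survivors."""
    passed = [c for c in candidates if title_ok(c["role"])]
    fresh = [c for c in passed if not is_known(c["url"])]
    # Liveness runs last so the slow browser check only sees filtered, new URLs.
    live = [c for c in fresh if not is_expired(c["url"])]
    upload(live)
    return {
        "loaded": len(candidates),
        "after_title": len(passed),
        "after_dedup": len(fresh),
        "uploaded": len(live),
    }
```

Running the cheap filters before the liveness check keeps the number of pages opened in the browser as small as possible.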
python3 skills/job-scan/scripts/scout_specials.py --company anthropic
python3 skills/job-scan/scripts/scout_specials.py --dry-run
node skills/job-scan/scripts/liveness_helpers/check-liveness.mjs https://job-boards.greenhouse.io/company/jobs/123
Each new offer is added as a row in the Notion applications database with:
- Company_Name, Role, URL, Date populated
- Status = "Scanned"

Portal Scan — 2026-04-15 ← scout_specials.py output
━━━━━━━━━━━━━━━━━━━━━━━━━━
Companies scanned: 14
Total jobs found: 2103
Filtered: 2089 removed ← hours filter only (default: 24h)
Intra-scan dupes: 0 skipped
New offers added: 14
Jobright: 8 jobs processed, 7 resolved to ATS URLs, 1 fallback ← resolve_jobright.py output
Loaded 122 candidates from candidate_store.json ← dedup_liveness_upload.py output
Title filter: 27 positive, 31 negative keywords
After title filter: 107 pass, 15 filtered out
After dedup: 97 new, 10 duplicates
Liveness: 48 expired, 49 active/uncertain pass through
Uploaded 49 jobs to Notion (status: Scanned)
→ Run job-eval on new offers to score them.
skills/job-scan/
├── SKILL.md # This file
├── candidate_store.json # Staging file for collect → filter → dedup → liveness → upload
├── jobright_raw.json # Temporary: raw Jobright extraction (deleted after resolve)
└── scripts/
    ├── scout_specials.py # Portal scanner orchestrator
    ├── resolve_jobright.py # Jobright data normalizer + candidate store writer
    ├── dedup_liveness_upload.py # Title filter → dedup → liveness → upload to Notion
    ├── jobright_helpers/
    │   └── extract-jobright.mjs # Browser-side JS functions for Jobright extraction
    ├── api_helpers/
    │   ├── api_job_fetcher.py # Parallel fetch from ATS APIs
    │   ├── api_parsers.py # Board-specific parsers (Greenhouse/Ashby/Lever/Workday)
    │   └── api_resolver.py # URL → API endpoint resolver
    └── liveness_helpers/
        ├── check-liveness.mjs # Playwright URL liveness checker
        └── liveness-core.mjs # Shared liveness classification logic
scripts/notion/ # Shared Notion scripts (at repo root)
├── config.py # Loads all IDs from .env
├── notion_client.py # HTTP primitives
├── db_applications.py # Applications DB (add, query, update, dedup)
├── db_companies.py # Companies DB (load by category)
├── page_reader.py # Page block fetcher
└── page_preferences.py # Preferences parser (title filter, search queries)
npm install playwright && npx playwright install chromium (for liveness checks only)