skills/douyin-scraper/SKILL.md
Use when a user provides Douyin/抖音 links, v.douyin.com short links, asks to fetch 抖音 video text, likes, collections/favorites, video visual content, or wants to fill a Lark Base table from 抖音 links.
npx skillsauth add csfuwwc/md-skills douyin-scraper
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill to scrape public Douyin video links and backfill a Feishu (Lark) Base. Default to public, non-logged-in access. Use a logged-in Chrome profile only as a low-frequency fallback after user consent.
Use the dedicated Douyin browser profile and CDP endpoint:
- CDP endpoint: http://[::1]:9222
- Profile directory: $HOME/Library/Application Support/Google/DouyinChrome
- Launcher: ~/.agents/social-browser-profiles/launch-social-chrome.sh douyin

Do not use the default Chrome profile for Douyin scraping or commenting. Do not share this port/profile with Xiaohongshu or Weibo jobs.
Scrape links read from standard input and emit JSON:
python3 ~/.agents/skills/douyin-scraper/scripts/scrape-douyin.py --json < urls.txt

Backfill a Lark Base view using public access:
node ~/.agents/skills/douyin-scraper/scripts/process-lark-douyin.mjs \
  --base-token <base_token> \
  --table-id <table_id> \
  --view-id <view_id> \
  --batch-size 10

Backfill with the logged-in Chrome profile as a fallback:
node ~/.agents/skills/douyin-scraper/scripts/process-lark-douyin.mjs \
  --base-token <base_token> \
  --table-id <table_id> \
  --view-id <view_id> \
  --browser-confirm \
  --chrome-user-data-dir "$HOME/Library/Application Support/Google/DouyinChrome" \
  --batch-size 3

Do not wrap the scraper in a shell loop. Batch mode must reuse one browser context for the whole batch and should write each row back to Lark immediately after that row's scrape result is produced.
When --browser-confirm is used, still run public access first; retry only the failed rows with the logged-in Chrome profile.
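The batch rules above (write each row back as soon as its result exists, try public access first, retry only failed rows with the logged-in profile) can be sketched as follows. This is a minimal illustration, not the logic of process-lark-douyin.mjs: `run_batch`, the three callables, and the `{"ok": ...}` result shape are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, List


def run_batch(
    rows: Iterable[dict],
    scrape_public: Callable[[dict], dict],
    scrape_logged_in: Callable[[dict], dict],
    write_back: Callable[[dict, dict], None],
    browser_confirm: bool = False,
) -> List[dict]:
    """Public pass over every row, then an optional logged-in retry of failures.

    Returns the rows that are still failing after both passes.
    """
    failed = []
    for row in rows:
        result = scrape_public(row)  # hypothetical result shape: {"ok": bool, ...}
        if result.get("ok"):
            write_back(row, result)  # write immediately, not at the end of the batch
        else:
            failed.append(row)

    if not browser_confirm:
        return failed

    still_failed = []
    for row in failed:  # only rows that failed public access are retried
        result = scrape_logged_in(row)
        if result.get("ok"):
            write_back(row, result)
        else:
            still_failed.append(row)
    return still_failed
```

Both passes are expected to run inside one browser context owned by the caller; the sketch only shows the ordering of scrape and write-back.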
For repeated visible-browser diagnostics or fallback, prefer a long-lived Chrome opened with remote debugging and pass --cdp-url 'http://[::1]:9222'. Add --cdp-only when the user explicitly wants all rows to use that visible browser. Do not repeatedly open and close Chrome.
Important: login state belongs to the Chrome profile that is open. A cookie file saved by another helper or a different --user-data-dir is not the same as the visible CDP browser profile. When using --cdp-url, do not inject saved cookie files into that browser context; trust the live profile's own cookies. If a new CDP profile is used, the user may need to log in there once, then keep that browser open.
If direct navigation in the first tab triggers login friction but manually opening a link in a new tab works, use --new-tab with --cdp-url. This creates a dedicated tab inside the same live Chrome profile and leaves the user's existing login tab untouched.
In CDP mode, --new-tab means "use the reusable worker tab", not "create a fresh tab every run". The scraper finds a tab whose window.name is codex-douyin-worker; if none exists, it creates one and then keeps reusing it by replacing the URL. Use --worker-tab-name <name> only when intentionally running a separate isolated Douyin worker.
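The reusable worker tab lookup described above might look roughly like this. It is a sketch under assumptions: `context` is taken to be a Playwright-style browser context (exposing `pages` and `new_page()`, with pages exposing `evaluate()`), and `get_worker_tab` is a hypothetical helper, not a function from the scraper.

```python
WORKER_TAB_NAME = "codex-douyin-worker"


def get_worker_tab(context, name: str = WORKER_TAB_NAME):
    """Return the tab tagged with window.name == name, creating it once if absent."""
    for page in context.pages:
        if page.evaluate("() => window.name") == name:
            return page  # reuse: the caller navigates this tab to the next URL
    page = context.new_page()
    # Tag the new tab so later runs find it instead of opening another one.
    page.evaluate("(n) => { window.name = n; }", name)
    return page
```

Passing a different `name` corresponds to running a separate isolated worker, as with --worker-tab-name.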
For logged-in comment execution, use the same long-lived CDP browser and add --comment. This posts only when all conditions are true: 抓取状态=保持抓取, 生成评论 is non-empty, and 评论状态 is blank or 准备评论. Never comment rows already marked 评论成功, 取消评论, or 评论失败.
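The three posting conditions can be expressed as a single predicate. A minimal sketch, assuming each row is available as a dict keyed by the Base field names; `should_comment` is a hypothetical helper name.

```python
def should_comment(row: dict) -> bool:
    """True only when all three posting conditions from the rule above hold."""
    scrape_status = (row.get("抓取状态") or "").strip()
    comment_text = (row.get("生成评论") or "").strip()
    comment_status = (row.get("评论状态") or "").strip()
    return (
        scrape_status == "保持抓取"
        and comment_text != ""
        # 评论成功, 取消评论, and 评论失败 all fall outside this set.
        and comment_status in ("", "准备评论")
    )
```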
The Base entrypoint uses a local platform lock at /tmp/social-scraper-locks/douyin.lock; do not run two Douyin jobs at the same time from different sessions.
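One common way to implement a per-platform lock like this is an advisory file lock. The sketch below assumes `flock` semantics on a POSIX system; the source does not say how the entrypoint actually takes the lock, and `platform_lock` is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/tmp/social-scraper-locks/douyin.lock"


@contextmanager
def platform_lock(path: str = LOCK_PATH):
    """Hold an exclusive, non-blocking advisory lock for the duration of a job."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        raise RuntimeError(f"another Douyin job already holds {path}")
    try:
        yield
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()
```

A second session that enters `platform_lock()` while the first is running fails fast instead of queueing, which matches the rule that two Douyin jobs must not overlap.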
Use 抓取状态 (scrape status) as the control plane:
- Blank: treat as 准备抓取.
- 准备抓取 (ready to scrape): first scrape. Fetch 正文 (body), 点赞数 (likes), 收藏数 (favorites), and 抓取时间 (scrape time). For videos, add visual-frame analysis to 正文.
- 保持抓取 (keep scraping): refresh only 点赞数, 收藏数, and 抓取时间. Do not fetch or overwrite 正文 by default.
- 抓取异常 (scrape error): confirmed deleted, unavailable, non-existent, login-blocked after allowed fallback, or unsafe to retry.
- 停止抓取 (stop scraping): manual or rule-based stop state for rows that no longer need engagement tracking.

Default runs process rows whose 抓取状态 is blank, 准备抓取, or 保持抓取. On successful 准备抓取, write body and engagement fields, then set 抓取状态=保持抓取. On successful 保持抓取, update only engagement fields and 抓取时间. On confirmed terminal failure, set 抓取状态=抓取异常, write 抓取时间, and preserve existing useful values. 停止抓取 is not set automatically unless a separate stop-tracking rule is explicitly enabled.
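The status transitions above amount to a small state machine. The following sketch returns the fields to write for one row; `plan_update` is a hypothetical helper, and leaving the row untouched on a non-terminal failure is an assumption the source does not spell out.

```python
from typing import Optional

PROCESSED_BY_DEFAULT = ("", "准备抓取", "保持抓取")


def plan_update(status: Optional[str], scraped: Optional[dict],
                terminal_failure: bool, now: str) -> dict:
    """Return the field updates for one row; an empty dict means write nothing."""
    status = (status or "").strip()
    if status not in PROCESSED_BY_DEFAULT:
        return {}  # 抓取异常 and 停止抓取 rows are skipped by default runs
    if terminal_failure:
        # Only status and time are written, so existing useful values survive.
        return {"抓取状态": "抓取异常", "抓取时间": now}
    if scraped is None:
        return {}  # assumption: a transient failure leaves the row for a later run
    update = {
        "点赞数": scraped["点赞数"],
        "收藏数": scraped["收藏数"],
        "抓取时间": now,
    }
    if status in ("", "准备抓取"):
        update["正文"] = scraped["正文"]  # body is written on the first scrape only
        update["抓取状态"] = "保持抓取"
    return update
```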
Use 评论状态 (comment status) as a separate operations queue:
- 准备评论 (ready to comment): selected by threshold or manual review and waiting for comment execution.
- 取消评论 (comment cancelled): manually decided not to comment.
- 评论成功 (comment posted): comment has been posted.
- 评论失败 (comment failed): a comment attempt failed and is not currently being retried.

The table does not need a 评论方式 (comment method) field unless platform behavior becomes mixed. Keep 评论状态 blank by default; use 准备评论 only when a row has been selected for commenting. Douyin comments can be posted automatically from the logged-in CDP browser with:
node ~/.agents/skills/douyin-scraper/scripts/process-lark-douyin.mjs \
--base-token <base_token> \
--table-id <table_id> \
--view-id <view_id> \
--cdp-url 'http://[::1]:9222' \
--cdp-only \
--new-tab \
--worker-tab-name codex-douyin-worker \
--comment \
--batch-size 3
Comment automation uses page DOM controls inside #comment-input-container, fills .public-DraftEditor-content[contenteditable=true], clicks the publish control inside the same container, then verifies the comment text appears before writing 评论成功. On failure it writes 评论失败. Do not use coordinate clicks in production batches.
- 正文 (body): original Douyin caption/title first. For videos, append:
  【视频内容解析】
  ...
- 点赞数 (likes): Douyin digg_count when available.
- 收藏数 (favorites): Douyin collect_count when available.
- The visible action bar order is 点赞数 / 评论数 / 收藏数 / 分享 (likes / comments / favorites / shares); do not treat the second visible number as 收藏数 when three metrics are present.
- If the page shows 赞 or 收藏 with no number, write 0 for that metric.
- 抓取时间 (scrape time): use local datetime rounded to minutes, e.g. YYYY-MM-DD HH:mm:00.

For 准备抓取 video rows:
1. Extract /video/<aweme_id> from the resolved page URL and use only the matching aweme record's video.play_addr/download_addr stream URL.
2. Use ffmpeg to create a contact sheet from representative frames.
3. Use mmx vision describe --region global when available to describe visible captions, scene, product/person/place details, and narrative.

Do not use the first network-captured douyinvod response as the video source. In a reused browser page it can belong to the previous item, recommendation feed, or preloaded media. If a current-aweme-bound video URL is unavailable, skip video analysis and still write the caption/engagement values.
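Binding the video source to the current aweme could be sketched as below. The record layout (an `aweme_id` key, and a `url_list` array under `play_addr`/`download_addr`) is an assumption about the captured JSON, and both function names are hypothetical.

```python
import re
from typing import List, Optional

AWEME_ID_RE = re.compile(r"/video/(\d+)")


def aweme_id_from_url(resolved_url: str) -> Optional[str]:
    """Pull the numeric id out of a resolved /video/<aweme_id> page URL."""
    match = AWEME_ID_RE.search(resolved_url)
    return match.group(1) if match else None


def pick_video_url(resolved_url: str, aweme_records: List[dict]) -> Optional[str]:
    """Return a stream URL only from the record matching the current page."""
    aweme_id = aweme_id_from_url(resolved_url)
    if aweme_id is None:
        return None
    for record in aweme_records:
        if str(record.get("aweme_id")) != aweme_id:
            continue  # previous item, feed recommendation, or preloaded media
        video = record.get("video") or {}
        for key in ("play_addr", "download_addr"):
            urls = (video.get(key) or {}).get("url_list") or []  # assumed layout
            if urls:
                return urls[0]
    return None  # caller skips video analysis and still writes caption/engagement
```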
Do not claim audio transcript accuracy unless an ASR tool actually processed the audio.
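The field rules for 点赞数, 收藏数, and 抓取时间 given earlier can be illustrated with a small normalizer for the visible action bar. This is a sketch under assumptions: the helper names are hypothetical, the 万 (×10,000) suffix handling reflects how large counts are commonly abbreviated in Chinese interfaces rather than anything stated in the source, and "rounded to minutes" is interpreted here as dropping the seconds.

```python
import re
from datetime import datetime
from typing import List

ACTION_BAR_ORDER = ("点赞数", "评论数", "收藏数", "分享")
COUNT_RE = re.compile(r"(\d+(?:\.\d+)?)(万)?")


def parse_count(label: str) -> int:
    """Turn one action-bar label into an integer; a bare label counts as 0."""
    match = COUNT_RE.search(label)
    if match is None:
        return 0  # e.g. "赞" or "收藏" shown with no number
    number = float(match.group(1))
    if match.group(2):
        number *= 10_000  # assumed abbreviation: 1.2万 means 12000
    return int(round(number))


def engagement_from_action_bar(labels: List[str]) -> dict:
    """Map labels by position so the second number (评论数) is never read as 收藏数."""
    by_name = dict(zip(ACTION_BAR_ORDER, labels))
    return {
        "点赞数": parse_count(by_name.get("点赞数", "")),
        "收藏数": parse_count(by_name.get("收藏数", "")),
    }


def format_scrape_time(moment: datetime) -> str:
    """Format a local datetime as YYYY-MM-DD HH:mm:00 (seconds dropped)."""
    return moment.strftime("%Y-%m-%d %H:%M:00")
```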
Read references/strategy.md before changing status rules, browser fallback, video parsing, or Base write semantics.