tusosos/article-extractor

Name: article-extractor
Author: tusosos

skills/article-extractor/SKILL.md

npx skillsauth add tusosos/manus-knowledge-base article-extractor

Article Extractor

This skill extracts the main content from web articles and blog posts, removes navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

When to Use This Skill

Activate when the user:

Provides an article/blog URL and wants the text content
Asks to "download this article"
Wants to "extract the content from [URL]"
Asks to "save this blog post as text"
Needs clean article text without distractions

How It Works

Priority Order:

1. Check if tools are installed (reader or trafilatura)
2. Download and extract article using best available tool
3. Clean up the content (remove extra whitespace, format properly)
4. Save to file with article title as filename
5. Confirm location and show preview
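
The tool check in the first step can be sketched in Python, assuming the same preference order the skill describes (reader, then trafilatura, then the curl fallback). The name pick_tool and its injectable which parameter are illustrative and not part of the skill.

```python
import shutil

# Preference order taken from the skill text; "fallback" means curl + basic parsing.
PREFERRED_TOOLS = ("reader", "trafilatura")


def pick_tool(which=shutil.which):
    """Return the first installed extraction tool, or 'fallback' if none is found."""
    for tool in PREFERRED_TOOLS:
        if which(tool) is not None:
            return tool
    return "fallback"
```

Calling pick_tool() with no arguments checks the current PATH; passing a custom which function makes the choice easy to exercise without installing anything.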

Installation Check

Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)

command -v reader

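If not installed (package names are as listed by the skill; confirm that the one you install provides a reader binary):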

npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli

Option 2: trafilatura (Python-based, very good)

command -v trafilatura

If not installed:

pip3 install trafilatura

Option 3: Fallback (curl + simple parsing)

If no tools are available, use basic curl + text extraction (less reliable, but it works without extra installs)

Extraction Methods

Method 1: Using reader (Best for most articles)

# Extract article
reader "URL" > article.txt

Pros:

Based on Mozilla's Readability algorithm
Excellent at removing clutter
Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)

# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt

Pros:

Very accurate extraction
Good with various site structures
Handles multiple languages

Options:

--no-comments: Skip comment sections
--no-tables: Skip data tables
--precision: Favor precision over recall
--recall: Extract more content (may include some noise)
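
As a rough sketch, the options above can be assembled into a command line from Python. Only flags named in this document are used; the function name trafilatura_command and its parameters are hypothetical helpers, and treating precision and recall as a single either-or mode is a simplifying assumption.

```python
def trafilatura_command(url, no_comments=False, no_tables=False, mode=None):
    """Build the argv list for a plain-text trafilatura extraction.

    mode may be None, 'precision', or 'recall' (assumed to be used one at a time).
    """
    if mode not in (None, "precision", "recall"):
        raise ValueError("mode must be None, 'precision', or 'recall'")
    cmd = ["trafilatura", "--URL", url, "--output-format", "txt"]
    if no_comments:
        cmd.append("--no-comments")
    if no_tables:
        cmd.append("--no-tables")
    if mode:
        cmd.append(f"--{mode}")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, capture_output=True, text=True) once trafilatura is installed; building a list rather than a string avoids shell quoting problems with unusual URLs.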

Method 3: Fallback (curl + basic parsing)

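# Download the page (following redirects) and extract basic content
curl -sL "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
        self.skip_depth = 0     # greater than 0 while inside a skipped element
        self.content_depth = 0  # greater than 0 while inside a content element
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.content_depth += 1

    def handle_endtag(self, tag):
        # Leave the element so that later text is judged on its own
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.content_tags and self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        # Keep text only inside content elements and outside skipped ones
        if self.content_depth and not self.skip_depth and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt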

Note: This is less reliable but works without dependencies.

Getting Article Title

Extract title for filename:

Using reader:

# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

Using trafilatura:

# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")

Using curl (fallback):

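# Requires GNU grep for -P (not available in BSD/macOS grep); curl -L follows redirects
TITLE=$(curl -sL "URL" | grep -oP '<title>\K[^<]+' | head -n 1 | sed 's/ - .*//' | sed 's/ | .*//')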
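
Where GNU grep is not available, the same title cleanup can be approximated with the Python standard library. This is a minimal sketch, not the skill's own method: the name title_from_html is hypothetical, and the default of "Article" is an assumption borrowed from the trafilatura branch of the workflow.

```python
import html
import re

# Matches the first <title> element, allowing attributes and line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_html(page, default="Article"):
    """Return the page title with a trailing ' - Site' or ' | Site' suffix removed."""
    match = TITLE_RE.search(page)
    if not match:
        return default
    # Collapse whitespace and decode entities such as &amp;
    title = html.unescape(" ".join(match.group(1).split()))
    # Cut at the first ' - ' or ' | ', mirroring the two sed expressions
    title = re.split(r" [-|] ", title, maxsplit=1)[0].strip()
    return title or default
```

A regular expression is enough here because only the title element is needed; full HTML parsing is left to the extraction tools.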

Filename Creation

Clean title for filesystem:

# Get title
TITLE="Article Title from Website"

# Clean for filesystem: replace / : | with -, delete ? " < >, limit length
# (tr with an empty second set is an error, so deletions use tr -d)
FILENAME=$(printf '%s' "$TITLE" | sed 's#[/:|]#-#g' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
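
The same cleanup rules can be written as a small Python function, shown here as a sketch. The name safe_filename, the default of "article" for empty titles, and stripping spaces at both ends are assumptions added for illustration; the character rules and the 100-character limit come from the text above.

```python
# Replace / : | with '-', and delete ? " < > entirely.
_FILENAME_TABLE = str.maketrans("/:|", "---", '?"<>')


def safe_filename(title, max_len=100, ext=".txt", default="article"):
    """Turn an article title into a filesystem-friendly filename."""
    cleaned = title.translate(_FILENAME_TABLE)
    trimmed = cleaned[:max_len].strip()
    # Avoid producing a bare extension such as '.txt' when nothing is left
    return (trimmed or default) + ext
```

For example, safe_filename('What is "AI"? A/B: test | Blog') gives 'What is AI A-B- test - Blog.txt'.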

Complete Workflow

ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;

    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;

    fallback)
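        # Get title (grep -P requires GNU grep; curl -L follows redirects)
        TITLE=$(curl -sL "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)

        # Get content (basic extraction)
        curl -sL "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
        self.skip_depth = 0     # greater than 0 while inside a skipped element
        self.content_depth = 0  # greater than 0 while inside a content element
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.content_depth += 1

    def handle_endtag(self, tag):
        # Leave the element so that later text is judged on its own
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.content_tags and self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        # Keep text only inside content elements and outside skipped ones
        if self.content_depth and not self.skip_depth and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt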
        ;;
esac

# Clean filename (tr -d deletes characters; tr with an empty second set is an error)
FILENAME=$(printf '%s' "$TITLE" | sed 's#[/:|]#-#g' | tr -d '?"<>' | cut -c 1-80 | sed 's/^ *//; s/ *$//')
FILENAME="${FILENAME:-article}.txt"  # Fall back to a default name if the title is empty

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"

Error Handling

Common Issues

1. Tool not installed

Try alternate tool (reader → trafilatura → fallback)
Offer to install: "Install reader with: npm install -g reader-cli"

2. Paywall or login required

Extraction tools may fail
Inform user: "This article requires authentication. Cannot extract."

3. Invalid URL

Check URL format
Try with and without following redirects (curl -L follows them)

4. No content extracted

Site may use heavy JavaScript
Try fallback method
Inform user if extraction fails

5. Special characters in title

Clean title for filesystem
Remove: /, :, ?, ", <, >, |
Replace with - or remove

Output Format

Saved File Contains:

Article title (if available)
Author (if available from tool)
Main article text
Section headings
No navigation, ads, or clutter

What Gets Removed:

Navigation menus
Ads and promotional content
Newsletter signup forms
Related articles sidebars
Comment sections (optional)
Social media buttons
Cookie notices

Tips for Best Results

1. Use reader for most articles

Best all-around tool
Built on Readability, the library behind Firefox Reader View
Works on most news sites and blogs

2. Use trafilatura for:

Academic articles
News sites
Blogs with complex layouts
Non-English content

3. Fallback method limitations:

May include some noise
Less accurate paragraph detection
Better than nothing for simple sites

4. Check extraction quality:

Always show preview to user
Ask if it looks correct
Offer to try different tool if needed

Example Usage

Simple extraction:

# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"

With error handling:

if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
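
The same try-the-next-tool pattern can be sketched in Python. The function extract_with_fallback and the (name, callable) pairs it takes are hypothetical; in practice each callable might wrap a subprocess call to reader, trafilatura, or the curl-based parser. Treating empty output as a failure is an assumption that follows the "No content extracted" case above.

```python
def extract_with_fallback(url, extractors):
    """Try each (name, func) pair in order.

    Returns (name, text) for the first extractor that yields non-empty text,
    so the caller can tell the user which tool was used.
    """
    errors = []
    for name, func in extractors:
        try:
            text = func(url)
        except Exception as exc:  # a missing tool or failed download counts as a miss
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Collecting the per-tool errors makes the final message more useful than a generic failure, for example when one tool is missing and another returned nothing.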

Best Practices

✅ Always show preview after extraction (first 10 lines)
✅ Verify extraction succeeded before saving
✅ Clean filename for filesystem compatibility
✅ Try fallback method if primary fails
✅ Inform user which tool was used
✅ Keep filename length reasonable (< 100 chars)

After Extraction

Display to user:

"✓ Extracted: [Article Title]"
"✓ Saved to: [filename]"
Show preview (first 10-15 lines)
File size and location
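
The items displayed to the user can be gathered in one place, as in this sketch. The function name extraction_summary and the exact message layout are illustrative assumptions; the checkmark lines and the preview length follow the text above.

```python
from pathlib import Path


def extraction_summary(path, title, preview_lines=10):
    """Build the message shown after extraction: title, location, size, and a preview."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    preview = "\n".join(text.splitlines()[:preview_lines])
    return (
        f"✓ Extracted: {title}\n"
        f"✓ Saved to: {path.resolve()}\n"
        f"Size: {path.stat().st_size} bytes\n\n"
        f"Preview (first {preview_lines} lines):\n{preview}"
    )
```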

Ask if needed:

"Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
"Should I extract another article?"

tusosos/article-extractor

Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.

documentation

Updated Apr 21, 2026

npx skillsauth add tusosos/manus-knowledge-base article-extractor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%.
Semgrep: Static code analysis for vulnerabilities. Clean, 95%.
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%.
Snyk (dep): Open source security scanning. Skipped, 50%.
Socket.dev: Supply chain security analysis. Skipped, 50%.
VirusTotal: Multi-engine malware detection. Skipped, 50%.
CrowdStrike: Advanced threat intelligence. Skipped, 50%.
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%.
OWASP Dep-Check: Skipped, 50%.

Last scanned: Apr 1, 2026, 11:52 PM (53.3s, 1 file scanned)

SKILL.md

name: article-extractor
description: Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
allowed-tools: Bash,Write

Install manually

# Clone the repo
git clone https://github.com/tusosos/manus-knowledge-base.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r manus-knowledge-base/skills/article-extractor ~/.claude/skills/

Repository: tusosos/manus-knowledge-base

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT