skills/notebooklm-knowledge-base-organizer/SKILL.md
Use when preparing files for NotebookLM, organizing documents into a knowledge base, converting formats for NotebookLM compatibility, or reducing a large document collection to fit NotebookLM's 50-source limit. Scores and prioritizes sources, performs strategic merging (time-series, topic-based, format consolidation), converts unsupported formats (PPTX to PDF, XLSX to CSV), applies flat structure with descriptive snake_case names, and optimizes for RAG retrieval performance.
npx skillsauth add agmangas/agent-skills notebooklm-knowledge-base-organizer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Prepares files for optimal use in NotebookLM by intelligently selecting and consolidating sources, converting formats, organizing structure, and ensuring compatibility. The primary constraint is NotebookLM's 50-source limit per notebook. When collections exceed this limit, systematic scoring, prioritization, and strategic merging reduce source count without losing valuable information.
Supported:
Convert These:
Per Source:
Per Notebook (Free):
Prefer many smaller, focused documents over few large ones for better RAG retrieval. The 50-source limit is the primary optimization constraint.
IMPORTANT: Preserve original file timestamps during all operations. Timestamps
are essential for identifying the latest additions, recent meeting minutes, and
key decisions. Run touch -r original converted after each conversion. Include
dates in year-month-day order (YYYY_MM_DD, ISO 8601 ordering) in all filenames.
Prepare these files for NotebookLM - convert formats and organize with descriptive names
Convert all PPTX and XLSX files to NotebookLM-compatible formats
Check if any files exceed NotebookLM's 500k word or 200MB limits
Organize this research folder for a NotebookLM knowledge base
Find duplicate content across different file formats
Split this large PDF into NotebookLM-compatible chunks
When a user requests NotebookLM organization, follow these steps.
Count and evaluate before proceeding with any organization.
total_sources=$(find . -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.txt" -o -name "*.md" -o -name "*.csv" \) | wc -l)
echo "Total sources found: $total_sources"
If total exceeds 50:
Score all sources using the 4-dimension rubric (Relevance, Recency, Uniqueness, Density, each 0-10). See references/scoring-system.md for the full rubric, assessment commands, and batch scoring script.
Rank and select top candidates using the decision matrix. Target 35-40 auto-keep sources initially. See references/prioritization-strategy.md for the selection process and space-based adjustments.
Identify merge candidates -- find time-series patterns, topic clusters, and multi-format duplicates:
# Time-series opportunities
find . -name "*_20[0-9][0-9]_[0-9][0-9]_*" | \
sed 's/_20[0-9][0-9]_[0-9][0-9]_[0-9][0-9]//' | sort | uniq -c | sort -rn
# Topic clusters
find . -type f -name "*.pdf" | xargs -I {} basename {} .pdf | \
sed 's/_part_[0-9]*//;s/_[0-9][0-9]*$//' | sort | uniq -c | sort -rn | awk '$1 > 2'
Execute strategic merges using appropriate patterns. See references/merging-strategies.md for time-series, topic-based, and format consolidation scripts. Preserve timestamps on all merged outputs.
Recount and validate the final total is at or below 50 (ideally 48 to reserve slots for future additions).
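The scoring and selection steps above can be sketched in a few lines of shell. The real rubric lives in references/scoring-system.md and is not reproduced here, so this sketch assumes equal weighting (the four 0-10 scores are simply summed to a 0-40 total) and a hypothetical tab-separated input of name, relevance, recency, uniqueness, density. The function names are illustrative only.

```bash
#!/usr/bin/env bash
# rank_sources: read TSV rows (name, relevance, recency, uniqueness, density)
# on stdin and print "total<TAB>name", highest total first.
# Equal weighting is an assumption; the referenced rubric may weight differently.
rank_sources() {
  awk -F'\t' 'NF == 5 { printf "%d\t%s\n", $2 + $3 + $4 + $5, $1 }' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# top_sources N: keep only the N highest-scoring rows.
top_sources() {
  rank_sources | head -n "$1"
}
```

For example, `top_sources 40 < scores.tsv` would list the 40 strongest auto-keep candidates, leaving the remainder as merge or drop candidates.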
Ask clarifying questions:
Review files for NotebookLM compatibility:
find . -type f -exec file {} \;
find . -type f -exec du -h {} \; | sort -rh
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
for f in *.pdf; do pdftotext "$f" - | wc -w; done
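The per-source limits mentioned in this guide (500k words, 200MB) can be checked mechanically. The following is a minimal sketch for plain-text sources; it assumes 200MB means 200 * 1024 * 1024 bytes, and check_limits is a hypothetical helper name. For PDFs, the word count would come from pdftotext as in the loop above.

```bash
#!/usr/bin/env bash
# Limits as stated in this guide; the byte interpretation of "200MB" is an assumption.
MAX_WORDS=500000
MAX_BYTES=$((200 * 1024 * 1024))

# check_limits FILE: print OK or SPLIT with the measured word and byte counts.
check_limits() {
  local file=$1 words bytes
  words=$(wc -w < "$file" | tr -d ' ')
  bytes=$(wc -c < "$file" | tr -d ' ')
  if [ "$words" -gt "$MAX_WORDS" ] || [ "$bytes" -gt "$MAX_BYTES" ]; then
    echo "SPLIT $file ($words words, $bytes bytes)"
  else
    echo "OK $file ($words words, $bytes bytes)"
  fi
}
```

Running `for f in *.txt *.md; do check_limits "$f"; done` gives a quick list of files that need splitting before upload.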
Categorize findings:
PowerPoint to PDF:
soffice --headless --convert-to pdf *.pptx
touch -r original.pptx converted.pdf # Preserve timestamp
Excel to CSV:
soffice --headless --convert-to csv:"Text - txt - csv (StarCalc)":44,34,UTF8 *.xlsx
touch -r original.xlsx converted.csv # Preserve timestamp
Scanned PDF to Searchable:
ocrmypdf input.pdf output_searchable.pdf
touch -r input.pdf output_searchable.pdf # Preserve timestamp
pdftotext output_searchable.pdf - | wc -w # Verify text extraction
WARNING: Always run touch -r original converted after every conversion to preserve the original file timestamp.
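Because the touch -r step is easy to forget, one option is to wrap each conversion so the timestamp copy always follows it. This is a sketch, not the skill's own script; convert_keep_mtime is a hypothetical helper, and the soffice call in the comment mirrors the command shown above.

```bash
#!/usr/bin/env bash
# convert_keep_mtime SRC DST CMD...: run CMD (any converter) to produce DST,
# then copy SRC's modification time onto DST. Fails if CMD fails or DST is missing.
convert_keep_mtime() {
  local src=$1 dst=$2
  shift 2
  "$@" && [ -f "$dst" ] && touch -r "$src" "$dst"
}

# Example (requires LibreOffice, so it is left as a comment):
#   for f in *.pptx; do
#     convert_keep_mtime "$f" "${f%.pptx}.pdf" soffice --headless --convert-to pdf "$f"
#   done
```

Passing the converter as arguments keeps the helper independent of any one tool, so the same wrapper works for ocrmypdf or the XLSX conversion.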
Use this pattern: category_topic_descriptor_YYYY_MM_DD.ext
Examples:
research_quantum_computing_basics_2025.pdf
meeting_notes_project_kickoff_2026_01_15.txt
client_proposal_acme_corp_final.docx
reference_api_documentation_v2.md
data_sales_figures_q4_2025.csv
See references/organization-scripts.md for the automated naming script. Preserve timestamps when renaming: use mv (preserves by default) and verify with stat.
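As an illustration of the category_topic_descriptor_YYYY_MM_DD.ext pattern, the sketch below builds a name from free-text parts. It is not the script from references/organization-scripts.md; snake_case and build_name are hypothetical helpers, and the sketch assumes ASCII input.

```bash
#!/usr/bin/env bash
# snake_case TEXT: lowercase, collapse every run of non-alphanumerics into "_",
# and trim leading or trailing underscores.
snake_case() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/_/g; s/^_+//; s/_+$//'
}

# build_name CATEGORY TOPIC DESCRIPTOR DATE EXT
# DATE may be YYYY-MM-DD or YYYY_MM_DD; hyphens become underscores.
build_name() {
  printf '%s_%s_%s_%s.%s\n' \
    "$(snake_case "$1")" "$(snake_case "$2")" "$(snake_case "$3")" \
    "$(snake_case "$4")" "$5"
}
```

A rename would then look like `mv "Kickoff Notes.txt" "$(build_name meeting notes "project kickoff" 2026-01-15 txt)"`.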
For files >500k words or >200MB:
pdftotext document.pdf - | wc -w # Check word count
pdftk large.pdf cat 1-500 output large_part_1.pdf
pdftk large.pdf cat 501-1000 output large_part_2.pdf
touch -r large.pdf large_part_1.pdf large_part_2.pdf # Preserve timestamps
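For documents with more than two parts, the pdftk commands above can be generated rather than typed. The following dry-run sketch only prints the commands; split_plan is a hypothetical helper, the page total is assumed to be known already (for example from pdfinfo), and file names are assumed to contain no spaces.

```bash
#!/usr/bin/env bash
# split_plan PDF TOTAL_PAGES CHUNK: print one pdftk command per CHUNK-page range.
# Nothing is executed; review the output before running it.
split_plan() {
  local pdf=$1 total=$2 chunk=$3
  local base=${pdf%.pdf} start=1 end part=1
  while [ "$start" -le "$total" ]; do
    end=$((start + chunk - 1))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    echo "pdftk $pdf cat $start-$end output ${base}_part_${part}.pdf"
    start=$((end + 1))
    part=$((part + 1))
  done
}
```

The generated part_N names are placeholders; rename each part after its content once the page ranges are settled.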
Name parts by content, not arbitrary numbers:
annual_report_2025_part_1_executive_summary.pdf
annual_report_2025_part_2_financials.pdf
annual_report_2025_part_3_appendices.pdf
Perform strategic merging to optimize source count. This step is critical when merge candidates were identified in Step 1 or the collection is near the 50-source limit.
Merging is a primary optimization strategy, not a last resort. Three patterns apply:
See references/merging-strategies.md for full merge patterns, scripts (time-series merger, topic-based PDF merger), decision trees, and quality checks.
IMPORTANT: Preserve chronological timestamps in merged content. Add clear date headers within merged files so temporal context is not lost.
Log all merge decisions for inclusion in the organization plan.
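A time-series merge with date headers might look like the sketch below. It is not the merger from references/merging-strategies.md; merge_time_series is a hypothetical helper that assumes plain-text inputs sharing one name prefix, each with a YYYY_MM_DD date in its file name, so that sorting by name gives chronological order.

```bash
#!/usr/bin/env bash
# merge_time_series OUT FILE...: concatenate dated text files into OUT, oldest
# first, writing a date header before each one.
merge_time_series() {
  local out=$1 newest
  shift
  : > "$out"
  printf '%s\n' "$@" | sort | while IFS= read -r f; do
    base=$(basename "$f")
    # Pull YYYY_MM_DD out of the file name and show it as YYYY-MM-DD.
    date=$(printf '%s\n' "$base" |
      sed -nE 's/.*(20[0-9]{2})_([0-9]{2})_([0-9]{2}).*/\1-\2-\3/p')
    printf '===== %s (%s) =====\n' "${date:-undated}" "$base" >> "$out"
    cat "$f" >> "$out"
    printf '\n' >> "$out"
  done
  # Give the merged file the timestamp of the latest input (by name order).
  newest=$(printf '%s\n' "$@" | sort | tail -n 1)
  touch -r "$newest" "$out"
}
```

Usage would be along the lines of `merge_time_series meeting_notes_2026_q1.txt meeting_notes_2026_0[1-3]_*.txt`, which turns a quarter of meeting notes into one source while keeping each meeting's date visible.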
NotebookLM works best with flat source lists, no nested folders.
Before:
docs/
project/
planning/
requirements.pdf
research/
background.pdf
reference/
api_docs.pdf
After:
notebooklm_sources/
project_requirements_2026.pdf
project_background_research.pdf
reference_api_documentation.pdf
See references/organization-scripts.md for the implementation script. Preserve timestamps when copying: use cp -p to maintain original dates.
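The flattening step can be sketched as follows. This is not the referenced implementation script; flatten_tree is a hypothetical helper that derives each flat name mechanically from the relative path, so the descriptive renaming shown in the After listing would still be a separate pass. It assumes the destination folder sits outside the source tree.

```bash
#!/usr/bin/env bash
# flatten_tree SRC DST: copy every file under SRC into the single folder DST.
# Each new name is the relative path with "/" replaced by "_", so
# SRC/project/planning/requirements.pdf becomes project_planning_requirements.pdf.
flatten_tree() {
  local src=${1%/} dst=$2
  mkdir -p "$dst"
  find "$src" -type f | while IFS= read -r path; do
    rel=${path#"$src"/}
    flat=$(printf '%s' "$rel" | tr '/' '_')
    if [ -e "$dst/$flat" ]; then
      echo "skip (name collision): $path" >&2
    else
      cp -p "$path" "$dst/$flat"   # -p keeps the original modification time
    fi
  done
}
```

Copying rather than moving leaves the original tree intact, which makes it easy to compare counts before deleting anything.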
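# Exact duplicates: prints every copy after the first (on macOS, use md5 -r in place of md5sum)
find . -type f -exec md5sum {} + | sort | awk 'seen[$1]++'
# Same base name in different formats (-printf requires GNU find)
find . -type f -printf '%f\n' | sed 's/\.[^.]*$//' | sort | uniq -d
# PDFs whose extracted text is identical: hash and file name share one line so sorting keeps them paired
for pdf in *.pdf; do printf '%s  %s\n' "$(pdftotext "$pdf" - | md5sum | cut -d' ' -f1)" "$pdf"; done | sort | awk 'seen[$1]++'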
Decision matrix:
NotebookLM uses RAG, which works best with focused documents:
Instead of:
company_handbook_500_pages.pdf
Create:
handbook_code_of_conduct.pdf
handbook_benefits_overview.pdf
handbook_time_off_policy.pdf
handbook_remote_work_guidelines.pdf
handbook_career_development.pdf
Present a plan to the user before making changes. The plan should cover current state, source selection strategy (if >50 sources), proposed structure, changes to make, and a compatibility check.
See references/organization-plan-template.md for the full template with sections for prioritization results, merge decisions, and final source count verification.
After user approval, execute all conversions, merges, renames, and structural changes. Log all operations.
See references/organization-scripts.md for the complete execution script with logging and limit verification. Run touch -r after every file operation to preserve original timestamps.
Provide the user with a summary of organized sources and upload instructions for NotebookLM (direct upload and Google Drive options).
See references/upload-guide.md for the full upload instructions template including maintenance guidance.
User: "Prepare my PhD research papers folder for NotebookLM"
Process:
smith_2024.pdf to research_quantum_entanglement_smith_2024.pdf
phd_research_sources/
User: "Convert our company wiki exports to NotebookLM format"
Split single 145-page PDF by section into 7 focused sources:
company_overview_history_mission.pdf (8 pages)
company_policies_hr_guidelines.pdf (28 pages)
company_product_documentation.pdf (45 pages)
Result: 7 focused sources instead of 1 large doc. Better RAG retrieval.
User: "I have 10 Excel files with research data"
Convert each sheet to separate CSV. Name descriptively: data_survey_responses_2025.csv. Create overview doc: data_overview_methodology.txt. Preserve timestamps on all conversions.
Result: 10 XLSX to 23 CSV files + 1 overview doc.
User: "Organize my conference materials for a knowledge base"
Input: 12 MP3 recordings, 8 PPTX decks, 15 JPG notes, 5 PDFs. Keep MP3 as-is (NotebookLM transcribes on upload). Convert PPTX to PDF. Keep JPGs (NotebookLM reads handwriting via OCR). Apply naming: conf_session_title_speaker_date.ext. Preserve all timestamps.
Result: 40 sources in flat folder.
For a complete workflow handling 200+ sources (e.g., reducing 237 sources to 48 with strategic merging), see references/large-collection-workflow.md.
research_[topic]_[author]_[year].pdf
notes_[course]_[topic]_[date].md
textbook_[subject]_chapter_[n]_[title].pdf
project_[name]_requirements.pdf
project_[name]_timeline.csv
meeting_[project]_[date]_notes.txt
client_[name]_proposal_final.docx
course_[name]_lecture_[n]_[topic].pdf
course_[name]_readings_week_[n].pdf
course_[name]_assignment_[n].docx
article_[topic]_[author]_[date].pdf
book_notes_[title]_[author].md
tutorial_[skill]_[topic].pdf
reference_[tool]_documentation.pdf
Optimize for Search: Use descriptive names with search keywords.
Good: tutorial_python_async_programming_advanced.pdf.
Bad: tutorial_5.pdf.
Topic-Based Splitting: Split large docs by topic, not arbitrary page count.
Good: handbook_benefits.pdf, handbook_policies.pdf.
Bad: handbook_part_1.pdf, handbook_part_2.pdf.
Date Formatting: Use ISO 8601 ordering (year, month, day), written as YYYY_MM_DD in filenames, for sortability.
Good: meeting_notes_2026_02_04.txt.
Bad: meeting_notes_feb_4_2026.txt.
Preserve Source Timestamps: Always maintain original file creation/modification dates. These enable accurate recency scoring and help NotebookLM's RAG weight recent meeting notes, decisions, and additions appropriately. Use touch -r original converted after every conversion.
Extract Text from Scans: Scanned PDFs do not work in NotebookLM. Test with pdftotext test.pdf - | head. If blank, run ocrmypdf input.pdf output.pdf.
Use Prefixes for Ordering: Add numeric prefixes for logical ordering: 01_project_overview.pdf, 02_project_requirements.pdf.
Test Before Bulk Upload: Upload 2-3 files first to verify processing, summaries, and search accuracy. Then upload the rest.
Source Selection and Optimization:
File Naming:
Format Selection:
Timestamp Preservation:
touch -r original converted after every conversion
cp -p when copying files to preserve modification dates
Organization Structure:
Phase 1: Assessment and Prioritization
Phase 2: Conversion and Organization
Phase 3: Upload and Verification