Deep Research Report Processor

Processes deep research reports (from ChatGPT, Claude, or other platforms) generated by the deep-research-genelist skill. Cleans platform-specific artifacts, generates publication-quality PDF and/or HTML, extracts the YAML header, and maintains a growing annotation summary table.

Workflow

Step 1: Identify Input

Accept one of:

A specific file path provided by the user
"all" — glob outs/deep_research*/**/*.md (note the * after deep_research — catches deep_research/, deep_research_platynereis/, etc.), skip files ending in _clean.md, _raw.md, or _prompt.md
"new" — glob outs/deep_research*/**/*.md, same exclusions, then keep only files that don't yet have a corresponding _clean.md

If no outs/deep_research*/ directories exist, tell the user and stop.
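
The selection rules above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `find_reports` is a hypothetical helper, and the text does not say how a "corresponding" `_clean.md` is resolved, so the sketch assumes a sibling file named `<stem>_clean.md`.

```python
from pathlib import Path

EXCLUDED_SUFFIXES = ("_clean.md", "_raw.md", "_prompt.md")


def find_reports(root=".", mode="all"):
    """Return candidate report paths under <root>/outs/deep_research*/."""
    base = Path(root) / "outs"
    if not base.is_dir():
        return []  # caller tells the user and stops
    candidates = sorted(
        p for p in base.glob("deep_research*/**/*.md")
        if not p.name.endswith(EXCLUDED_SUFFIXES)
    )
    if mode == "new":
        # Assumption: "corresponding" means a sibling named <stem>_clean.md.
        candidates = [
            p for p in candidates
            if not p.with_name(p.stem + "_clean.md").exists()
        ]
    return candidates
```

An empty return value covers both "no directories" and "nothing left to process"; a caller that needs to distinguish the two can check for the directories separately.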

Directory structure

Each module gets its own subdirectory under outs/deep_research/:

outs/deep_research/
├── annotation_summary.tsv          ← cross-cluster summary (top level)
├── clade6sub25/                    ← per-module subdirectory
│   ├── clade6sub25_prompt.md       ← shared prompt (no date/platform prefix)
│   ├── 260304_chatgpt_clade6sub25_raw.md
│   ├── 260304_chatgpt_clade6sub25_clean.md
│   ├── 260304_chatgpt_clade6sub25_report.pdf
│   ├── 260304_chatgpt_clade6sub25_report.html
│   ├── 260304_claude_clade6sub25_clean.md
│   ├── 260304_claude_clade6sub25_report.pdf
│   └── 260304_claude_clade6sub25_report.html
└── another_cluster/
    └── ...

Create the subdirectory outs/deep_research/{module_id}/ if it doesn't exist. All output files for that module go inside it.

Step 2: Detect Platform and Clean

Read the file as raw bytes. Detect the platform from artifact signatures:

| Signal | Platform |
|--------|----------|
| PUA chars (U+E000–U+F8FF) present | chatgpt (Deep Research mode) |
| entity[ or citeturn in text after PUA stripping | chatgpt (Deep Research mode) |
| None of the above | Ask user (see below) |

Artifact-free reports (no PUA characters, no entity tags) can come from either Claude or ChatGPT Pro (extended thinking); both produce clean output, so the artifacts alone cannot tell them apart. When no artifacts are detected, ask the user which platform generated the report using AskUserQuestion:

chatgpt_pro — ChatGPT Pro / extended thinking / o-series models
claude — Claude (Anthropic)

This matters for the canonical filename (260330_chatgpt_pro_family_C vs 260330_claude_family_C) and for the summary table platform column.

Report detection to user: "Detected {platform} report." (or "No artifacts detected — asking platform.")
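
The detection table can be expressed as a small function. This is a sketch under stated assumptions: `detect_platform` is a hypothetical name, the file is assumed to be UTF-8, and a return value of `None` stands for the "ask the user" branch.

```python
import re

PUA_RE = re.compile(r"[\ue000-\uf8ff]")
TAG_RE = re.compile(r"entity\[|citeturn")


def detect_platform(raw: bytes):
    """Return 'chatgpt' when Deep Research artifacts are present, else None."""
    # Assumption: reports are UTF-8; undecodable bytes are replaced, not fatal.
    text = raw.decode("utf-8", errors="replace")
    if PUA_RE.search(text):
        return "chatgpt"
    # No PUA characters found, so the text is already "PUA-stripped" here.
    if TAG_RE.search(text):
        return "chatgpt"
    return None  # caller asks: chatgpt_pro or claude
```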

Parse YAML header first

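Before cleaning, extract the YAML front matter (between the first --- line and the next --- line) to get date_generated and the report identifier: module_id for cluster reports, family for family reports, kingdom for nonmetazoan reports. These are needed for the canonical filename.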
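
A minimal sketch of the front-matter split, assuming the delimiters are lines consisting only of `---` (surrounding whitespace tolerated). `split_front_matter` is a hypothetical helper name.

```python
def split_front_matter(text):
    """Return (yaml_text, body); yaml_text is None if no front matter exists."""
    lines = text.split("\n")
    fences = [i for i, line in enumerate(lines) if line.strip() == "---"]
    if len(fences) < 2:
        return None, text  # report the error and skip the file
    start, end = fences[0], fences[1]
    yaml_text = "\n".join(lines[start + 1:end])
    body = "\n".join(lines[end + 1:])
    return yaml_text, body
```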

Construct canonical base name

Pattern depends on report type:

Cluster reports: YYMMDD_platform_moduleID (e.g., 260304_chatgpt_clade6sub25)
Family reports: YYMMDD_platform_family_FAMILYID (e.g., 260330_claude_family_C)
Nonmetazoan: YYMMDD_platform_kingdom (e.g., 260315_chatgpt_prokaryote)

Components:

YYMMDD from query.date_generated (format: YYYY-MM-DD → YYMMDD). Fall back to today's date.
platform: chatgpt, chatgpt_pro, or claude (lowercase)
moduleID: from query.module_id (cluster reports)
FAMILYID: from query.family (family reports — note: NOT query.module_id)
kingdom: from query.kingdom (nonmetazoan reports)
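
The naming rules above might be implemented along these lines. `canonical_base` is a hypothetical helper; it takes the parsed `query` mapping and the detected platform, and the `today` parameter exists only to make the date fallback testable.

```python
from datetime import date, datetime


def canonical_base(query, platform, today=None):
    """Build YYMMDD_platform_identifier from the parsed query block."""
    # str() also covers YAML loaders that return a date object.
    raw_date = str(query.get("date_generated") or "")
    try:
        stamp = datetime.strptime(raw_date, "%Y-%m-%d").strftime("%y%m%d")
    except ValueError:
        stamp = (today or date.today()).strftime("%y%m%d")

    report_type = query.get("report_type") or "cluster"
    if report_type == "family":
        ident = f"family_{query['family']}"  # NOT query.module_id
    elif report_type == "nonmetazoan_characterization":
        ident = str(query["kingdom"]).lower().replace(" ", "_")
    else:
        ident = str(query["module_id"])
    return f"{stamp}_{platform.lower()}_{ident}"
```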

Platform-specific cleaning

ChatGPT reports (have artifacts):

Cleaning order is critical — PUA characters must be stripped first or other patterns won't match.

import re

# 1. Strip all PUA characters (U+E000–U+F8FF) — MUST be first
text = re.sub(r'[\ue000-\uf8ff]', '', text)

# 2. Replace entity["type","name","desc"] → name
text = re.sub(r'entity\["[^"]*","([^"]*)","[^"]*"\]', r'\1', text)

# 3. Remove citeturn... strings (the [N] refs are already inline)
text = re.sub(r'\s*citeturn\S+', '', text)

# 4. Remove image_group{...} lines
text = re.sub(r'^.*image_group\{.*\}.*$', '', text, flags=re.MULTILINE)

# 5. Normalize whitespace
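# Collapse runs of spaces between words only; leading indentation is left
# alone so the YAML header and nested markdown lists keep their structure.
text = re.sub(r'(?<=\S) {2,}(?=\S)', ' ', text)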
text = re.sub(r'\n{3,}', '\n\n', text)     # collapse triple+ newlines
text = re.sub(r' +$', '', text, flags=re.MULTILINE)  # strip trailing whitespace

Output files:

Rename raw file to {base}_raw.md (preserves original with artifacts for reference)
Write cleaned text to {base}_clean.md

Report artifact counts: "Removed X entity tags, Y citation markers, Z PUA characters."
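
One way to obtain the counts is `re.subn`, which returns the number of substitutions alongside the new text. The sketch below applies the same patterns in the same order; `clean_chatgpt` is a hypothetical wrapper, and the space-collapsing step is left out for brevity.

```python
import re


def clean_chatgpt(text):
    """Apply the cleaning steps in order; return (clean_text, counts)."""
    counts = {}
    # PUA stripping must come first or the later patterns will not match.
    text, counts["pua"] = re.subn(r"[\ue000-\uf8ff]", "", text)
    text, counts["entity"] = re.subn(
        r'entity\["[^"]*","([^"]*)","[^"]*"\]', r"\1", text
    )
    text, counts["cite"] = re.subn(r"\s*citeturn\S+", "", text)
    text = re.sub(r"^.*image_group\{.*\}.*$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    return text, counts
```

The message then follows directly from the dictionary, for example `f"Removed {counts['entity']} entity tags, {counts['cite']} citation markers, {counts['pua']} PUA characters."`.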

Claude reports (already clean):

Claude deep research outputs have no known artifacts (verified: zero PUA chars, zero entity tags, zero citeturn markers).

Rename the original file directly to {base}_clean.md (no separate _raw.md since there are no artifacts to preserve)
Apply only whitespace normalization (collapse blank lines, strip trailing spaces)

Report: "Claude report — no artifacts detected. Renamed to {base}_clean.md."

Step 3: Detect Report Type and Validate YAML Header

Parse the YAML front matter from the clean file using Python yaml.safe_load().

Report type detection: Check query.report_type to determine which parser to use:

"cluster" (or absent) → cluster cell-type-annotation parser
"family" → family cell-type-annotation parser (see Step 3c below)
"nonmetazoan_characterization" → nonmetazoan parser (see Step 3b below)

ChatGPT YAML indentation fix: ChatGPT inconsistently outputs YAML with 1-space or 2-space indentation across runs. 1-space indent is valid YAML for simple key-value mappings, but breaks for two specific constructs:

Folded block scalars (summary: >) — the continuation text lands at the same indent as sibling keys, so the parser can't distinguish block scalar content from a new key
List item mapping continuations (- cell_type: "..." followed by organism: "...") — continuation keys land at the same indent as the - marker, so they're parsed as siblings rather than part of the list item's mapping

Important: Do NOT "double all indentation" — this breaks normal mapping keys by making children indent past their siblings. Instead, apply a targeted fix that touches only the two broken patterns.

The fix: try yaml.safe_load() first. If it fails, apply fix_flat_yaml() (for zero-indent ChatGPT Pro output), then fix_chatgpt_yaml() (for 1-space indent issues), then retry.
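
The try-fix-retry flow can be sketched as below. The loader and fixers are passed in as arguments so the example stays free of third-party imports. One caveat is an assumption of this sketch rather than something stated in the text: column-0 YAML may parse without raising and simply yield `query: null`, so the sketch also treats a header whose `query` is not a mapping as a failed parse.

```python
def load_with_fallback(yaml_text, load, fixers, is_valid=None):
    """Parse with `load`; on failure, apply every fixer in order and retry once."""
    if is_valid is None:
        # Assumption: a usable header has a mapping under 'query'.
        def is_valid(data):
            return isinstance(data, dict) and isinstance(data.get("query"), dict)

    try:
        data = load(yaml_text)
        if is_valid(data):
            return data
    except Exception:
        pass  # fall through to the fixers

    for fixer in fixers:
        yaml_text = fixer(yaml_text)

    data = load(yaml_text)  # a second parse error propagates to the caller
    if not is_valid(data):
        raise ValueError("YAML parsed but has an unexpected structure")
    return data
```

With PyYAML installed, the call would look like `load_with_fallback(header, yaml.safe_load, [fix_flat_yaml, fix_chatgpt_yaml])`.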

`fix_flat_yaml` — zero-indent ChatGPT Pro / extended thinking output

ChatGPT Pro with extended thinking outputs YAML with ALL keys at column 0 (zero indent under section headers). This fixer detects and corrects this pattern before the existing fix_chatgpt_yaml runs.

import re


def fix_flat_yaml(yaml_text):
    """Fix YAML where all keys are at column 0 (ChatGPT Pro / extended thinking).

    Detects this pattern: top-level section keys (query:, annotation:, markers:)
    followed by child keys at the same indent level. Adds proper indent to all
    child keys under each section. Distinguishes mapping list items ("- key: val"
    → continuations get 6-space indent) from simple list items ('- "value"' →
    next sibling key resets to 2-space).
    """
    lines = yaml_text.split('\n')
    # Quick check: if we see 'query:' at col 0 followed by 'organism:' at col 0,
    # this is a flat YAML that needs fixing
    has_flat_pattern = False
    for i, line in enumerate(lines):
        if line.strip() == 'query:' and i + 1 < len(lines):
            next_line = lines[i + 1]
            if next_line and not next_line.startswith(' ') and ':' in next_line:
                has_flat_pattern = True
                break
    
    if not has_flat_pattern:
        return yaml_text  # Already properly indented
    
    fixed = []
    section = None
    section_keys = {'query', 'annotation', 'markers', 'classification',
                    'functional_categories', 'hgt_candidates', 'expression_patterns',
                    'biology'}
    in_block_scalar = False
    in_mapping_list = False  # inside "- key: val" list item (continuations need 6-space)
    in_simple_list = False   # inside "- value" list (next sibling key resets to 2-space)

    for line in lines:
        stripped = line.strip()
        n_spaces = len(line) - len(line.lstrip())

        if not stripped:
            in_block_scalar = False
            in_mapping_list = False
            in_simple_list = False
            fixed.append('')
            continue

        # Block scalar continuation
        if in_block_scalar and n_spaces == 0 and not any(
            stripped.startswith(k + ':') for k in section_keys
        ):
            fixed.append('    ' + stripped)
            continue
        elif in_block_scalar:
            in_block_scalar = False

        # Top-level section key
        key_candidate = stripped.split(':')[0].strip()
        if n_spaces == 0 and key_candidate in section_keys and stripped.endswith(':'):
            section = key_candidate
            in_mapping_list = False
            in_simple_list = False
            fixed.append(line)
            continue

        # Detect block scalar start
        if stripped.endswith(': >') or stripped.endswith(': |'):
            in_block_scalar = True

        # Child content under a section at col 0
        if section and n_spaces == 0:
            if stripped.startswith('---'):
                section = None
                in_mapping_list = False
                in_simple_list = False
                fixed.append(line)
                continue
            if key_candidate in section_keys:
                section = key_candidate
                in_mapping_list = False
                in_simple_list = False
                fixed.append(line)
                continue
            # List items
            if stripped.startswith('- '):
                after_dash = stripped[2:].strip()
                if re.match(r'[\w_-]+\s*:', after_dash):
                    in_mapping_list = True
                    in_simple_list = False
                else:
                    in_simple_list = True
                    in_mapping_list = False
                fixed.append('    ' + stripped)
                continue
            # Continuation key inside a mapping list item
            if in_mapping_list and re.match(r'[\w_-]+\s*:', stripped):
                fixed.append('      ' + stripped)
                continue
            # Regular key (not inside a list, or after simple list ended)
            in_mapping_list = False
            in_simple_list = False
            fixed.append('  ' + stripped)
            continue

        # Already indented — add 2 more spaces for section nesting
        if section and n_spaces > 0:
            fixed.append('  ' + line)
            continue

        fixed.append(line)

    
    return '\n'.join(fixed)

`fix_chatgpt_yaml` — 1-space indent issues

This existing fixer handles the two specific constructs broken by 1-space indentation:

After a line ending in : > or : |, adds 1 extra space to continuation lines (non-key, non-blank lines at the same indent) until a blank line or a real key is encountered
After a line matching - key: value (list item mapping start), adds 2 extra spaces to subsequent non-- key lines at the same indent (these are continuation keys inside the mapping)
Leaves all other indentation unchanged

import re


def fix_chatgpt_yaml(yaml_text):
    """Fix ChatGPT 1-space indent YAML: folded block scalars + list item mapping continuations."""
    lines = yaml_text.split('\n')
    fixed = []
    in_block_scalar = False
    block_scalar_key_indent = -1
    in_list_item = False
    list_item_indent = -1

    for i, line in enumerate(lines):
        stripped = line.lstrip(' ')
        n_spaces = len(line) - len(stripped)

        if stripped == '':
            in_block_scalar = False
            fixed.append(line)
            continue

        # FIX 1: Folded/literal block scalar continuation
        if in_block_scalar:
            if (n_spaces == block_scalar_key_indent
                    and not re.match(r'[\w-]+\s*:', stripped)
                    and not stripped.startswith('- ')):
                fixed.append(' ' * (n_spaces + 1) + stripped)
                continue
            else:
                in_block_scalar = False

        # FIX 2: List item mapping continuation
        if in_list_item:
            if (n_spaces == list_item_indent
                    and re.match(r'[\w_-]+\s*:', stripped)
                    and not stripped.startswith('- ')):
                fixed.append(' ' * (n_spaces + 2) + stripped)
                continue
            elif n_spaces <= list_item_indent and not (
                    n_spaces == list_item_indent and stripped.startswith('- ')):
                in_list_item = False

        if re.search(r':\s*[>|]\s*$', stripped):
            in_block_scalar = True
            block_scalar_key_indent = n_spaces

        if re.match(r'- [\w_-]+\s*:', stripped):
            in_list_item = True
            list_item_indent = n_spaces

        fixed.append(line)

    return '\n'.join(fixed)

This handles both 1-space and 2-space ChatGPT output without breaking valid YAML.

Validate required fields:

Required:

query.module_id, query.organism, query.n_genes
annotation.proposed_name, annotation.confidence, annotation.one_line
markers.top_diagnostic (non-empty list)

Optional but checked:

query.source_object, query.clustering_column, query.marker_file — warn if missing (these are provenance fields added in v2 of the genelist skill; older prompts won't have them)
query.comparison_mode, query.clade_family — present when merged marker mode was used; absent for single-list prompts or older prompts. Default to empty string in summary table.
query.report_type — "cluster" (default) or "family". Family reports are generated by the family-aware mode of the genelist skill.
query.member_clusters — list of fine cluster names (family reports only). Default to empty list.
query.n_member_clusters — integer count (family reports only). Default to 0.

If validation fails, report which fields are missing and ask whether to proceed.
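
A minimal sketch of the cluster-report check, assuming the header has already been parsed into nested dicts. `get_path` and `validate_cluster` are hypothetical helpers. A value of `0` counts as present; `None`, an empty string, and an empty list count as missing.

```python
REQUIRED = (
    "query.module_id", "query.organism", "query.n_genes",
    "annotation.proposed_name", "annotation.confidence", "annotation.one_line",
    "markers.top_diagnostic",
)
PROVENANCE = ("query.source_object", "query.clustering_column", "query.marker_file")


def get_path(data, dotted):
    """Look up 'a.b.c' in nested dicts; return None if any level is absent."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def validate_cluster(data):
    """Return (missing_required, provenance_warnings) as lists of dotted paths."""
    empty = (None, "", [])
    missing = [p for p in REQUIRED if get_path(data, p) in empty]
    warnings = [p for p in PROVENANCE if get_path(data, p) in empty]
    return missing, warnings
```

A non-empty `missing` list is what triggers the "proceed anyway?" question; `warnings` are reported but do not block processing.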

Step 3b: Nonmetazoan Report Validation

When query.report_type == "nonmetazoan_characterization", use this validation instead:

Detect template variant from which classification fields are present:

Has classification.n_candidate_hgt → prokaryote variant (Bacteria, Archaea, combined)
Has classification.n_database_bias → eukaryote variant (Fungi, Viridiplantae, etc.)
Has both or neither → warn and proceed with best guess
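
The variant rule could be sketched as follows. The text leaves the "best guess" unspecified; as an illustrative assumption, this sketch breaks ties on `n_lateral_transfer`, which appears only in the eukaryote-specific required fields.

```python
def detect_variant(classification):
    """Return (variant, warning); warning is None when the fields are unambiguous."""
    has_hgt = "n_candidate_hgt" in classification
    has_bias = "n_database_bias" in classification
    if has_hgt != has_bias:
        return ("prokaryote" if has_hgt else "eukaryote"), None
    # Assumption: n_lateral_transfer is used as a tie-breaker for the guess.
    guess = "eukaryote" if "n_lateral_transfer" in classification else "prokaryote"
    state = "both" if has_hgt else "neither"
    return guess, f"ambiguous variant ({state} marker fields present); guessing {guess}"
```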

Required fields (both variants):

query.kingdom, query.organism, query.n_genes, query.n_expressed
classification.assessment_confidence
At least one entry in functional_categories

Prokaryote-specific required:

classification.n_symbiont_transcript, classification.n_candidate_hgt, classification.n_conserved, classification.n_ambiguous (can be 0)

Eukaryote-specific required:

classification.n_database_bias, classification.n_conserved, classification.n_symbiont_transcript, classification.n_lateral_transfer, classification.n_ambiguous (can be 0)

Optional but checked:

hgt_candidates — warn if absent for prokaryote reports
expression_patterns.cell_type_enriched_genes — warn if empty
biology.symbiosis_relevant — prokaryote only
biology.evolutionary_insight — eukaryote only

Canonical base name for nonmetazoan reports: YYMMDD_platform_kingdom (e.g., 260315_chatgpt_prokaryote, 260315_claude_fungi)

kingdom from query.kingdom, lowercased, spaces replaced with underscores

Directory: Reports go in outs/deep_research/nonmetazoan/{kingdom}/ (note: under deep_research/nonmetazoan/, not the script-04 outs/scmicrobiome/ directory — the processed reports live alongside cell-type-annotation reports).

Step 3c: Family Report Validation

When query.report_type == "family", use this validation:

Required:

query.family, query.organism, query.n_modules or query.n_member_clusters
annotation.proposed_family_name, annotation.confidence, annotation.one_line
annotation.per_cluster (non-empty list with at least cluster and proposed_name per entry)

Optional but checked:

query.subfamilies — subfamily structure. Default empty.
annotation.best_matches — comparative matches. Default empty list.
annotation.family_conservation — conservation level. Default empty.
markers.family_defining — family marker list. Warn if empty.
markers.family_specifying_tfs, markers.cluster_specifying_tfs — TF annotations.

Canonical base name: YYMMDD_platform_family_FAMILYID

Directory: Reports go in the same directory as the input file (family reports already live in their own subdirectory, e.g., outs/deep_research_platynereis/family_C/).

Step 4: Ask Output Format

Use AskUserQuestion:

Both (Recommended) — PDF for sharing, HTML for browsing with clickable DOIs
PDF only — LaTeX-rendered, good for sharing/printing
HTML only — Standalone, nice table styling, clickable links

Step 5: Generate Outputs

Create a temporary {base}_report_for_render.md from {base}_clean.md:

Strip the data YAML header — the large query:/annotation:/markers: (or classification:/functional_categories:) block is for programmatic parsing only, not for display
Insert a pandoc formatting YAML. The title format depends on report type:
- Cell-type annotation: "Gene Module Interpretation: *{organism}* {module_id}"
- Family annotation: "Cell Type Family Report: *{organism}* Family {family}"
- Nonmetazoan: "Non-Metazoan Gene Characterization: *{organism}* — {kingdom}"
```
---
title: "<title per report type>"
subtitle: "Deep Research Report — {Platform} ({date_generated})"
geometry: margin=1in
fontsize: 11pt
header-includes:
  - \renewcommand{\arraystretch}{1.4}
  - \usepackage{booktabs}
  - \usepackage{longtable}
---
```
The \arraystretch{1.4} adds 40% vertical padding to every table row — critical for readability.
Remove the H1 title from the body (it's now in the YAML title: field)
Normalize heading levels — ChatGPT reports nest headings too deep, producing tiny headers at H4/H5 in PDFs. Apply these fixes in order:

a. Remove redundant heading pairs. ChatGPT often outputs a generic ## heading immediately followed by the lettered ### section (e.g., ## Comparative biological analysis then ### H. Comparative Biological Analysis). Remove the redundant ## line:
```
body = re.sub(r'^##\s+[^\n]+\n\n(###\s+[A-L]\.)', r'\1', body, flags=re.MULTILINE)
```
b. Shift all headings up one level. Since the H1 title was removed, the remaining hierarchy is too deep (##→####). Shift ##→#, ###→##, ####→###, etc. (never shift H1):
```
def shift_heading(m):
    hashes = m.group(1)
    rest = m.group(2)
    if len(hashes) > 1:
        return '#' * (len(hashes) - 1) + rest
    return m.group(0)
body = re.sub(r'^(#{2,})([ \t].*)$', shift_heading, body, flags=re.MULTILINE)
```
After normalization, the typical hierarchy is: # (sections A–K), ## (subsections like "Core identity modules"), ### (sub-subsections).
Escape backslashes in gene names for LaTeX. ChatGPT reports sometimes include gene names with literal backslashes (e.g., Dmel\cg5579), which LaTeX interprets as control sequences. Escape them in the body only:
```
body = re.sub(r'\\(?=[a-zA-Z])', r'\\\\', body)
```

Detect rendering tools: Find quarto on PATH or common locations (~/.local/bin/quarto, /usr/local/bin/quarto). Check for xelatex availability — if not on PATH, try module load texlive (HPC cluster). If no LaTeX available, skip PDF and inform user. On cluster, shell commands for PDF rendering must include module load texlive && prefix.
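
Tool discovery can be sketched with `shutil.which`. The helper names are hypothetical. Note that `module load texlive` is a shell-level mechanism, so it cannot be probed this way; on a cluster the check would have to run inside a shell that has loaded the module.

```python
import shutil
from pathlib import Path

COMMON_QUARTO_PATHS = ("~/.local/bin/quarto", "/usr/local/bin/quarto")


def find_quarto(extra=COMMON_QUARTO_PATHS):
    """Return a path to quarto from PATH or the fallback locations, else None."""
    found = shutil.which("quarto")
    if found:
        return found
    for candidate in extra:
        path = Path(candidate).expanduser()
        if path.is_file():
            return str(path)
    return None


def has_xelatex():
    """True when xelatex is on PATH (does not account for `module load`)."""
    return shutil.which("xelatex") is not None
```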

Important: When writing the pandoc YAML header from Python, write the LaTeX commands directly to the file (not through Python f-strings). The backslashes in \renewcommand, \arraystretch, \usepackage must appear literally in the markdown file — do NOT double-escape them in Python. Use f.write() with raw strings or explicit line writes.
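
The reason for the warning is that in an ordinary Python string literal `"\renewcommand"` silently becomes a carriage return followed by `enewcommand`, and `"\usepackage"` is a syntax error. A minimal sketch that keeps the LaTeX lines in a raw string and interpolates only the title and subtitle (`build_render_markdown` is a hypothetical helper):

```python
# Raw string: every backslash below reaches the file exactly as written.
PANDOC_HEADER_TAIL = r"""geometry: margin=1in
fontsize: 11pt
header-includes:
  - \renewcommand{\arraystretch}{1.4}
  - \usepackage{booktabs}
  - \usepackage{longtable}
---
"""


def build_render_markdown(title, subtitle, body):
    """Return the pandoc header followed by the report body."""
    # Only these two lines use f-strings; they contain no LaTeX commands.
    head = "---\n" + f'title: "{title}"\n' + f'subtitle: "{subtitle}"\n'
    return head + PANDOC_HEADER_TAIL + "\n" + body
```

The result can be written with a plain `f.write(...)`; no further escaping is needed.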

Generate PDF:

{quarto_path} pandoc \
  "{base}_report_for_render.md" \
  -o "{base}_report.pdf" \
  --pdf-engine=xelatex \
  -V colorlinks=true -V linkcolor=blue -V urlcolor=blue

On cluster, prepend module load texlive && before the quarto command.

Generate HTML:

{quarto_path} pandoc \
  "{base}_report_for_render.md" \
  -o "{base}_report.html" \
  --standalone \
  --css="$HOME/.claude/skills/deep-research-reports/templates/report-style.css" \
  --embed-resources

The --embed-resources flag inlines the CSS so the HTML is fully self-contained.

Delete {base}_report_for_render.md after successful conversion.

If pandoc fails (e.g., LaTeX error), fall back to HTML only and notify the user.

Step 6: Update Summary Table

Extract fields from validated YAML and update outs/deep_research/annotation_summary.tsv.

Composite key: module_id + platform + date_generated. This allows the table to hold both ChatGPT and Claude annotations for the same cluster side by side.

Logic:

Read existing TSV if it exists (using Python csv with delimiter='\t')
Check if a row with matching composite key exists
If yes: show old vs new proposed_name, ask to update or skip
If no: append new row
Sort by module_id, then platform
Write back TSV
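
The update logic can be sketched with the standard `csv` module. `upsert_row` is a hypothetical helper: it returns `"exists"` without writing when the composite key is already present, so the caller can show old vs new proposed_name, ask the user, and call again with `overwrite=True`.

```python
import csv
from pathlib import Path

KEY_FIELDS = ("module_id", "platform", "date_generated")


def upsert_row(tsv_path, new_row, columns, overwrite=False):
    """Add or replace a row keyed on KEY_FIELDS; return 'added', 'updated' or 'exists'."""
    path = Path(tsv_path)
    rows = []
    if path.exists():
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh, delimiter="\t"))

    key = tuple(str(new_row.get(k, "")) for k in KEY_FIELDS)
    status = "added"
    for i, row in enumerate(rows):
        if tuple((row.get(k) or "") for k in KEY_FIELDS) == key:
            if not overwrite:
                return "exists"
            rows[i] = new_row
            status = "updated"
            break
    else:
        rows.append(new_row)

    rows.sort(key=lambda r: (str(r.get("module_id", "")), str(r.get("platform", ""))))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, delimiter="\t",
            restval="", extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(rows)
    return status
```

`restval=""` fills columns that older reports lack, and `extrasaction="ignore"` drops keys that are not summary columns.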

Columns — parse ALL YAML fields into the summary table:

module_id, organism, common_name, clade, dataset, module_type, report_type, source_object, clustering_column, marker_file, comparison_mode, clade_family, member_clusters, n_member_clusters, biological_context, n_genes, proposed_name, alternative_names, confidence, confidence_rationale, one_line, summary, cell_type_family, family_conservation, top_tfs, top_markers, top_diagnostic_ids, top_diagnostic_roles, receptors_channels, signaling_ligands, adhesion_molecules, secreted_products, key_pathways, metabolic_signature, n_uncharacterized, best_match_1, best_match_2, best_match_3, date_generated, platform, report_file, date_processed

Field mapping from YAML:

| Summary column | YAML source | Format |
|---------------|-------------|--------|
| module_id | query.module_id | verbatim |
| organism | query.organism | verbatim |
| common_name | query.common_name | verbatim |
| clade | query.clade | verbatim |
| dataset | query.dataset | verbatim |
| module_type | query.module_type | verbatim |
| report_type | query.report_type | verbatim; default "cluster" |
| source_object | query.source_object | verbatim |
| clustering_column | query.clustering_column | verbatim |
| marker_file | query.marker_file | verbatim |
| comparison_mode | query.comparison_mode | verbatim; default empty |
| clade_family | query.clade_family | verbatim; default empty |
| member_clusters | query.member_clusters | join with "; "; default empty (family reports only) |
| n_member_clusters | query.n_member_clusters | integer; default 0 (family reports only) |
| biological_context | query.biological_context | verbatim |
| n_genes | query.n_genes | integer |
| proposed_name | annotation.proposed_name | verbatim |
| alternative_names | annotation.alternative_names | join with "; " |
| confidence | annotation.confidence | verbatim |
| confidence_rationale | annotation.confidence_rationale | verbatim |
| one_line | annotation.one_line | verbatim |
| summary | annotation.summary | verbatim (may be multi-line; flatten to a single line) |
| cell_type_family | annotation.cell_type_family | verbatim |
| family_conservation | annotation.family_conservation | verbatim |
| top_tfs | markers.transcription_factors | join with "; " |
| top_markers | markers.top_diagnostic[].name | top 5, join with "; " |
| top_diagnostic_ids | markers.top_diagnostic[].gene_id | top 5, join with "; " |
| top_diagnostic_roles | markers.top_diagnostic[].role | top 5, join with "; " |
| receptors_channels | markers.receptors_channels | join with "; " |
| signaling_ligands | markers.signaling_ligands | join with "; " |
| adhesion_molecules | markers.adhesion_molecules | join with "; " |
| secreted_products | markers.secreted_products | join with "; " |
| key_pathways | markers.key_pathways | join with "; " |
| metabolic_signature | markers.metabolic_signature | verbatim |
| n_uncharacterized | markers.n_uncharacterized_notable | integer |
| best_match_1 | annotation.best_matches[0] | cell_type (organism) [conservation] |
| best_match_2 | annotation.best_matches[1] | cell_type (organism) [conservation] |
| best_match_3 | annotation.best_matches[2] | cell_type (organism) [conservation] |
| date_generated | query.date_generated | verbatim |
| platform | detected platform | chatgpt, chatgpt_pro, or claude |
| report_file | computed path | {base}_clean.md |
| date_processed | current date | YYYY-MM-DD |

Fields may be absent in older reports — default to empty string.
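
The three non-verbatim formats in the mapping (joined lists, the flattened summary, and the best-match string) can be sketched as small helpers. The names are hypothetical, and the `conservation` key inside each best_matches entry is an assumption inferred from the format column.

```python
def join_list(values, limit=None):
    """Join a list with '; ', optionally keeping only the first `limit` items."""
    items = [str(v) for v in (values or [])]
    return "; ".join(items[:limit])


def flatten(text):
    """Collapse newlines and repeated whitespace so the value fits one TSV cell."""
    return " ".join(str(text or "").split())


def format_match(matches, index):
    """Render best_matches[index] as 'cell_type (organism) [conservation]'."""
    if index >= len(matches or []):
        return ""  # absent in older reports: default to empty string
    m = matches[index]
    return f"{m.get('cell_type', '')} ({m.get('organism', '')}) [{m.get('conservation', '')}]"
```

For example, top_markers would be `join_list([d.get("name", "") for d in top_diagnostic], limit=5)`.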

Family report field mappings: For family reports, the summary table uses the same annotation_summary.tsv but with these field mappings:

| Summary column | Family YAML source |
|---|---|
| module_id | query.family (prefixed with family_, e.g., family_C) |
| proposed_name | annotation.proposed_family_name |
| member_clusters | query.member_clusters joined with "; " |
| n_member_clusters | query.n_member_clusters |

All other columns map the same as cluster reports. The per_cluster annotations are NOT added to the summary table row — they are available in the clean markdown file for detailed parsing.

Step 6b: Nonmetazoan Summary Table

For nonmetazoan reports (report_type == "nonmetazoan_characterization"), update a separate summary table at outs/deep_research/nonmetazoan_summary.tsv. Do NOT mix with annotation_summary.tsv — the column semantics are fundamentally different.

Composite key: kingdom + platform + date_generated

Columns:

| Column | Source | Format |
|--------|--------|--------|
| kingdom | query.kingdom | verbatim |
| phylum | query.phylum | verbatim |
| organism | query.organism | verbatim |
| common_name | query.common_name | verbatim |
| n_genes | query.n_genes | int |
| n_expressed | query.n_expressed | int |
| template_variant | detected from fields | "prokaryote" or "eukaryote" |
| n_symbiont_transcript | classification.n_symbiont_transcript | int; 0 if absent |
| n_candidate_hgt | classification.n_candidate_hgt | int; 0 if absent |
| n_database_bias | classification.n_database_bias | int; 0 if absent |
| n_lateral_transfer | classification.n_lateral_transfer | int; 0 if absent |
| n_conserved | classification.n_conserved | int |
| n_ambiguous | classification.n_ambiguous | int |
| assessment_confidence | classification.assessment_confidence | str |
| confidence_rationale | classification.confidence_rationale | str |
| n_functional_categories | len(functional_categories) | int |
| top_categories | top 5 category names by n_genes | join "; " |
| top_category_origins | matching likely_origin for top 5 | join "; " |
| symbiosis_relevant | biology.symbiosis_relevant | bool; empty for eukaryote |
| evolutionary_insight | biology.evolutionary_insight | str; empty for prokaryote |
| key_functions | biology.key_functions | join "; " |
| recommended_followup | biology.recommended_followup | join "; " |
| n_hgt_candidates | len(hgt_candidates.top_candidates) | int; 0 for eukaryote |
| top_hgt_genes | top 3 candidate gene_ids | join "; " |
| n_cell_type_enriched | len(expression_patterns.cell_type_enriched_genes) | int |
| top_enriched_genes | top 5 gene_id values | join "; " |
| notable_associations | expression_patterns.notable_associations | join "; " |
| date_generated | query.date_generated | str |
| platform | detected | "chatgpt", "chatgpt_pro", or "claude" |
| report_file | computed path | str |
| date_processed | current date | YYYY-MM-DD |
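
The two derived columns top_categories and top_category_origins could be computed as below. This is a sketch assuming functional_categories is a list of mappings; the `name` key is hypothetical, while `n_genes` and `likely_origin` come from the table.

```python
def top_categories(functional_categories, n=5):
    """Return (names, origins) for the n largest categories, each joined with '; '."""
    ranked = sorted(
        functional_categories or [],
        key=lambda c: c.get("n_genes") or 0,
        reverse=True,
    )[:n]
    # Assumption: each category stores its label under 'name'.
    names = "; ".join(str(c.get("name", "")) for c in ranked)
    origins = "; ".join(str(c.get("likely_origin", "")) for c in ranked)
    return names, origins
```

Because both strings are built from the same ranked list, the origins stay aligned position by position with the category names.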

Step 7: Report Results

Cell-type-annotation reports:

Processed: clade6sub25_annotation_report.md
  Platform: Claude
  Renamed: clade6sub25/260304_claude_clade6sub25_clean.md
  PDF: outs/deep_research/clade6sub25/260304_claude_clade6sub25_report.pdf
  HTML: outs/deep_research/clade6sub25/260304_claude_clade6sub25_report.html
  Summary: annotation_summary.tsv (new row added)

  Annotation: "hemocyte-like immune/scavenger cells (GCM+ phagocytes)"
  Confidence: medium
  Cell type family: immune/scavenger (hemocyte-like)

Nonmetazoan characterization reports:

Processed: deep-research-report (24).md
  Platform: ChatGPT
  Report type: nonmetazoan_characterization (prokaryote variant)
  Cleaned: prokaryote/260315_chatgpt_prokaryote_clean.md
  PDF: prokaryote/260315_chatgpt_prokaryote_report.pdf
  HTML: prokaryote/260315_chatgpt_prokaryote_report.html
  Summary: nonmetazoan_summary.tsv (new row added)

  Kingdom: Prokaryote (369 genes, 178 expressed)
  Classification: 310 symbiont, 12 HGT, 25 conserved, 22 ambiguous
  Confidence: low
  Top HGT candidates: c102759-g4 (Protein-ADP-ribose hydrolase), c101192-g1 (Deubiquitinase)

Batch Mode

When processing multiple reports (via "all" or "new"), the skill:

Processes each report sequentially through Steps 2–6
Asks the output format question once (applies to all reports)
At the end, shows a summary table of all processed reports
Rebuilds the summary TSV from all clean files to ensure consistency

Notes

The _clean.md files are the archival versions — they retain the YAML header for parsing and are the source of truth for the summary table.
The prompt file ({module_id}_prompt.md) is shared across platforms and dates — it doesn't get the date/platform prefix.
If a report has no YAML front matter at all, report the error clearly and skip that file.
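Platform detection is deterministic only when artifacts are present: PUA chars are unique to ChatGPT's Deep Research export. Artifact-free reports cannot be attributed automatically and still require asking the user. If future platforms introduce new artifacts, add detection patterns to templates/cleaning-patterns.md.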

Deep Research Report Processor

Workflow

Step 1: Identify Input

Accept one of:

A specific file path provided by the user
"all" — glob outs/deep_research*/**/*.md (note the * after deep_research — catches deep_research/, deep_research_platynereis/, etc.), skip files ending in _clean.md, _raw.md, or _prompt.md
"new" — glob outs/deep_research*/**/*.md, same exclusions, then keep only files that don't yet have a corresponding _clean.md

If no outs/deep_research*/ directories exist, tell the user and stop.

Directory structure

Each module gets its own subdirectory under outs/deep_research/:

outs/deep_research/
├── annotation_summary.tsv          ← cross-cluster summary (top level)
├── clade6sub25/                    ← per-module subdirectory
│   ├── clade6sub25_prompt.md       ← shared prompt (no date/platform prefix)
│   ├── 260304_chatgpt_clade6sub25_raw.md
│   ├── 260304_chatgpt_clade6sub25_clean.md
│   ├── 260304_chatgpt_clade6sub25_report.pdf
│   ├── 260304_chatgpt_clade6sub25_report.html
│   ├── 260304_claude_clade6sub25_clean.md
│   ├── 260304_claude_clade6sub25_report.pdf
│   └── 260304_claude_clade6sub25_report.html
└── another_cluster/
    └── ...

Create the subdirectory outs/deep_research/{module_id}/ if it doesn't exist. All output files for that module go inside it.

Step 2: Detect Platform and Clean

Read the file as raw bytes. Detect the platform from artifact signatures:

chatgpt_pro — ChatGPT Pro / extended thinking / o-series models
claude — Claude (Anthropic)

This matters for the canonical filename (260330_chatgpt_pro_family_C vs 260330_claude_family_C) and for the summary table platform column.

Report detection to user: "Detected {platform} report." (or "No artifacts detected — asking platform.")

Parse YAML header first

Before cleaning, extract the YAML front matter (between first --- line and next --- line) to get module_id and date_generated. These are needed for the canonical filename.

Construct canonical base name

Pattern depends on report type:

Cluster reports: YYMMDD_platform_moduleID (e.g., 260304_chatgpt_clade6sub25)
Family reports: YYMMDD_platform_family_FAMILYID (e.g., 260330_claude_family_C)
Nonmetazoan: YYMMDD_platform_kingdom (e.g., 260315_chatgpt_prokaryote)

Components:

YYMMDD from query.date_generated (format: YYYY-MM-DD → YYMMDD). Fall back to today's date.
platform: chatgpt, chatgpt_pro, or claude (lowercase)
moduleID: from query.module_id (cluster reports)
FAMILYID: from query.family (family reports — note: NOT query.module_id)
kingdom: from query.kingdom (nonmetazoan reports)

Platform-specific cleaning

ChatGPT reports (have artifacts):

Cleaning order is critical — PUA characters must be stripped first or other patterns won't match.

import re

# 1. Strip all PUA characters (U+E000–U+F8FF) — MUST be first
text = re.sub(r'[\ue000-\uf8ff]', '', text)

# 2. Replace entity["type","name","desc"] → name
text = re.sub(r'entity\["[^"]*","([^"]*)","[^"]*"\]', r'\1', text)

# 3. Remove citeturn... strings (the [N] refs are already inline)
text = re.sub(r'\s*citeturn\S+', '', text)

# 4. Remove image_group{...} lines
text = re.sub(r'^.*image_group\{.*\}.*$', '', text, flags=re.MULTILINE)

# 5. Normalize whitespace
text = re.sub(r'(?<=\S) {2,}', ' ', text)  # collapse runs of spaces inside a line; leading indentation (YAML header, nested lists) is kept
text = re.sub(r'\n{3,}', '\n\n', text)     # collapse triple+ newlines
text = re.sub(r' +$', '', text, flags=re.MULTILINE)  # strip trailing whitespace

Output files:

Rename raw file to {base}_raw.md (preserves original with artifacts for reference)
Write cleaned text to {base}_clean.md

Report artifact counts: "Removed X entity tags, Y citation markers, Z PUA characters."
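The counts for that message can be collected with re.subn, which returns the number of substitutions alongside the new text. This is a sketch only: clean_chatgpt is a hypothetical wrapper around the first three cleaning steps, and the sample string is made up for illustration.

```python
import re


def clean_chatgpt(text):
    """Apply the artifact-removal steps in order and count what was removed."""
    text, n_pua = re.subn(r'[\ue000-\uf8ff]', '', text)  # must run first
    text, n_entity = re.subn(r'entity\["[^"]*","([^"]*)","[^"]*"\]', r'\1', text)
    text, n_cite = re.subn(r'\s*citeturn\S+', '', text)
    return text, {'entity': n_entity, 'cite': n_cite, 'pua': n_pua}


sample = ('\ue200entity["gene","GCM","glial cells missing"]\ue201 '
          'is a marker citeturn0search1 in hemocytes.')
cleaned, counts = clean_chatgpt(sample)
print(cleaned)
print(f"Removed {counts['entity']} entity tags, {counts['cite']} citation markers, "
      f"{counts['pua']} PUA characters.")
```

Note that the citeturn pattern consumes every non-space character after the marker, so punctuation directly attached to a marker is removed with it.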

Claude reports (already clean):

Claude deep research outputs have no known artifacts (verified: zero PUA chars, zero entity tags, zero citeturn markers).

Rename the original file directly to {base}_clean.md (no separate _raw.md since there are no artifacts to preserve)
Apply only whitespace normalization (collapse blank lines, strip trailing spaces)

Report: "Claude report — no artifacts detected. Renamed to {base}_clean.md."

Step 3: Detect Report Type and Validate YAML Header

Parse the YAML front matter from the clean file using Python yaml.safe_load().

Report type detection: Check query.report_type to determine which parser to use:

"cluster" (or absent) → cluster cell-type-annotation parser
"family" → family cell-type-annotation parser (see Step 3c below)
"nonmetazoan_characterization" → nonmetazoan parser (see Step 3b below)

ChatGPT YAML repair: ChatGPT reports can arrive with 1-space YAML indentation, which breaks two constructs:

Folded block scalars (summary: >): the continuation text lands at the same indent as sibling keys, so the parser can't distinguish block scalar content from a new key
List item mapping continuations (- cell_type: "..." followed by organism: "..."): continuation keys land at the same indent as the - marker, so they're parsed as siblings rather than part of the list item's mapping

The fix: try yaml.safe_load() first. If it fails, apply fix_flat_yaml() (for zero-indent ChatGPT Pro output), then fix_chatgpt_yaml() (for 1-space indent issues), then retry.
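The try-then-repair sequence might be wired up as below. To keep the sketch free of third-party imports, the parser and the fixers are passed in as arguments; load_with_repairs is a hypothetical name, and in the skill the call would presumably look like load_with_repairs(header, yaml.safe_load, [fix_flat_yaml, fix_chatgpt_yaml]).

```python
def load_with_repairs(yaml_text, loader, fixers):
    """Parse yaml_text; on failure apply each fixer in order, then retry once."""
    try:
        return loader(yaml_text)
    except Exception:  # with PyYAML this could be narrowed to yaml.YAMLError
        repaired = yaml_text
        for fix in fixers:
            repaired = fix(repaired)
        return loader(repaired)  # a second failure propagates to the caller
```

Parsing the untouched text first means valid headers never pass through the fixers at all.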

`fix_flat_yaml` — zero-indent ChatGPT Pro / extended thinking output

def fix_flat_yaml(yaml_text):
    """Fix YAML where all keys are at column 0 (ChatGPT Pro / extended thinking).

    Detects this pattern: top-level section keys (query:, annotation:, markers:)
    followed by child keys at the same indent level. Adds proper indent to all
    child keys under each section. Distinguishes mapping list items ("- key: val"
    → continuations get 6-space indent) from simple list items ('- "value"' →
    next sibling key resets to 2-space).
    """
    lines = yaml_text.split('\n')
    # Quick check: if we see 'query:' at col 0 followed by 'organism:' at col 0,
    # this is a flat YAML that needs fixing
    has_flat_pattern = False
    for i, line in enumerate(lines):
        if line.strip() == 'query:' and i + 1 < len(lines):
            next_line = lines[i + 1]
            if next_line and not next_line.startswith(' ') and ':' in next_line:
                has_flat_pattern = True
                break
    
    if not has_flat_pattern:
        return yaml_text  # Already properly indented
    
    fixed = []
    section = None
    section_keys = {'query', 'annotation', 'markers', 'classification',
                    'functional_categories', 'hgt_candidates', 'expression_patterns',
                    'biology'}
    in_block_scalar = False
    in_mapping_list = False  # inside "- key: val" list item (continuations need 6-space)
    in_simple_list = False   # inside "- value" list (next sibling key resets to 2-space)

    for line in lines:
        stripped = line.strip()
        n_spaces = len(line) - len(line.lstrip())

        if not stripped:
            in_block_scalar = False
            in_mapping_list = False
            in_simple_list = False
            fixed.append('')
            continue

        # Block scalar continuation
        if in_block_scalar and n_spaces == 0 and not any(
            stripped.startswith(k + ':') for k in section_keys
        ):
            fixed.append('    ' + stripped)
            continue
        elif in_block_scalar:
            in_block_scalar = False

        # Top-level section key
        key_candidate = stripped.split(':')[0].strip()
        if n_spaces == 0 and key_candidate in section_keys and stripped.endswith(':'):
            section = key_candidate
            in_mapping_list = False
            in_simple_list = False
            fixed.append(line)
            continue

        # Detect block scalar start
        if stripped.endswith(': >') or stripped.endswith(': |'):
            in_block_scalar = True

        # Child content under a section at col 0
        if section and n_spaces == 0:
            if stripped.startswith('---'):
                section = None
                in_mapping_list = False
                in_simple_list = False
                fixed.append(line)
                continue
            if key_candidate in section_keys:
                section = key_candidate
                in_mapping_list = False
                in_simple_list = False
                fixed.append(line)
                continue
            # List items
            if stripped.startswith('- '):
                after_dash = stripped[2:].strip()
                if re.match(r'[\w_-]+\s*:', after_dash):
                    in_mapping_list = True
                    in_simple_list = False
                else:
                    in_simple_list = True
                    in_mapping_list = False
                fixed.append('    ' + stripped)
                continue
            # Continuation key inside a mapping list item
            if in_mapping_list and re.match(r'[\w_-]+\s*:', stripped):
                fixed.append('      ' + stripped)
                continue
            # Regular key (not inside a list, or after simple list ended)
            in_mapping_list = False
            in_simple_list = False
            fixed.append('  ' + stripped)
            continue

        # Already indented — add 2 more spaces for section nesting
        if section and n_spaces > 0:
            fixed.append('  ' + line)
            continue

        fixed.append(line)

    
    return '\n'.join(fixed)

`fix_chatgpt_yaml` — 1-space indent issues

This existing fixer handles the two specific constructs broken by 1-space indentation:

After a line ending in : > or : |, adds 1 extra space to continuation lines (non-key, non-blank lines at the same indent) until a blank line or a real key is encountered
After a line matching - key: value (list item mapping start), adds 2 extra spaces to subsequent non-- key lines at the same indent (these are continuation keys inside the mapping)
Leaves all other indentation unchanged

def fix_chatgpt_yaml(yaml_text):
    """Fix ChatGPT 1-space indent YAML: folded block scalars + list item mapping continuations."""
    lines = yaml_text.split('\n')
    fixed = []
    in_block_scalar = False
    block_scalar_key_indent = -1
    in_list_item = False
    list_item_indent = -1

    for i, line in enumerate(lines):
        stripped = line.lstrip(' ')
        n_spaces = len(line) - len(stripped)

        if stripped == '':
            in_block_scalar = False
            fixed.append(line)
            continue

        # FIX 1: Folded/literal block scalar continuation
        if in_block_scalar:
            if (n_spaces == block_scalar_key_indent
                    and not re.match(r'[\w-]+\s*:', stripped)
                    and not stripped.startswith('- ')):
                fixed.append(' ' * (n_spaces + 1) + stripped)
                continue
            else:
                in_block_scalar = False

        # FIX 2: List item mapping continuation
        if in_list_item:
            if (n_spaces == list_item_indent
                    and re.match(r'[\w_-]+\s*:', stripped)
                    and not stripped.startswith('- ')):
                fixed.append(' ' * (n_spaces + 2) + stripped)
                continue
            elif n_spaces <= list_item_indent and not (
                    n_spaces == list_item_indent and stripped.startswith('- ')):
                in_list_item = False

        if re.search(r':\s*[>|]\s*$', stripped):
            in_block_scalar = True
            block_scalar_key_indent = n_spaces

        if re.match(r'- [\w_-]+\s*:', stripped):
            in_list_item = True
            list_item_indent = n_spaces

        fixed.append(line)

    return '\n'.join(fixed)

This handles both 1-space and 2-space ChatGPT output without breaking valid YAML.

Validate required fields:

Required:

query.module_id, query.organism, query.n_genes
annotation.proposed_name, annotation.confidence, annotation.one_line
markers.top_diagnostic (non-empty list)

Optional but checked:

query.source_object, query.clustering_column, query.marker_file — warn if missing (these are provenance fields added in v2 of the genelist skill; older prompts won't have them)
query.comparison_mode, query.clade_family — present when merged marker mode was used; absent for single-list prompts or older prompts. Default to empty string in summary table.
query.report_type — "cluster" (default) or "family". Family reports are generated by the family-aware mode of the genelist skill.
query.member_clusters — list of fine cluster names (family reports only). Default to empty list.
query.n_member_clusters — integer count (family reports only). Default to 0.

If validation fails, report which fields are missing and ask whether to proceed.
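A check for the required cluster-report fields could look like this. It is a minimal sketch: missing_fields is a hypothetical helper, and it treats None, an empty string, and an empty list as missing while accepting 0 as a valid value.

```python
REQUIRED = [
    'query.module_id', 'query.organism', 'query.n_genes',
    'annotation.proposed_name', 'annotation.confidence', 'annotation.one_line',
    'markers.top_diagnostic',
]


def missing_fields(header, required=REQUIRED):
    """Return the dotted paths that are absent, None, or empty in `header`."""
    missing = []
    for path in required:
        node = header
        for key in path.split('.'):
            # Walk one level down; anything that is not a mapping ends the walk.
            node = node.get(key) if isinstance(node, dict) else None
        if node is None or node == '' or node == []:
            missing.append(path)
    return missing
```

The returned list can be shown to the user verbatim before asking whether to proceed.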

Step 3b: Nonmetazoan Report Validation

When query.report_type == "nonmetazoan_characterization", use this validation instead:

Detect template variant from which classification fields are present:

Has classification.n_candidate_hgt → prokaryote variant (Bacteria, Archaea, combined)
Has classification.n_database_bias → eukaryote variant (Fungi, Viridiplantae, etc.)
Has both or neither → warn and proceed with best guess
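The variant rule translates into a short function. In this sketch detect_variant is a hypothetical name, and because the text does not say how the "best guess" is made, the ambiguous case returns None plus a warning and leaves the guess to the caller.

```python
def detect_variant(classification):
    """Return (variant, warning) for a nonmetazoan report's classification mapping."""
    has_hgt = 'n_candidate_hgt' in classification
    has_bias = 'n_database_bias' in classification
    if has_hgt and not has_bias:
        return 'prokaryote', None
    if has_bias and not has_hgt:
        return 'eukaryote', None
    return None, 'classification has both or neither variant-defining field'
```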

Required fields (both variants):

query.kingdom, query.organism, query.n_genes, query.n_expressed
classification.assessment_confidence
At least one entry in functional_categories

Prokaryote-specific required:

classification.n_symbiont_transcript, classification.n_candidate_hgt, classification.n_conserved, classification.n_ambiguous (can be 0)

Eukaryote-specific required:

classification.n_database_bias, classification.n_conserved, classification.n_symbiont_transcript, classification.n_lateral_transfer, classification.n_ambiguous (can be 0)

Optional but checked:

hgt_candidates — warn if absent for prokaryote reports
expression_patterns.cell_type_enriched_genes — warn if empty
biology.symbiosis_relevant — prokaryote only
biology.evolutionary_insight — eukaryote only

Canonical base name for nonmetazoan reports: YYMMDD_platform_kingdom (e.g., 260315_chatgpt_prokaryote, 260315_claude_fungi)

kingdom from query.kingdom, lowercased, spaces replaced with underscores

Step 3c: Family Report Validation

When query.report_type == "family", use this validation:

Required:

query.family, query.organism, and either query.n_modules or query.n_member_clusters
annotation.proposed_family_name, annotation.confidence, annotation.one_line
annotation.per_cluster (non-empty list with at least cluster and proposed_name per entry)
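Family reports add two checks that a plain field lookup does not cover: an either/or count field and a per-entry check on per_cluster. A possible sketch, with family_problems as a hypothetical helper name:

```python
def family_problems(header):
    """List problems with the required fields of a family report header."""
    query = header.get('query') or {}
    annotation = header.get('annotation') or {}
    problems = []
    for key in ('family', 'organism'):
        if query.get(key) in (None, ''):
            problems.append(f'query.{key}')
    # Either count field satisfies the requirement.
    if query.get('n_modules') is None and query.get('n_member_clusters') is None:
        problems.append('query.n_modules or query.n_member_clusters')
    for key in ('proposed_family_name', 'confidence', 'one_line'):
        if annotation.get(key) in (None, ''):
            problems.append(f'annotation.{key}')
    per_cluster = annotation.get('per_cluster') or []
    if not per_cluster:
        problems.append('annotation.per_cluster')
    for i, entry in enumerate(per_cluster):
        # Each entry needs at least `cluster` and `proposed_name`.
        if not (isinstance(entry, dict) and entry.get('cluster') and entry.get('proposed_name')):
            problems.append(f'annotation.per_cluster[{i}]')
    return problems
```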

Optional but checked:

query.subfamilies — subfamily structure. Default empty.
annotation.best_matches — comparative matches. Default empty list.
annotation.family_conservation — conservation level. Default empty.
markers.family_defining — family marker list. Warn if empty.
markers.family_specifying_tfs, markers.cluster_specifying_tfs — TF annotations.

Canonical base name: YYMMDD_platform_family_FAMILYID

Directory: Reports go in the same directory as the input file (family reports already live in their own subdirectory, e.g., outs/deep_research_platynereis/family_C/).

Step 4: Ask Output Format

Use AskUserQuestion:

Both (Recommended) — PDF for sharing, HTML for browsing with clickable DOIs
PDF only — LaTeX-rendered, good for sharing/printing
HTML only — Standalone, nice table styling, clickable links

Step 5: Generate Outputs

Create a temporary {base}_report_for_render.md from {base}_clean.md:

Strip the data YAML header — the large query:/annotation:/markers: (or classification:/functional_categories:) block is for programmatic parsing only, not for display
Insert a pandoc formatting YAML. The title format depends on report type:
- Cell-type annotation: "Gene Module Interpretation: *{organism}* {module_id}"
- Family annotation: "Cell Type Family Report: *{organism}* Family {family}"
- Nonmetazoan: "Non-Metazoan Gene Characterization: *{organism}* — {kingdom}"
```
---
title: "<title per report type>"
subtitle: "Deep Research Report — {Platform} ({date_generated})"
geometry: margin=1in
fontsize: 11pt
header-includes:
  - \renewcommand{\arraystretch}{1.4}
  - \usepackage{booktabs}
  - \usepackage{longtable}
---
```
The \arraystretch{1.4} adds 40% vertical padding to every table row — critical for readability.
Remove the H1 title from the body (it's now in the YAML title: field)
Normalize heading levels — ChatGPT reports nest headings too deep, producing tiny headers at H4/H5 in PDFs. Apply these fixes in order:

a. Remove redundant heading pairs. ChatGPT often outputs a generic ## heading immediately followed by the lettered ### section (e.g., ## Comparative biological analysis then ### H. Comparative Biological Analysis). Remove the redundant ## line:
```
body = re.sub(r'^##\s+[^\n]+\n\n(###\s+[A-L]\.)', r'\1', body, flags=re.MULTILINE)
```
b. Shift all headings up one level. Since the H1 title was removed, the remaining hierarchy is too deep (##→####). Shift ##→#, ###→##, ####→###, etc. (never shift H1):
```
def shift_heading(m):
    hashes = m.group(1)
    rest = m.group(2)
    if len(hashes) > 1:
        return '#' * (len(hashes) - 1) + rest
    return m.group(0)
body = re.sub(r'^(#{2,})([ \t].*)$', shift_heading, body, flags=re.MULTILINE)
```
After normalization, the typical hierarchy is: # (sections A–K), ## (subsections like "Core identity modules"), ### (sub-subsections).
Escape backslashes in gene names for LaTeX. ChatGPT reports sometimes include gene names with literal backslashes (e.g., Dmel\cg5579), which LaTeX interprets as control sequences. Escape them in the body only:
```
body = re.sub(r'\\(?=[a-zA-Z])', r'\\\\', body)
```
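To see the three body fixes working together, they can be chained on a small made-up sample. The wrapper normalize_body is a hypothetical name; the regular expressions are the ones given above, and the heading shifter is simplified because its pattern only ever matches two or more hashes.

```python
import re


def shift_heading(m):
    hashes, rest = m.group(1), m.group(2)
    return '#' * (len(hashes) - 1) + rest


def normalize_body(body):
    """Apply the three body fixes in the order given above."""
    body = re.sub(r'^##\s+[^\n]+\n\n(###\s+[A-L]\.)', r'\1', body, flags=re.MULTILINE)
    body = re.sub(r'^(#{2,})([ \t].*)$', shift_heading, body, flags=re.MULTILINE)
    return re.sub(r'\\(?=[a-zA-Z])', r'\\\\', body)


sample = (
    "## Comparative biological analysis\n\n"
    "### H. Comparative Biological Analysis\n\n"
    "#### Core identity modules\n\n"
    "Marker Dmel\\cg5579 is enriched.\n"
)
print(normalize_body(sample))
```

On this sample the generic ## line is dropped, the lettered section ends up at ##, the subsection at ###, and the gene name carries a doubled backslash.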

Generate PDF:

{quarto_path} pandoc \
  "{base}_report_for_render.md" \
  -o "{base}_report.pdf" \
  --pdf-engine=xelatex \
  -V colorlinks=true -V linkcolor=blue -V urlcolor=blue

When running on a compute cluster, prepend module load texlive && before the quarto command.

Generate HTML:

{quarto_path} pandoc \
  "{base}_report_for_render.md" \
  -o "{base}_report.html" \
  --standalone \
  --css="$HOME/.claude/skills/deep-research-reports/templates/report-style.css" \
  --embed-resources

The --embed-resources flag inlines the CSS so the HTML is fully self-contained.

Delete {base}_report_for_render.md after successful conversion.

If pandoc fails (e.g., LaTeX error), fall back to HTML only and notify the user.
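The two pandoc invocations and the HTML fallback could be driven from Python along these lines. This is a sketch, not the skill's implementation: pandoc_cmd and render are hypothetical names, the run parameter exists only so the logic can be exercised without quarto installed, and a missing quarto binary (FileNotFoundError) is not handled.

```python
import subprocess


def pandoc_cmd(quarto_path, base, fmt, css=None):
    """Build the argument list for one output format ('pdf' or 'html')."""
    cmd = [quarto_path, 'pandoc', f'{base}_report_for_render.md',
           '-o', f'{base}_report.{fmt}']
    if fmt == 'pdf':
        cmd += ['--pdf-engine=xelatex', '-V', 'colorlinks=true',
                '-V', 'linkcolor=blue', '-V', 'urlcolor=blue']
    else:
        cmd += ['--standalone', f'--css={css}', '--embed-resources']
    return cmd


def render(quarto_path, base, formats, css, run=subprocess.run):
    """Render each requested format; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for fmt in formats:
        result = run(pandoc_cmd(quarto_path, base, fmt, css),
                     capture_output=True, text=True)
        (succeeded if result.returncode == 0 else failed).append(fmt)
    if 'pdf' in failed and 'html' not in formats:
        # PDF was the only request and it failed: fall back to HTML.
        result = run(pandoc_cmd(quarto_path, base, 'html', css),
                     capture_output=True, text=True)
        (succeeded if result.returncode == 0 else failed).append('html')
    return succeeded, failed
```

The failed list gives the caller what it needs to notify the user that the PDF could not be produced.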

Step 6: Update Summary Table

Extract fields from validated YAML and update outs/deep_research/annotation_summary.tsv.

Composite key: module_id + platform + date_generated. This allows the table to hold both ChatGPT and Claude annotations for the same cluster side by side.

Logic:

Read existing TSV if it exists (using Python csv with delimiter='\t')
Check if a row with matching composite key exists
If yes: show old vs new proposed_name, ask to update or skip
If no: append new row
Sort by module_id, then platform
Write back TSV
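The read, match, append, sort, write cycle can be sketched with the standard csv module. The helper name upsert_row and its overwrite flag are assumptions made for illustration; in the skill, the decision to overwrite comes from asking the user after showing the old and new proposed_name.

```python
import csv
import os

KEY = ('module_id', 'platform', 'date_generated')


def upsert_row(path, new_row, overwrite=False):
    """Insert or update new_row in the TSV; return 'added', 'updated', or 'skipped'."""
    rows, fields = [], list(new_row)
    if os.path.exists(path):
        with open(path, newline='', encoding='utf-8') as fh:
            reader = csv.DictReader(fh, delimiter='\t')
            rows = list(reader)
            existing = list(reader.fieldnames or [])
            fields = existing + [f for f in new_row if f not in existing]
    key = tuple(str(new_row.get(k, '')) for k in KEY)
    match = next((r for r in rows if tuple(r.get(k) or '' for k in KEY) == key), None)
    if match is not None and not overwrite:
        return 'skipped'  # caller shows old vs new and asks the user first
    values = {k: str(v) for k, v in new_row.items()}
    if match is not None:
        match.update(values)
        status = 'updated'
    else:
        rows.append(values)
        status = 'added'
    rows.sort(key=lambda r: (r.get('module_id') or '', r.get('platform') or ''))
    with open(path, 'w', newline='', encoding='utf-8') as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, delimiter='\t', restval='')
        writer.writeheader()
        writer.writerows(rows)
    return status
```

Columns that older rows lack are written as empty strings, which matches the default for fields absent in older reports.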

Columns — parse ALL YAML fields into the summary table:

Field mapping from YAML:

Fields may be absent in older reports — default to empty string.

Family report field mappings: For family reports, the summary table uses the same annotation_summary.tsv but with these field mappings:

All other columns map the same as cluster reports. The per_cluster annotations are NOT added to the summary table row — they are available in the clean markdown file for detailed parsing.

Step 6b: Nonmetazoan Summary Table

Composite key: kingdom + platform + date_generated

Columns:

Step 7: Report Results

Cell-type-annotation reports:

Processed: clade6sub25_annotation_report.md
  Platform: Claude
  Renamed: clade6sub25/260304_claude_clade6sub25_clean.md
  PDF: outs/deep_research/clade6sub25/260304_claude_clade6sub25_report.pdf
  HTML: outs/deep_research/clade6sub25/260304_claude_clade6sub25_report.html
  Summary: annotation_summary.tsv (new row added)

  Annotation: "hemocyte-like immune/scavenger cells (GCM+ phagocytes)"
  Confidence: medium
  Cell type family: immune/scavenger (hemocyte-like)

Nonmetazoan characterization reports:

Processed: deep-research-report (24).md
  Platform: ChatGPT
  Report type: nonmetazoan_characterization (prokaryote variant)
  Cleaned: prokaryote/260315_chatgpt_prokaryote_clean.md
  PDF: prokaryote/260315_chatgpt_prokaryote_report.pdf
  HTML: prokaryote/260315_chatgpt_prokaryote_report.html
  Summary: nonmetazoan_summary.tsv (new row added)

  Kingdom: Prokaryote (369 genes, 178 expressed)
  Classification: 310 symbiont, 12 HGT, 25 conserved, 22 ambiguous
  Confidence: low
  Top HGT candidates: c102759-g4 (Protein-ADP-ribose hydrolase), c101192-g1 (Deubiquitinase)

Batch Mode

When processing multiple reports (via "all" or "new"), the skill:

Processes each report sequentially through Steps 2–6
Asks the output format question once (applies to all reports)
At the end, shows a summary table of all processed reports
Rebuilds the summary TSV from all clean files to ensure consistency

Notes

The _clean.md files are the archival versions — they retain the YAML header for parsing and are the source of truth for the summary table.
The prompt file ({module_id}_prompt.md) is shared across platforms and dates — it doesn't get the date/platform prefix.
If a report has no YAML front matter at all, report the error clearly and skip that file.
Platform detection from artifacts is deterministic: PUA chars are unique to ChatGPT's export. If future platforms introduce new artifacts, add detection patterns to templates/cleaning-patterns.md.
