skills/calliope-tts-based-narrated-e-book/SKILL.md
Build offline TTS-narrated e-books with exact audio-text synchronization in EPUB 3 Media Overlay format. Use when the user asks to 'create a narrated ebook', 'add TTS audio to an epub', 'build an audiobook with text highlighting', 'synchronize speech with ebook text', 'convert epub to read-aloud format', or 'generate media overlays for epub'.
This skill enables Claude to help users build offline pipelines that convert standard text EPUB files into narrated e-books with synchronized text highlighting. The core technique, from the Calliope framework (arXiv:2602.10735), captures audio timestamps during TTS synthesis rather than aligning them after the fact, achieving zero drift between spoken audio and on-screen text highlighting. This eliminates the synchronization errors (up to 30+ seconds of drift) that plague forced-alignment approaches like Afaligner or Whisper-based tools.
Deterministic timestamp capture during synthesis, not post-hoc alignment. Most narrated e-book tools generate audio first, then use forced alignment (Dynamic Time Warping or Whisper transcription) to match audio segments back to source text. This introduces cumulative drift: experiments show Storyteller (Whisper-based) drifts over 30 seconds on a short story, and even the best aligner (Syncabook) exceeds the 50ms perceptual lag threshold for over 30% of sentences. Humans tolerate text appearing up to 150ms early but become sensitive to lags beyond just 50ms.
Calliope sidesteps this entirely. Each sentence is synthesized individually through the TTS engine, and its exact duration is recorded at generation time. Timestamps are then computed deterministically: a sentence's start time equals the cumulative duration of all preceding sentences (plus silence padding), and its end time extends to the next sentence's start. This "gapless" contiguity prevents highlight flicker during inter-sentence pauses. The per-sentence waveforms receive a ~50ms fade-out to eliminate boundary click artifacts, and configurable silence padding (default 0.15s) is inserted between sentences.
EPUB 3 Media Overlay packaging. The pipeline wraps each sentence in a <span> with a unique ID, generates per-chapter SMIL files mapping those IDs to clipBegin/clipEnd timestamps, updates the OPF manifest with media-overlay attributes and media:duration metadata, and injects CSS for media:active-class highlighting that adapts to light/dark mode via prefers-color-scheme media queries. The result is a standards-compliant EPUB 3 readable in Thorium Reader, BookFusion, and similar apps.
Parse the EPUB container. Use EbookLib to extract the OPF package document, read the spine order, and enumerate all XHTML content documents and their linked CSS stylesheets.
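To make the parsing step concrete, here is a minimal sketch of what it amounts to, written against the Python standard library rather than EbookLib (a substitution made only so the example is self-contained). The function name `spine_documents` is hypothetical. It locates the OPF through META-INF/container.xml and returns the content documents in spine order.

```python
import xml.etree.ElementTree as ET
import zipfile
from posixpath import dirname, join

# Namespaces used by the OCF container file and the OPF package document
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_file):
    """Return archive paths of the content documents in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Map manifest ids to hrefs, then resolve each spine itemref through it
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = dirname(opf_path)  # hrefs are relative to the OPF location
    return [
        join(base, manifest[ref.get("idref")])
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

The spine, not the manifest, determines reading order, which is why the manifest is only used as a lookup table here.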
Traverse each XHTML chapter's DOM. Using BeautifulSoup4, walk block-level elements (paragraphs, headers) to extract narrative text. Skip non-narrative elements like image captions, tables, and metadata blocks.
Segment text into sentences. Apply Unicode canonicalization (NFC), standardize punctuation (curly quotes to straight, em-dashes to standard), then tokenize into sentences. Assign each sentence a unique span ID following the pattern calliope-s{chapter}-{index}.
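The normalization and ID assignment described above could be sketched as follows. The regex splitter is a deliberately naive stand-in, since the tokenizer is not specified here, and the em-dash replacement (a plain hyphen) is an assumption. The names `normalize` and `segment` are hypothetical.

```python
import re
import unicodedata

# Curly quotes become straight quotes. Mapping the em-dash (U+2014) to a
# plain hyphen is an assumption; substitute whatever the TTS engine prefers.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2014": "-",
})


def normalize(text):
    """NFC-normalize, standardize punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def segment(text, chapter):
    """Split text into (span_id, sentence) pairs with zero-based indices."""
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", normalize(text)) if s]
    return [(f"calliope-s{chapter}-{i}", s) for i, s in enumerate(sentences)]
```

A production pipeline would likely swap the regex for a proper sentence tokenizer, since abbreviations such as "Mr." defeat this rule.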
Handle TTS input constraints. Recursively split sentences exceeding 200 characters at whitespace midpoints. Merge fragments under 60 characters with adjacent sentences. Wrap a try/except around synthesis to catch token-overflow errors and trigger recursive splitting with concatenated output.
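A minimal sketch of the splitting, merging, and overflow fallback follows. The 200 and 60 character thresholds come from the step above; everything else is an assumption: the function names, the forward-merge policy, the `synthesize` callable (a stand-in for the real TTS call), and `ValueError` as the overflow exception.

```python
MAX_CHARS = 200
MIN_CHARS = 60


def split_long(sentence, max_chars=MAX_CHARS):
    """Recursively split at the whitespace nearest the midpoint."""
    if len(sentence) <= max_chars:
        return [sentence]
    spaces = [i for i, ch in enumerate(sentence) if ch == " "]
    if not spaces:
        return [sentence]  # no whitespace to split on; leave as-is
    mid = len(sentence) // 2
    cut = min(spaces, key=lambda i: abs(i - mid))
    left, right = sentence[:cut].strip(), sentence[cut + 1:].strip()
    parts = split_long(left, max_chars) + split_long(right, max_chars)
    return [p for p in parts if p]


def merge_short(sentences, min_chars=MIN_CHARS):
    """Merge fragments under min_chars into a neighbour (assumed policy)."""
    merged = []
    for s in sentences:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = f"{merged[-1]} {s}"  # short fragment absorbs the next
        else:
            merged.append(s)
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        last = merged.pop()  # a short trailing fragment merges backwards
        merged[-1] = f"{merged[-1]} {last}"
    return merged


def synthesize_with_fallback(text, synthesize, overflow_error=ValueError):
    """Synthesize text; on overflow, split once more and concatenate."""
    try:
        return list(synthesize(text))
    except overflow_error:
        parts = split_long(text, max_chars=len(text) - 1)  # force one split
        if len(parts) < 2:
            raise  # nothing left to split on
        waveform = []
        for part in parts:
            waveform.extend(synthesize_with_fallback(part, synthesize, overflow_error))
        return waveform
```

Because the fallback returns one concatenated waveform, the sentence's recorded duration is automatically the sum of its fragments' durations.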
Synthesize audio per sentence with timestamp capture. Feed each sentence to the TTS engine (XTTS-v2 or Chatterbox) with the reference voice WAV. Record the exact duration of each generated waveform. Apply a 50ms fade-out filter to the tail of each waveform to eliminate boundary artifacts. Insert silence padding (0.15s default) between sentences.
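The fade-out and padding can be illustrated without any TTS engine. In this sketch, plain lists of float samples stand in for the arrays a real engine would return, and the exponential decay constant of 5 is an assumption (the 50ms window and 0.15s padding are the values stated above). `fade_out` and `assemble` are hypothetical names.

```python
import math


def fade_out(samples, sample_rate, fade_ms=50.0):
    """Apply an exponential fade-out to the last fade_ms of a waveform."""
    n = min(len(samples), int(sample_rate * fade_ms / 1000))
    out = list(samples)
    for k in range(n):
        # Gain decays from just under 1.0 down to exp(-5) at the final sample
        gain = math.exp(-5.0 * (k + 1) / n)
        out[len(out) - n + k] *= gain
    return out


def assemble(waveforms, sample_rate, padding_s=0.15):
    """Concatenate faded waveforms with silence; return (audio, durations)."""
    silence = [0.0] * int(round(sample_rate * padding_s))
    audio, durations = [], []
    for w in waveforms:
        durations.append(len(w) / sample_rate)  # exact duration at generation time
        audio.extend(fade_out(w, sample_rate))
        audio.extend(silence)
    return audio, durations
```

The durations are taken from the sample count before padding is added, so they stay exact regardless of encoding applied later.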
Compute deterministic timestamps. For sentence i: start_time = sum(durations[0..i-1] + padding[0..i-1]), end_time = start_time_of_sentence[i+1]. The final sentence's end equals total chapter audio length. This gapless scheme prevents highlight flicker.
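The timestamp rule above reduces to a running sum. A minimal sketch, assuming a uniform padding value and that the chapter audio includes padding after the final sentence (`gapless_timestamps` is a hypothetical name):

```python
def gapless_timestamps(durations, padding=0.15):
    """Return (start, end) pairs where each end equals the next start."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d + padding  # cumulative duration plus silence padding
    ends = starts[1:] + [t]  # the final end is the total chapter audio length
    return [(round(s, 3), round(e, 3)) for s, e in zip(starts, ends)]


print(gapless_timestamps([2.31, 2.45, 2.12]))
# [(0.0, 2.46), (2.46, 5.06), (5.06, 7.33)]
```

No audio analysis is involved, so the result is fully determined by the recorded durations.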
Concatenate chapter audio. Join all per-sentence waveforms (with padding) into a single WAV/MP3 per chapter. Use FFmpeg for final encoding if MP3 output is desired.
Inject span tags into XHTML. Replace each sentence's text node with a <span id="calliope-s{ch}-{idx}"> wrapper, preserving all parent element attributes and CSS inheritance.
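As an illustration of the span injection, the sketch below handles only the simplest case: a paragraph that contains plain text and no inline children such as `<em>`. It uses the standard library's ElementTree rather than BeautifulSoup4 to stay self-contained, and `wrap_sentences` is a hypothetical name. In a namespaced XHTML document the `span` tag would need its qualified name.

```python
import xml.etree.ElementTree as ET


def wrap_sentences(paragraph, sentences, chapter, start_index=0):
    """Replace a text-only paragraph's content with one <span> per sentence."""
    paragraph.text = None  # the original text moves into the spans
    for offset, sentence in enumerate(sentences):
        span = ET.SubElement(paragraph, "span")
        span.set("id", f"calliope-s{chapter}-{start_index + offset}")
        span.text = sentence
        span.tail = " "  # keep inter-sentence whitespace outside the spans
    if len(paragraph):
        paragraph[-1].tail = None  # no trailing space after the last span
    return paragraph


p = ET.fromstring('<p class="body">It was the best of times. It was the worst of times.</p>')
wrap_sentences(p, ["It was the best of times.", "It was the worst of times."], chapter=1)
print(ET.tostring(p, encoding="unicode"))
```

The parent element and its attributes are left untouched, so existing CSS rules keep applying to the wrapped text.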
Generate SMIL files. For each chapter, produce a SMIL file containing <seq> of <par> elements, each with a <text src="chapter.xhtml#calliope-s1-3"/> and <audio src="chapter_audio.mp3" clipBegin="12.450s" clipEnd="15.230s"/>.
Update OPF and inject CSS. Add SMIL items to the manifest with media-overlay links. Set media:duration metadata per overlay and for the total publication. Inject active-class CSS: yellow highlight (#fff3a8) for light mode, muted purple (#4a3a6b) for dark mode, using @media (prefers-color-scheme: dark).
Repackage the EPUB. Recompress all modified XHTML, new SMIL files, updated OPF, injected CSS, and audio files into a valid EPUB 3 OCF zip container.
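A minimal repackaging sketch using the standard library's `zipfile` follows. It assumes the modified files sit in an unpacked directory that already contains a `mimetype` file, and that the output path lies outside that directory. `package_epub` is a hypothetical name. The OCF container format requires `mimetype` to be the first entry and stored uncompressed, which is the one detail that is easy to get wrong.

```python
import zipfile
from pathlib import Path


def package_epub(source_dir, output_path):
    """Zip an unpacked EPUB directory into an OCF container."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(output_path, "w") as zf:
        # The mimetype entry must come first and must not be compressed
        zf.write(source_dir / "mimetype", "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for path in sorted(source_dir.rglob("*")):
            arcname = path.relative_to(source_dir).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Running a validator such as EPUBCheck on the result is a sensible final step, since readers vary in how strictly they enforce the container rules.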
Example 1: Convert a public-domain EPUB to a narrated e-book
User: "I have tale_of_two_cities.epub and a 15-second WAV of a narrator voice. Help me set up a pipeline to create a narrated version with text highlighting."
Approach:
tale_of_two_cities_narrated.epub.
Output structure:
tale_of_two_cities_narrated.epub
├── META-INF/container.xml
├── OEBPS/
│ ├── content.opf # Updated with media-overlay refs
│ ├── chapter01.xhtml # Sentences wrapped in <span> tags
│ ├── chapter01_overlay.smil # SMIL timing file
│ ├── audio/chapter01.mp3 # Synthesized narration
│ ├── styles/highlight.css # Active-class highlight styles
│ └── ...
└── mimetype
Example 2: Generate a SMIL file from pre-computed timestamps
User: "I already have sentence-level audio durations in a JSON file. Help me generate the SMIL overlay file."
Approach:
Input (durations.json):
[
  {"id": "s1-0", "text": "It was the best of times.", "duration": 2.31, "padding": 0.15},
  {"id": "s1-1", "text": "It was the worst of times.", "duration": 2.45, "padding": 0.15},
  {"id": "s1-2", "text": "It was the age of wisdom.", "duration": 2.12, "padding": 0.15}
]
Output (chapter01_overlay.smil):
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <seq>
      <par>
        <text src="chapter01.xhtml#s1-0"/>
        <audio src="audio/chapter01.mp3" clipBegin="0.000s" clipEnd="2.460s"/>
      </par>
      <par>
        <text src="chapter01.xhtml#s1-1"/>
        <audio src="audio/chapter01.mp3" clipBegin="2.460s" clipEnd="5.060s"/>
      </par>
      <par>
        <text src="chapter01.xhtml#s1-2"/>
        <audio src="audio/chapter01.mp3" clipBegin="5.060s" clipEnd="7.330s"/>
      </par>
    </seq>
  </body>
</smil>
Note: each clipEnd extends to the next sentence's clipBegin (gapless), not to the raw audio end. This prevents highlight flicker during silence gaps.
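One way to produce that output from the JSON input is sketched below. The function name `build_smil` is hypothetical, and the IDs and hrefs are assumed to contain no characters that need XML escaping.

```python
import json


def build_smil(entries, text_href, audio_href):
    """Render a SMIL overlay with gapless clipBegin/clipEnd values."""
    lines = [
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">',
        "  <body>",
        "    <seq>",
    ]
    t = 0.0
    for e in entries:
        # Each clip runs from the running total to the next sentence's start
        begin, t = t, t + e["duration"] + e["padding"]
        lines += [
            "      <par>",
            f'        <text src="{text_href}#{e["id"]}"/>',
            f'        <audio src="{audio_href}" clipBegin="{begin:.3f}s" clipEnd="{t:.3f}s"/>',
            "      </par>",
        ]
    lines += ["    </seq>", "  </body>", "</smil>"]
    return "\n".join(lines)


durations_json = """[
  {"id": "s1-0", "text": "It was the best of times.", "duration": 2.31, "padding": 0.15},
  {"id": "s1-1", "text": "It was the worst of times.", "duration": 2.45, "padding": 0.15},
  {"id": "s1-2", "text": "It was the age of wisdom.", "duration": 2.12, "padding": 0.15}
]"""
smil = build_smil(json.loads(durations_json), "chapter01.xhtml", "audio/chapter01.mp3")
print(smil)
```

Formatting with three decimal places gives millisecond resolution, well below the 50ms lag threshold mentioned earlier.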
Example 3: Add highlight CSS with dark mode support
User: "I need CSS for the media overlay active class that works in both light and dark reading modes."
Output (highlight.css):
/* Light mode: warm yellow highlight */
.-epub-media-overlay-active {
  background-color: #fff3a8;
  color: #1a1a1a;
}
/* Dark mode: muted purple highlight */
@media (prefers-color-scheme: dark) {
  .-epub-media-overlay-active {
    background-color: #4a3a6b;
    color: #e8e8e8;
  }
}
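The selector .-epub-media-overlay-active targets the class name -epub-media-overlay-active, which is declared via the media:active-class metadata property in the OPF file. The metadata value is the bare class name, without the leading dot.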
clipEnd boundaries (each sentence's end = next sentence's start) to prevent highlight flicker during silence padding.

| Problem | Cause | Solution |
|---------|-------|----------|
| TTS token overflow | Sentence exceeds model's context window | Recursively split at whitespace midpoint, synthesize fragments, concatenate waveforms and sum durations |
| Very short fragments (<60 chars) | Over-aggressive sentence splitting | Merge with adjacent sentence before synthesis to avoid unnatural prosody breaks |
| Audio click artifacts at boundaries | Abrupt waveform termination | Apply 50ms exponential fade-out to each sentence waveform tail |
| EPUB validation failure | Missing SMIL references in OPF | Ensure every media-overlay attribute in the manifest points to an existing SMIL item, and all media:duration values are present |
| Highlight not appearing in reader | Missing or wrong active-class CSS | Verify media:active-class in OPF metadata matches the CSS class name exactly (the OPF value is the bare class name; the leading dot belongs only in the CSS selector) |
| Unicode text garbling | Mixed encodings in source EPUB | Apply NFC Unicode canonicalization and normalize punctuation (curly quotes, em-dashes) before sentence tokenization |
--gpu flag) is strongly recommended for production use.