.claude/skills/_archive/telegram-voice-pipeline/SKILL.md
End-to-end voice message pipeline for Telegram — download OGG attachment, transcribe with Whisper, generate a text response, convert to MP3 via ElevenLabs TTS, and reply with the audio file.
npx skillsauth add oimiragieo/agent-studio telegram-voice-pipeline
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill handles Telegram voice messages end-to-end:
- Detects attachment_file_id in the channel tag
- Transcribes the audio with Whisper (transcribe-anything)

Invoke it with:
Skill({ skill: 'telegram-voice-pipeline' });
Invoke when:
- The channel tag has an attachment_file_id attribute (voice or audio message)

Telegram voice messages arrive as channel tags:
<channel source="telegram" chat_id="123456" message_id="789" user="username" ts="1234567890" attachment_file_id="BQACAgIAAxkBAAIBc2...">
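To make the tag format concrete, here is a minimal sketch of pulling the attributes out of an opening channel tag and choosing a handler. The function names and the regex approach are illustrative assumptions, not part of the skill; the file ID in the example is a placeholder.

```python
import re

# Matches key="value" pairs inside the opening <channel ...> tag.
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_channel_tag(tag: str) -> dict:
    """Return the attributes of a channel tag as a plain dict."""
    return dict(ATTR_PATTERN.findall(tag))


def route_message(attrs: dict) -> str:
    """Hypothetical router: pick a handler name from the attributes present."""
    if attrs.get("attachment_file_id"):
        return "voice"  # voice/audio message: run this skill
    if attrs.get("image_path"):
        return "photo"  # photo: use image handling instead
    return "text"       # neither attribute: ordinary text message


example = (
    '<channel source="telegram" chat_id="123456" message_id="789" '
    'user="username" ts="1234567890" attachment_file_id="FILE_ID_PLACEHOLDER">'
)
attrs = parse_channel_tag(example)
print(route_message(attrs), attrs["chat_id"])
```

A regex is enough here because only the attributes of a single opening tag are needed; a full XML parser would reject the tag on its own, since it has no closing element.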
Key detection logic:
- attachment_file_id present → voice/audio message → invoke this skill
- image_path present → photo → use image handling instead

Call the Telegram MCP download tool with the file_id from the channel tag:
// MCP tool call (agent uses this directly)
mcp__telegram-relay__download_attachment({ file_id: '<attachment_file_id>' });
// Returns: local file path, e.g. /tmp/voice_abc123.ogg
Verify: The returned path exists and is non-empty (> 1KB for a real voice message).
Error handling: If download fails, reply with a text message: "Sorry, I couldn't download your voice message. Please try again."
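The verify and error-handling rules for the download step could be sketched as follows. The helper name is hypothetical, and treating an undersized file the same way as a failed download is an assumption rather than something the skill states.

```python
import os
from typing import Optional

MIN_VOICE_BYTES = 1024  # the "> 1KB" threshold from the verify rule
DOWNLOAD_ERROR_REPLY = (
    "Sorry, I couldn't download your voice message. Please try again."
)


def check_download(path: Optional[str]) -> Optional[str]:
    """Return None if the file looks usable, else the text reply to send."""
    if not path or not os.path.isfile(path):
        return DOWNLOAD_ERROR_REPLY
    # Assumption: a file at or below 1KB is treated as a failed download.
    if os.path.getsize(path) <= MIN_VOICE_BYTES:
        return DOWNLOAD_ERROR_REPLY
    return None
```

Returning the reply text instead of raising keeps the caller simple: a non-None result is sent to the user as a plain text message and the pipeline stops.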
Install transcribe-anything if not present:
pip install transcribe-anything
Run transcription:
transcribe-anything /tmp/voice_abc123.ogg --model medium --output_dir /tmp/tg_voice/
Read the transcript:
cat /tmp/tg_voice/voice_abc123.txt
Model selection (trade-off between speed and accuracy):
| Model | Speed | Accuracy | Use when |
| ---------- | ----- | -------- | ------------------------------ |
| tiny | ~2s | Low | Rapid prototyping only |
| small | ~5s | Medium | Short messages, speed priority |
| medium | ~12s | High | Default — best balance |
| large-v3 | ~30s | Best | Long/complex messages |
Override via env: WHISPER_MODEL=small (default: medium)
Verify: /tmp/tg_voice/<filename>.txt exists and is non-empty.
Error handling: If transcription fails or output is empty, reply: "I received your voice message but couldn't transcribe it. Could you try sending it again or type your message?"
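For callers that drive the transcription from Python, the step might look like the sketch below. It uses only the CLI flags shown above and assumes, as the commands in this step do, that the transcript is written to the output directory as the input's base name plus .txt. The function names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def build_transcribe_command(audio_path: str, output_dir: str) -> list:
    """Build the CLI call as an argument list (no shell involved)."""
    model = os.environ.get("WHISPER_MODEL", "medium")
    return [
        "transcribe-anything", audio_path,
        "--model", model,
        "--output_dir", output_dir,
    ]


def transcribe(audio_path: str, output_dir: str = "/tmp/tg_voice/") -> str:
    """Run the CLI and return the transcript, or "" on any failure."""
    cmd = build_transcribe_command(audio_path, output_dir)
    try:
        subprocess.run(cmd, check=True, shell=False)
    except (subprocess.CalledProcessError, FileNotFoundError):
        return ""  # CLI failed or is not installed
    # Assumption: transcript is <output_dir>/<input stem>.txt, as in this step.
    transcript = Path(output_dir) / (Path(audio_path).stem + ".txt")
    if not transcript.is_file():
        return ""
    return transcript.read_text(encoding="utf-8").strip()
```

An empty return value maps directly onto the error-handling rule: the caller replies with the "couldn't transcribe it" message instead of continuing.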
Use the transcribed text as the user input. Generate a text response using the agent's normal response logic.
transcribed_text = contents of /tmp/tg_voice/<filename>.txt
response_text = <agent's generated response to transcribed_text>
Guard max length for TTS: response_text[:4000] (ElevenLabs limit) or response_text[:4096] (OpenAI TTS limit).
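A small helper can apply the length guard per provider. The sketch below uses the two limits quoted above; trimming back to a sentence boundary is an optional refinement added here for illustration, not something the skill requires, and the helper name is hypothetical.

```python
# Character limits quoted in this step.
TTS_CHAR_LIMITS = {"elevenlabs": 4000, "openai": 4096}


def clamp_for_tts(text: str, provider: str) -> str:
    """Truncate text to the provider's limit, ending on a sentence if possible."""
    limit = TTS_CHAR_LIMITS[provider]
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Optional refinement (assumption): if a sentence ends within the last
    # 200 characters, stop there so the audio does not end mid-sentence.
    last_stop = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    if last_stop >= limit - 200:
        return cut[: last_stop + 1]
    return cut
```

The result is never longer than a plain slice to the limit, so it is safe to use wherever the slice would be used.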
ElevenLabs (requires ELEVENLABS_API_KEY):
import os
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
audio = client.text_to_speech.convert(
    text=response_text[:4000],  # stay within the ElevenLabs length limit
    # Default voice is George (clear, neutral); ELEVENLABS_VOICE_ID overrides it
    voice_id=os.environ.get("ELEVENLABS_VOICE_ID", "JBFqnCBsd6RMkjVDRZzb"),
    model_id="eleven_turbo_v2",
    output_format="mp3_44100_128",
)
output_path = "/tmp/tg_voice_response.mp3"
# convert() yields the audio as byte chunks; write them out in order
with open(output_path, "wb") as f:
    for chunk in audio:
        f.write(chunk)
print(f"TTS written to {output_path}")
Override voice via env: ELEVENLABS_VOICE_ID=<voice_id> (default: JBFqnCBsd6RMkjVDRZzb)
OpenAI TTS fallback (OPENAI_API_KEY set, no ELEVENLABS_API_KEY):
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Stream the synthesized speech straight to disk
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input=response_text[:4096],  # stay within the OpenAI TTS length limit
) as response:
    response.stream_to_file("/tmp/tg_voice_response.mp3")
print("TTS written to /tmp/tg_voice_response.mp3")
Selection logic:
import os

if os.environ.get("ELEVENLABS_API_KEY"):
    tts_provider = "elevenlabs"  # Use ElevenLabs
else:
    tts_provider = "openai"  # Use OpenAI TTS fallback
Verify: /tmp/tg_voice_response.mp3 exists and is > 1KB.
Error handling: If TTS fails, send response_text as a plain text reply instead of audio.
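Selection, verification, and the text fallback can be combined in one wrapper. In this sketch the provider functions are passed in as callables so the logic can be exercised without either SDK; the names pick_tts_provider and synthesize_or_none are hypothetical.

```python
import os
from typing import Callable, Mapping, Optional


def pick_tts_provider(env: Mapping = os.environ) -> Optional[str]:
    """ElevenLabs if its key is set, else OpenAI, else no TTS at all."""
    if env.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


def synthesize_or_none(
    text: str,
    synthesizers: Mapping,  # provider name -> callable(text, output_path)
    output_path: str = "/tmp/tg_voice_response.mp3",
    env: Mapping = os.environ,
) -> Optional[str]:
    """Return the MP3 path on success, or None to signal a text-only reply."""
    provider = pick_tts_provider(env)
    if provider is None:
        return None
    try:
        synthesizers[provider](text, output_path)
    except Exception:
        return None  # any TTS failure falls back to plain text
    # Verify rule from this step: the file exists and is larger than 1KB.
    if not os.path.isfile(output_path) or os.path.getsize(output_path) <= 1024:
        return None
    return output_path
```

When the wrapper returns None, the caller sends response_text on its own, which matches the error-handling rule above.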
// MCP tool call
mcp__telegram-relay__reply({
  chat_id: '<chat_id from channel tag>',
  text: response_text, // Also send the transcript so user can read it
  files: ['/tmp/tg_voice_response.mp3'],
});
Note: Including text alongside the audio file gives the user both a readable transcript and the audio reply — useful for accessibility and noisy environments.
Verify: No error returned from the reply tool.
After a successful reply, clean up to avoid disk accumulation:
rm -f /tmp/tg_voice_response.mp3
rm -rf /tmp/tg_voice/
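If the pipeline is driven from Python rather than the shell, the same cleanup might be written as below. The function name is hypothetical; the default paths are the ones used throughout this skill.

```python
import shutil
from pathlib import Path


def cleanup(
    response_mp3: str = "/tmp/tg_voice_response.mp3",
    work_dir: str = "/tmp/tg_voice/",
) -> None:
    """Remove the TTS output and the transcription directory, ignoring misses."""
    Path(response_mp3).unlink(missing_ok=True)   # equivalent of rm -f
    shutil.rmtree(work_dir, ignore_errors=True)  # equivalent of rm -rf
```

Both calls tolerate missing paths, so the function is safe to run from a finally block even when an earlier step failed.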
| Variable | Required | Default | Purpose |
| --------------------- | ---------------------- | ---------------------- | ------------------------- |
| ELEVENLABS_API_KEY | NO (if OpenAI set) | — | ElevenLabs TTS API key |
| ELEVENLABS_VOICE_ID | NO | JBFqnCBsd6RMkjVDRZzb | ElevenLabs voice (George) |
| OPENAI_API_KEY | NO (if ElevenLabs set) | — | OpenAI TTS fallback key |
| WHISPER_MODEL | NO | medium | Whisper model size |
At least one of ELEVENLABS_API_KEY or OPENAI_API_KEY must be set for TTS to work.
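One way to read these variables in a single place is a small config loader that applies the defaults from the table and enforces the "at least one key" rule. VoiceConfig and load_config are hypothetical names used only for this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceConfig:
    elevenlabs_api_key: Optional[str]
    elevenlabs_voice_id: str
    openai_api_key: Optional[str]
    whisper_model: str


def load_config(env: Mapping = os.environ) -> VoiceConfig:
    """Read the variables from the table, applying its defaults."""
    cfg = VoiceConfig(
        elevenlabs_api_key=env.get("ELEVENLABS_API_KEY") or None,
        elevenlabs_voice_id=env.get("ELEVENLABS_VOICE_ID", "JBFqnCBsd6RMkjVDRZzb"),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        whisper_model=env.get("WHISPER_MODEL", "medium"),
    )
    # At least one TTS key must be present for audio replies to work.
    if not (cfg.elevenlabs_api_key or cfg.openai_api_key):
        raise RuntimeError("Set ELEVENLABS_API_KEY or OPENAI_API_KEY for TTS")
    return cfg
```

Failing at load time surfaces a missing key before any audio is downloaded or transcribed.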
[Telegram] User sends 10-second voice message
↓
[Agent] Detects attachment_file_id in channel tag
↓
[MCP] download_attachment(file_id) → /tmp/voice_abc123.ogg
↓
[Bash] transcribe-anything /tmp/voice_abc123.ogg --model medium → "What is the weather like today?"
↓
[Agent] Generates response: "I don't have real-time weather data, but I can help you check..."
↓
[Python] ElevenLabs TTS → /tmp/tg_voice_response.mp3
↓
[MCP] reply(chat_id, text="I don't have...", files=["/tmp/tg_voice_response.mp3"])
↓
[Telegram] User receives text + audio reply
↓
[Bash] rm /tmp/tg_voice_response.mp3 && rm -rf /tmp/tg_voice/
Total time for 10-second voice message: ~15-25 seconds (download 1s + transcribe 12s + TTS 2s + reply 1s).
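The whole flow, including the error replies from each step, can be summarized in a short orchestration sketch. Each stage is injected as a callable, so nothing here depends on the MCP tools or TTS SDKs directly; handle_voice_message and its parameters are hypothetical names.

```python
from typing import Callable, Optional

DOWNLOAD_FAILED = (
    "Sorry, I couldn't download your voice message. Please try again."
)
TRANSCRIBE_FAILED = (
    "I received your voice message but couldn't transcribe it. "
    "Could you try sending it again or type your message?"
)


def handle_voice_message(
    file_id: str,
    download: Callable[[str], Optional[str]],    # file_id -> local path or None
    transcribe: Callable[[str], str],            # path -> transcript ("" on failure)
    respond: Callable[[str], str],               # transcript -> response text
    synthesize: Callable[[str], Optional[str]],  # text -> mp3 path or None
) -> dict:
    """Return the reply payload as {"text": ..., "files": [...]}."""
    audio_path = download(file_id)
    if not audio_path:
        return {"text": DOWNLOAD_FAILED, "files": []}
    transcript = transcribe(audio_path)
    if not transcript:
        return {"text": TRANSCRIBE_FAILED, "files": []}
    response_text = respond(transcript)
    mp3_path = synthesize(response_text)
    # If TTS failed, fall back to a text-only reply.
    return {"text": response_text, "files": [mp3_path] if mp3_path else []}
```

The returned payload has the same shape as the arguments of the reply tool call, with chat_id supplied by the caller from the channel tag.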
- attachment_file_id is not a file path; it must be resolved via the MCP tool
- Do not use shell: true for subprocess calls in transcription; use array args with shell: false
- Photos carry image_path, not attachment_file_id; route them differently

- enable-telegram: Start the channel daemon for background Telegram monitoring
- transcription: Whisper transcription workflow (used in Step 2)
- tts-generation: ElevenLabs and OpenAI TTS (used in Step 4)

- MCP tools: mcp__telegram-relay__download_attachment, mcp__telegram-relay__reply
- transcribe-anything: https://github.com/modal-labs/transcribe-anything

For code discovery and search tasks, follow this priority order:
1. pnpm search:code "<query>" (Primary intent-based search).
2. ripgrep (for exact keyword/regex matches).

Before starting:
cat .claude/context/memory/learnings.md
cat .claude/context/memory/decisions.md
After completing:
- .claude/context/memory/learnings.md
- .claude/context/memory/issues.md
- .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.