skills/integrations/google/google-speech-to-text/SKILL.md
Transcribe audio files to text using Google Cloud Speech-to-Text API. Load when user mentions 'transcribe', 'speech to text', 'audio to text', 'transcribe audio', 'voice to text', 'transcription', or converting audio/recordings to text.
npx skillsauth add beam-ai-team/beam-next-skills google-speech-to-text
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Transcribe audio files (MP3, WAV, FLAC, etc.) to text using Google Cloud Speech-to-Text API. Supports short files (sync) and long files (batch via Cloud Storage).
python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py transcribe path/to/audio.mp3
python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py transcribe path/to/audio.mp3 --output transcript.txt
Output is automatically formatted into readable paragraphs. Use --no-format for raw output.
Default: Auto-detects German + English. Use --language to force a single language:
# Auto-detect (de-DE, en-US) - default
python3 ... transcribe audio.opus
# Force German
python3 ... transcribe audio.opus --language de-DE
# Force English
python3 ... transcribe audio.opus --language en-US
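The language behaviour described above can be sketched as a small helper. This is an illustrative sketch only, not the script's actual code: `resolve_languages` and `DEFAULT_LANGUAGES` are hypothetical names, and how the resulting list maps onto request fields depends on which Speech-to-Text API version the script uses.

```python
from typing import List, Optional

# Default candidates taken from the text above: German and English.
DEFAULT_LANGUAGES = ["de-DE", "en-US"]


def resolve_languages(forced: Optional[str] = None) -> List[str]:
    """Return the language codes to offer the recognizer.

    `forced` mirrors the --language flag (hypothetical wiring): when given,
    only that language is used; otherwise the default candidates are
    returned so the service can pick between them.
    """
    if forced:
        return [forced]
    return list(DEFAULT_LANGUAGES)
```

Calling `resolve_languages()` gives both defaults, while `resolve_languages("de-DE")` narrows the request to German only.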
transcribe-long
For long local audio or video (e.g. MP4, Opus), use transcribe-long. It extracts the audio track (for video input), splits it into 60-second chunks, transcribes each chunk, and merges the results. Requires ffmpeg on PATH (or set via FFMPEG_PATH).
# Video (e.g. MP4) or long audio – German + English by default
python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py transcribe-long /path/to/file.mp4 --output transcript.txt
Use the project’s venv if you installed google-cloud-speech there:
.venv-speech/bin/python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py transcribe-long /path/to/file.mp4 --output transcript.txt
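The extract, split, and merge steps of transcribe-long might look roughly like the sketch below. It is not the script's implementation: the function names, the FLAC output format, mono 16 kHz resampling, and the chunk file pattern are all assumptions made for illustration. Only the ffmpeg options themselves (`-vn`, `-ac`, `-ar`, `-f segment`, `-segment_time`) are standard ffmpeg flags.

```python
import os
import shutil
from typing import List


def find_ffmpeg() -> str:
    """Prefer FFMPEG_PATH if set, otherwise look for ffmpeg on PATH."""
    candidate = os.environ.get("FFMPEG_PATH") or shutil.which("ffmpeg")
    if not candidate:
        raise FileNotFoundError("ffmpeg not found: install it or set FFMPEG_PATH")
    return candidate


def build_chunk_command(ffmpeg: str, source: str, out_dir: str,
                        seconds: int = 60) -> List[str]:
    """Build an ffmpeg call that drops video and writes fixed-length chunks."""
    pattern = os.path.join(out_dir, "chunk_%04d.flac")  # assumed naming scheme
    return [
        ffmpeg, "-i", source,
        "-vn",                     # ignore any video stream
        "-ac", "1",                # downmix to mono (assumption)
        "-ar", "16000",            # resample to 16 kHz (assumption)
        "-f", "segment",           # ffmpeg's segment muxer
        "-segment_time", str(seconds),
        pattern,
    ]


def merge_transcripts(parts: List[str]) -> str:
    """Join per-chunk transcripts in order, skipping empty chunks."""
    return " ".join(p.strip() for p in parts if p.strip())
```

The command list can be passed to `subprocess.run(cmd, check=True)`. The chunk files then sort lexicographically in playback order, so their transcripts can be merged in that order.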
For files >60 seconds without local chunking, upload to Google Cloud Storage and use batch mode:
# 1. Upload to GCS (gcloud storage shown here; gsutil cp also works)
gcloud storage cp audio.mp3 gs://your-bucket/audio.mp3
# 2. Batch transcribe
python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py transcribe-batch gs://your-bucket/audio.mp3 --output transcript.txt
Speech-to-Text uses Google Cloud (not OAuth Workspace). It requires Application Default Credentials.
# Check if configured
python3 00-system/skills/google/google-speech-to-text/scripts/transcribe_operations.py check
If not configured:
gcloud auth application-default login
Then set GOOGLE_CLOUD_PROJECT in .env or export GOOGLE_CLOUD_PROJECT=your-project-id.
# Option A: User credentials (recommended for local use)
gcloud auth application-default login
# Option B: Service account (for automation)
gcloud iam service-accounts create speech-transcribe --display-name "Speech Transcription"
# ... grant role, then:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
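A quick way to see which of the two options is active is to look where Application Default Credentials are normally found. The sketch below uses only the standard library and is an approximation of that lookup, not what the `check` command actually does: `find_adc` is a hypothetical helper, and it covers only the Linux/macOS location written by `gcloud auth application-default login`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc(env: Mapping[str, str] = os.environ,
             home: Optional[Path] = None) -> Optional[str]:
    """Report which credential source appears to be configured, if any."""
    # Option B: an explicit key file takes precedence when it exists.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path and Path(key_path).is_file():
        return "GOOGLE_APPLICATION_CREDENTIALS"

    # Option A: file written by `gcloud auth application-default login`
    # (Linux/macOS location; Windows stores it under %APPDATA%\gcloud).
    home = home or Path.home()
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    if well_known.is_file():
        return "gcloud application-default login"

    return None
```

A return value of `None` suggests neither option has been set up yet on this machine.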
Add to .env at Beam Next root:
GOOGLE_CLOUD_PROJECT=your-project-id
Or use GOOGLE_PROJECT_ID from your existing Google setup (same project can host both Workspace and Speech APIs).
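Reading the project ID from either variable could be done as in this minimal sketch. The precedence (GOOGLE_CLOUD_PROJECT first, then GOOGLE_PROJECT_ID) is an assumption, `resolve_project_id` is a hypothetical name, and loading the `.env` file into the environment is left out.

```python
import os
from typing import Mapping, Optional


def resolve_project_id(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty project ID, or None if neither is set."""
    for name in ("GOOGLE_CLOUD_PROJECT", "GOOGLE_PROJECT_ID"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```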
| Format | Sync | Batch |
|--------|------|-------|
| WAV, FLAC, LINEAR16 | ✅ | ✅ |
| MP3, OGG, AMR | ✅ (auto-decode) | ✅ |
| WebM, Opus | ✅ | ✅ |
- --model long: Best for long-form (meetings, podcasts) - default for batch
- --model short: Default for sync, optimized for short utterances
- --model latest_long: Latest long-form model (Chirp 3)

| Mode | Limit | Use case |
|------|-------|----------|
| Sync | ≤60 sec | Quick clips, voice notes |
| Batch | ≤480 min | Meetings, podcasts, interviews |
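The limits and model defaults above can be turned into a simple decision helper. This is a sketch under the stated limits only: `choose_mode` and `default_model` are hypothetical names, the duration is assumed to be known in advance, and local chunking with transcribe-long remains an alternative to batch for files over 60 seconds.

```python
SYNC_LIMIT_SECONDS = 60          # sync limit from the table above
BATCH_LIMIT_SECONDS = 480 * 60   # batch limit: 480 minutes


def choose_mode(duration_seconds: float) -> str:
    """Pick "sync" or "batch" from the audio duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "sync"
    if duration_seconds <= BATCH_LIMIT_SECONDS:
        return "batch"
    raise ValueError("audio exceeds the 480-minute batch limit; split it first")


def default_model(mode: str) -> str:
    """Default --model value per mode, as listed above."""
    return {"sync": "short", "batch": "long"}[mode]
```

For example, a 45-second voice note resolves to sync with the short model, while a 90-minute meeting resolves to batch with the long model.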
Note: Google Cloud Speech-to-Text is a separate API from Google Workspace (Gmail, Docs). It uses Application Default Credentials, not OAuth tokens. You can use the same Google Cloud project.