MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so — Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source, multilingual, tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time. It supports 20 languages, voice cloning, and streaming inference, and it runs on CPU without a GPU.

Installation

Conda (recommended)

conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .

Fix WeTextProcessing if it fails

conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

After pip install -e . completes, the moss-tts-nano CLI command is available in the active environment.

Model Weights

Models are auto-downloaded from Hugging Face on first run:

TTS model: OpenMOSS-Team/MOSS-TTS-Nano
Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano

ModelScope mirrors are available at openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

CLI Commands

Generate speech (voice clone mode)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Output defaults to generated_audio/moss_tts_nano_output.wav.

Generate from a text file (long-form)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav

Launch local web demo

moss-tts-nano serve
# or directly:
python app.py

The demo is served at http://127.0.0.1:18083. The model stays loaded in memory, so repeated requests are fast.

Direct Python entrypoint

python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."

Output: generated_audio/infer_output.wav

Python API Usage

Basic voice clone inference

from infer import MossTTSNanoInference

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
import soundfile as sf
sf.write("output.wav", audio, samplerate=48000)

English voice clone

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()

chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)
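Streaming matters mainly because of the time until the first chunk arrives. The sketch below shows one way to measure that for any chunk iterator. `timed_chunks` and `fake_stream` are hypothetical helpers introduced here for illustration; `fake_stream` only stands in for `tts.infer_stream(...)` so the example runs without the model.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def timed_chunks(chunks: Iterable) -> Iterator[Tuple[float, object]]:
    """Yield (seconds_since_start, chunk) for each item of a chunk stream."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01):
    """Hypothetical stand-in for tts.infer_stream(...); yields dummy chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # pretend the model is generating audio
        yield [0.0] * 4      # placeholder samples, not real audio


# Record when each chunk arrives, relative to the start of iteration.
arrival_times: List[float] = [t for t, _ in timed_chunks(fake_stream())]
first_chunk_latency = arrival_times[0]
total_time = arrival_times[-1]
print(f"first chunk after {first_chunk_latency:.3f}s, "
      f"all chunks after {total_time:.3f}s")
```

To time the real model, pass the `tts.infer_stream(...)` call to `timed_chunks` in place of `fake_stream()`.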

Long-text synthesis with chunked voice cloning

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""

audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (moss-tts-nano serve or python app.py):

import base64
import io

import requests
import soundfile as sf

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好，这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)

response.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])

audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)
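The request and response handling above can be factored into two small helpers, which makes the base64 round trip easy to check offline. This is a minimal sketch: the function names are hypothetical, and the field names (`text`, `prompt_audio_base64`, `audio_base64`) are taken from the example above rather than from a published API schema.

```python
import base64
import json
import os
import tempfile


def build_generate_payload(text: str, prompt_audio_path: str) -> dict:
    """Build the JSON body used above: text plus base64 reference audio."""
    with open(prompt_audio_path, "rb") as f:
        ref_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"text": text, "prompt_audio_base64": ref_b64}


def extract_audio_bytes(response_json: dict) -> bytes:
    """Decode the base64 audio field, with a clear error if it is missing."""
    if "audio_base64" not in response_json:
        raise KeyError(
            f"no 'audio_base64' field; got keys {sorted(response_json)}"
        )
    return base64.b64decode(response_json["audio_base64"])


# Offline round trip with dummy bytes (not a real WAV, no server needed).
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"RIFF-dummy-bytes")
    dummy_path = tmp.name

payload = build_generate_payload("hello", dummy_path)
os.remove(dummy_path)

# Pretend the server echoed the audio back in its response field.
fake_response = {"audio_base64": payload["prompt_audio_base64"]}
decoded = extract_audio_bytes(fake_response)
print(json.dumps(sorted(payload)), len(decoded))
```

With a running server, the payload would be sent with `requests.post(..., json=payload)` and the parsed JSON passed to `extract_audio_bytes`.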

Streaming HTTP response (real-time web playback)

import base64

import requests

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例，适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)

Supported Languages

| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.

Architecture Overview

Pipeline: Audio Tokenizer + LLM (pure autoregressive)
Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
Output: 48 kHz, 2-channel (stereo)
Token rate: 12.5 Hz token stream
Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
LLM: ~0.1B parameters total
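The token rate and bitrate figures above are consistent with each other, which a short calculation can show. The sketch assumes that the 0.125 kbps to 2 kbps range corresponds to using 1 to 16 equally sized codebooks; the resulting 10 bits per token (a 1024-entry codebook) is inferred from those numbers, not stated in the source.

```python
import math

FRAME_RATE_HZ = 12.5      # token frames per second (from the list above)
NUM_CODEBOOKS = 16        # RVQ codebooks (from the list above)
MIN_BITRATE_BPS = 125.0   # 0.125 kbps, assumed to mean a single codebook

# Bits carried by one token of one codebook (inferred, see assumption above).
bits_per_token = MIN_BITRATE_BPS / FRAME_RATE_HZ
implied_codebook_size = 2 ** int(bits_per_token)


def bitrate_bps(n_codebooks: int) -> float:
    """Bitrate when n_codebooks codebooks are used per frame."""
    return n_codebooks * bits_per_token * FRAME_RATE_HZ


def tokens_for(seconds: float, n_codebooks: int = NUM_CODEBOOKS) -> int:
    """Number of audio tokens needed to represent `seconds` of speech."""
    return math.ceil(seconds * FRAME_RATE_HZ) * n_codebooks


print(bits_per_token)              # 10.0
print(implied_codebook_size)       # 1024
print(bitrate_bps(NUM_CODEBOOKS))  # 2000.0, i.e. 2 kbps
print(tokens_for(10))              # 2000 tokens for 10 s of audio
```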

Key CLI Flags

| Flag | Alias | Description |
|------|-------|-------------|
| --prompt-audio-path | (none) | Path to reference WAV for voice cloning (infer.py) |
| --prompt-speech | (none) | Same purpose in moss-tts-nano generate CLI |
| --text | (none) | Input text string |
| --text-file | (none) | Path to plain text file for long-form synthesis |
| --output | (none) | Output WAV file path (default varies by entrypoint) |

Common Patterns

Pattern: Batch synthesis with one reference voice

from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

sentences = [
    "第一句话，用于批量合成测试。",
    "第二句话，保持相同的音色。",
    "第三句话，输出独立的音频文件。",
]

for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

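# Note: this collects every chunk first and plays the whole utterance afterwards,
# so playback starts only once generation has finished.
buffer = []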
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()

Pattern: Gradio integration

import gradio as gr
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)

demo.launch()

Troubleshooting

WeTextProcessing install fails

# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set HF_ENDPOINT to a mirror if Hugging Face is unreachable:

export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"

Or use ModelScope:

pip install modelscope

Then point model paths to openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.

Out of memory on CPU

Use streaming inference (infer_stream) to reduce peak memory.
For long text inputs, work in smaller chunks; the model handles chunked voice cloning automatically.
Close other applications; the model needs ~1–2 GB RAM.
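If memory is still tight, one option is to pre-split long text on the caller side and synthesize each piece in a separate call. The sketch below is an illustration under stated assumptions: `split_text` and its `max_chars` limit are hypothetical and not part of MOSS-TTS-Nano, which does its own chunking internally. It splits at sentence boundaries (Latin and CJK punctuation) and greedily packs sentences up to the limit.

```python
import re
from typing import List

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
_CJK_END = ("。", "！", "？")


def split_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not current:
            candidate = sentence
        else:
            # No space is needed after CJK punctuation.
            sep = "" if current.endswith(_CJK_END) else " "
            candidate = current + sep + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_text("One. Two! Three? 第一句。第二句。", max_chars=12)
print(demo_chunks)  # ['One. Two!', 'Three? 第一句。', '第二句。']
```

Each chunk could then be passed to `tts.infer(...)` in turn with the same reference audio, writing one output file per chunk as in the batch pattern.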

Audio output is silent or corrupt

Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, any sample rate (it will be resampled).
Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
Avoid reference audio with heavy background noise.
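The checks above can be automated before a file is used as a reference. This is a minimal sketch using only the standard library: `describe_wav` and `reference_problems` are hypothetical helpers, and the 3-second threshold is the lower end of the range given above. Python's `wave` module reads integer PCM only, so a float32 WAV would need a library such as soundfile instead.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wf.getsampwidth() * 8,
            "duration_s": wf.getnframes() / rate,
        }


def reference_problems(info: dict, min_duration_s: float = 3.0) -> list:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = []
    if info["channels"] not in (1, 2):
        problems.append(f"expected mono or stereo, got {info['channels']} channels")
    if info["duration_s"] < min_duration_s:
        problems.append(
            f"only {info['duration_s']:.1f}s long, need at least {min_duration_s}s"
        )
    return problems


# Demo file: 4 s of 16-bit mono silence at 16 kHz. Silence is not a useful
# voice reference; it only exercises the checks.
demo_path = os.path.join(tempfile.mkdtemp(), "ref.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000 * 4)

info = describe_wav(demo_path)
print(info, reference_problems(info))
```

These checks cover format and duration only; background noise still has to be judged by listening.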

`moss-tts-nano` command not found

# Re-run editable install inside the active conda env
pip install -e .
which moss-tts-nano   # should resolve now

Port conflict for web demo

# Default port is 18083; check what occupies it
lsof -i :18083
# Kill if needed, then relaunch
moss-tts-nano serve
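On systems without lsof, the same check can be done from Python. The sketch below tries a TCP connection to the port and treats a successful connection as "in use". `port_in_use` is a hypothetical helper, and 18083 is the default port mentioned above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout_s: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Demo: open a listener on a free port chosen by the OS, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
busy = port_in_use(demo_port)
listener.close()

print(f"port {demo_port} in use while listening: {busy}")
print(f"default demo port 18083 in use: {port_in_use(18083)}")
```

This only reports whether the port is occupied; finding and stopping the owning process still requires a tool such as lsof.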

Output Defaults

| Entrypoint | Default output path |
|---|---|
| python infer.py | generated_audio/infer_output.wav |
| moss-tts-nano generate | generated_audio/moss_tts_nano_output.wav |
| python app.py / moss-tts-nano serve | returned via HTTP response |

The generated_audio/ directory is created automatically if it does not exist.
