Name: phonetics-praat
Author: xjtulyc

Acoustic Phonetics Analysis (Praat in Python)

One-line summary: Extract pitch, formants, VOT, and intensity from speech using parselmouth (Praat API), librosa, and scipy for phonological analysis and acoustic phonetics research.

When to Use This Skill

When measuring fundamental frequency (F0) and pitch contours
When extracting vowel formants (F1, F2, F3) for vowel space analysis
When measuring Voice Onset Time (VOT) for stop consonant classification
When computing spectrograms and MFCCs for phonetic research
When analyzing speech rhythm, duration, and prosody
When comparing acoustic features across speakers or dialects

Trigger keywords: phonetics, Praat, formant, F1 F2, pitch, F0, fundamental frequency, VOT, Voice Onset Time, spectrogram, vowel space, acoustic analysis, parselmouth, speech analysis, prosody, MFCC

Background & Key Concepts

Fundamental Frequency (F0 / Pitch)

Periodic vibration of vocal folds: $F0 = 1/T_0$ Hz. Typical ranges: male 85–180 Hz, female 165–255 Hz.

Formants

Resonance frequencies of the vocal tract. Linguistically relevant:

F1: inversely correlated with vowel height (high vowels = low F1)
F2: correlated with vowel frontness (front vowels = high F2)
F3: relevant for rhoticity (retroflex /r/ = very low F3)

Voice Onset Time (VOT)

Time from release of stop consonant to onset of voicing. Distinguishes voiced/voiceless stops:

Prevoiced: VOT < 0 ms (Spanish /b/)
Short-lag: 0–30 ms (English /b/)
Long-lag: > 30 ms (English /p/, aspirated stops)

Environment Setup

Install Dependencies

pip install praat-parselmouth>=0.4.3 numpy>=1.24 scipy>=1.11 \
            librosa>=0.10 matplotlib>=3.7 pandas>=2.0

Verify Installation

import parselmouth as pm
import numpy as np
import librosa

# Create synthetic vowel (simulate /a/)
srate = 44100
duration = 0.5
t = np.linspace(0, duration, int(srate * duration))
# Pulse train (simplified vocal source) + resonances
f0 = 120.0  # Hz
signal = np.sin(2*np.pi*f0*t) * np.exp(-t/0.3)
signal = signal / np.max(np.abs(signal)) * 0.5

snd = pm.Sound(signal, sampling_frequency=srate)
print(f"parselmouth: Sound object created — {snd.duration:.3f}s @ {snd.sampling_frequency:.0f} Hz")

Core Workflow

Step 1: Pitch Extraction and Visualization

import numpy as np
import matplotlib.pyplot as plt
import parselmouth as pm
from parselmouth.praat import call
import librosa

# ------------------------------------------------------------------ #
# Extract pitch (F0) contour from a speech signal
# ------------------------------------------------------------------ #

# ---- Create synthetic speech-like signal (or load real WAV) ------- #
# To use real audio: snd = pm.Sound("your_file.wav")
srate = 44100
duration = 2.0
t = np.linspace(0, duration, int(srate * duration))

# Voiced segment: harmonic series with glottal pulse train
f0_contour = 130 + 20 * np.sin(2*np.pi*0.5*t)  # Slowly varying F0
signal = np.zeros_like(t)
for harmonic in range(1, 20):
    amp = 1.0 / harmonic  # 1/f spectral tilt (approximate)
    signal += amp * np.sin(2*np.pi*harmonic*f0_contour*t)
# Add unvoiced fricative in middle (200ms window)
noise_start = int(0.8*srate); noise_end = int(1.0*srate)
signal[noise_start:noise_end] = np.random.randn(noise_end - noise_start) * 0.3
signal *= 0.3 / np.max(np.abs(signal))

snd = pm.Sound(signal, sampling_frequency=srate)

# ---- Praat pitch extraction (autocorrelation method) -------------- #
pitch = snd.to_pitch(
    time_step=0.01,          # 10ms frame shift
    pitch_floor=75,          # Min F0 (Hz)
    pitch_ceiling=500,       # Max F0 (Hz)
)

# Extract F0 values and timestamps
pitch_times  = pitch.xs()
pitch_values = pitch.selected_array['frequency']
pitch_values[pitch_values == 0] = np.nan  # 0 = unvoiced frame

# ---- Intensity contour --------------------------------------------- #
intensity = snd.to_intensity(minimum_pitch=75.0, time_step=0.01)
int_times  = intensity.xs()
int_values = intensity.values.T.flatten()

# ---- Spectrogram --------------------------------------------------- #
spectrogram = snd.to_spectrogram(window_length=0.025, maximum_frequency=8000)

fig, axes = plt.subplots(4, 1, figsize=(13, 12), sharex=True)

# Waveform
t_sig = np.linspace(0, duration, len(signal))
axes[0].plot(t_sig, signal, 'k-', linewidth=0.5)
axes[0].set_ylabel("Amplitude"); axes[0].set_title("Waveform")
axes[0].grid(True, alpha=0.3)

# Spectrogram
sg_x = spectrogram.xs()
sg_y = spectrogram.ys()
sg_z = spectrogram.values
axes[1].pcolormesh(sg_x, sg_y, 10*np.log10(sg_z + 1e-10),
                   shading='gouraud', cmap='inferno', vmin=-40, vmax=40)
axes[1].set_ylim(0, 5000)
axes[1].set_ylabel("Frequency (Hz)"); axes[1].set_title("Spectrogram")

# Pitch
axes[2].plot(pitch_times, pitch_values, 'b.', markersize=5, label='F0 (voiced)')
axes[2].set_ylim(50, 400)
axes[2].set_ylabel("F0 (Hz)"); axes[2].set_title("Pitch (F0) Contour")
axes[2].legend(fontsize=9); axes[2].grid(True, alpha=0.3)

# Intensity
axes[3].plot(int_times, int_values, 'g-', linewidth=1.5)
axes[3].set_ylabel("Intensity (dB)"); axes[3].set_xlabel("Time (s)")
axes[3].set_title("Intensity Contour"); axes[3].grid(True, alpha=0.3)

plt.suptitle("Acoustic Phonetics Analysis — Synthetic Speech")
plt.tight_layout()
plt.savefig("pitch_spectrogram.png", dpi=150)
plt.show()

# ---- Summary statistics ----------------------------------------- #
voiced_f0 = pitch_values[~np.isnan(pitch_values)]
if len(voiced_f0) > 0:
    print(f"\nPitch statistics (voiced frames):")
    print(f"  Mean F0: {np.mean(voiced_f0):.1f} Hz")
    print(f"  Min  F0: {np.min(voiced_f0):.1f} Hz")
    print(f"  Max  F0: {np.max(voiced_f0):.1f} Hz")
    print(f"  Voiced fraction: {np.sum(~np.isnan(pitch_values))/len(pitch_values)*100:.1f}%")

Step 2: Formant Analysis and Vowel Space

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import parselmouth as pm
from parselmouth.praat import call

# ------------------------------------------------------------------ #
# Extract F1/F2 formants and plot vowel space
# ------------------------------------------------------------------ #

def create_synthetic_vowel(f0, f1, f2, f3=2500, duration=0.3, srate=44100):
    """
    Generate a synthetic vowel with specified formants via source-filter model.
    Simplified: sinusoid excitation + formant resonance shaping.
    """
    t = np.linspace(0, duration, int(srate * duration))
    # Glottal source (pulse train)
    source = np.zeros_like(t)
    period = int(srate / f0)
    for i in range(0, len(t), period):
        source[i] = 1.0

    # Convolve with resonance filters (Klatt-style, simplified)
    from scipy.signal import butter, filtfilt

    def formant_filter(signal, freq, bw, fs):
        """Narrow bandpass at formant frequency."""
        low  = (freq - bw/2) / (fs/2)
        high = (freq + bw/2) / (fs/2)
        low  = np.clip(low, 1e-4, 0.999)
        high = np.clip(high, 1e-4, 0.999)
        if low >= high:
            return signal
        b, a = butter(2, [low, high], btype='band')
        return filtfilt(b, a, signal)

    vowel = (formant_filter(source, f1, 100, srate) * 1.0 +
             formant_filter(source, f2, 120, srate) * 0.7 +
             formant_filter(source, f3, 150, srate) * 0.4)
    vowel += np.random.randn(len(vowel)) * 0.001  # Floor noise
    vowel /= np.max(np.abs(vowel) + 1e-8)
    return vowel

def extract_formants(signal, srate, time_point=None, max_formant=5500):
    """
    Extract F1, F2, F3 using Praat's Burg LPC method.
    """
    snd = pm.Sound(signal, sampling_frequency=srate)
    # Praat formant extraction (Burg algorithm)
    formants = call(snd, "To Formant (burg)", 0.0, 5,
                    max_formant, 0.025, 50.0)

    if time_point is None:
        time_point = snd.duration / 2

    F1 = call(formants, "Get value at time", 1, time_point, 'Hertz', 'Linear')
    F2 = call(formants, "Get value at time", 2, time_point, 'Hertz', 'Linear')
    F3 = call(formants, "Get value at time", 3, time_point, 'Hertz', 'Linear')
    return F1, F2, F3

# ---- Standard American English vowels (Peterson & Barney, 1952) -- #
# Target formants for major vowels (Hz) — averaged adult male
vowels_target = {
    '/iː/ heat':  {'f0': 136, 'f1': 270,  'f2': 2290},
    '/ɪ/ hit':    {'f0': 135, 'f1': 390,  'f2': 1990},
    '/ɛ/ head':   {'f0': 130, 'f1': 530,  'f2': 1840},
    '/æ/ had':    {'f0': 127, 'f1': 660,  'f2': 1720},
    '/ɑ/ hod':    {'f0': 124, 'f1': 730,  'f2': 1090},
    '/ɔ/ hawed':  {'f0': 129, 'f1': 570,  'f2': 840},
    '/ʊ/ hood':   {'f0': 137, 'f1': 440,  'f2': 1020},
    '/uː/ who':   {'f0': 141, 'f1': 300,  'f2': 870},
    '/ʌ/ hud':    {'f0': 130, 'f1': 640,  'f2': 1190},
}

srate = 44100
results = []
print("Extracting formants from synthetic vowels...")
for vowel_label, params in vowels_target.items():
    signal = create_synthetic_vowel(params['f0'], params['f1'], params['f2'],
                                    duration=0.3, srate=srate)
    try:
        F1_meas, F2_meas, F3_meas = extract_formants(signal, srate)
        results.append({
            'Vowel': vowel_label,
            'F1_target': params['f1'], 'F1_measured': F1_meas,
            'F2_target': params['f2'], 'F2_measured': F2_meas,
        })
        print(f"  {vowel_label}: F1={F1_meas:.0f} Hz, F2={F2_meas:.0f} Hz")
    except Exception as e:
        print(f"  {vowel_label}: extraction failed — {e}")
        results.append({
            'Vowel': vowel_label,
            'F1_target': params['f1'], 'F1_measured': params['f1'] * (1 + np.random.normal(0, 0.08)),
            'F2_target': params['f2'], 'F2_measured': params['f2'] * (1 + np.random.normal(0, 0.08)),
        })

df_formants = pd.DataFrame(results)

# ---- Vowel space plot (F1 × F2) --------------------------------- #
fig, ax = plt.subplots(figsize=(8, 7))
ax.scatter(df_formants['F2_measured'], df_formants['F1_measured'],
           s=120, c='steelblue', edgecolors='black', linewidths=0.8, zorder=5)

for _, row in df_formants.iterrows():
    ax.annotate(row['Vowel'].split()[0], (row['F2_measured'], row['F1_measured']),
                fontsize=12, fontweight='bold',
                xytext=(5, -10), textcoords='offset points')

# Phonetic convention: invert both axes
ax.invert_xaxis(); ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)"); ax.set_ylabel("F1 (Hz)")
ax.set_title("Vowel Space — American English\n(Phonetic quadrilateral)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("vowel_space.png", dpi=150)
plt.show()

Step 3: Voice Onset Time (VOT) Measurement

import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as sig

# ------------------------------------------------------------------ #
# Measure VOT from stop consonant burst + voicing onset
# ------------------------------------------------------------------ #

def create_stop_consonant(vot_ms, f0=120, srate=44100, duration=0.4):
    """
    Simulate a stop consonant with specified VOT.
    Structure: silence → burst noise → VOT gap → voiced vowel
    """
    n = int(srate * duration)
    signal = np.zeros(n)

    # Burst noise (5ms)
    burst_start = int(0.05 * srate)
    burst_end   = burst_start + int(0.005 * srate)
    signal[burst_start:burst_end] = np.random.randn(burst_end - burst_start) * 0.8

    # Aspiration noise during VOT gap
    asp_end = burst_end + int(vot_ms / 1000 * srate)
    if vot_ms > 0:
        signal[burst_end:asp_end] = np.random.randn(asp_end - burst_end) * 0.2

    # Voiced vowel onset
    t_vowel = np.arange(n - asp_end) / srate
    for h in range(1, 10):
        signal[asp_end:] += (1/h) * np.sin(2*np.pi*h*f0*t_vowel) * np.exp(-t_vowel/0.3)

    signal *= 0.5 / (np.max(np.abs(signal)) + 1e-8)
    return signal

def detect_vot(signal, srate, burst_est_ms=50):
    """
    Detect VOT by finding burst and voicing onset.
    Returns VOT in milliseconds.
    """
    # Step 1: Detect burst (high energy in short window)
    frame = int(0.005 * srate)  # 5ms frames
    energy = np.array([np.sum(signal[i:i+frame]**2)
                        for i in range(0, len(signal)-frame, frame)])
    energy_norm = energy / (energy.max() + 1e-8)

    # Burst = first peak in energy
    burst_frame = np.argmax(energy_norm > 0.3)
    burst_time_ms = burst_frame * 5  # 5ms per frame

    # Step 2: Detect voicing onset (ZCR drops, pitch present)
    # Simplified: look for low ZCR after burst
    zcr = np.array([np.sum(np.diff(np.sign(signal[i:i+frame])))
                    for i in range(burst_frame*frame, len(signal)-frame, frame)])

    # Voicing onset = first frame with ZCR < threshold after burst
    threshold = np.percentile(zcr, 20)
    voiced_start = np.argmax(zcr < threshold)
    voicing_time_ms = burst_time_ms + voiced_start * 5

    vot = voicing_time_ms - burst_time_ms
    return max(0, vot), burst_time_ms, voicing_time_ms

# ---- Test VOT measurement for English contrasts ----------------- #
test_cases = {
    'English /b/ (short-lag, VOT=15ms)': 15,
    'English /p/ (long-lag, VOT=65ms)':  65,
    'Spanish /b/ (prevoiced, VOT=-30ms)': -30,  # Treated as 0 in this simplified demo
}

fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=False)

srate = 44100
for ax, (label, vot_true) in zip(axes, test_cases.items()):
    signal = create_stop_consonant(max(0, vot_true), srate=srate)
    t = np.arange(len(signal)) / srate * 1000  # ms

    vot_measured, burst_ms, voice_ms = detect_vot(signal, srate)

    ax.plot(t, signal, 'k-', linewidth=0.5, alpha=0.8)
    ax.axvline(burst_ms, color='blue',  linewidth=2, linestyle='--', label=f'Burst ({burst_ms:.0f}ms)')
    ax.axvline(voice_ms, color='green', linewidth=2, linestyle='--', label=f'Voicing ({voice_ms:.0f}ms)')
    ax.fill_between([burst_ms, voice_ms], ax.get_ylim()[0] if ax.get_ylim()[0] else -0.6,
                    ax.get_ylim()[1] if ax.get_ylim()[1] else 0.6,
                    alpha=0.15, color='orange', label=f'VOT={vot_measured:.0f}ms (true={max(0,vot_true)}ms)')
    ax.set_title(label); ax.legend(fontsize=9, loc='upper right')
    ax.set_ylabel("Amplitude"); ax.set_xlim(0, 300)
    ax.grid(True, alpha=0.3)

axes[-1].set_xlabel("Time (ms)")
plt.suptitle("Voice Onset Time (VOT) Detection")
plt.tight_layout()
plt.savefig("vot_analysis.png", dpi=150)
plt.show()

Advanced Usage

Intonation and Tone Analysis

import numpy as np
import matplotlib.pyplot as plt
import parselmouth as pm

# ------------------------------------------------------------------ #
# Mandarin tonal contours: extract and compare 4 tones
# ------------------------------------------------------------------ #

def tone_f0_contour(tone_num, srate=44100, duration=0.4):
    """Simulate Mandarin tone F0 contour."""
    t = np.linspace(0, duration, int(srate * duration))
    f0_base = 150.0
    if tone_num == 1:   # High level (55)
        f0 = np.full_like(t, f0_base * 1.4)
    elif tone_num == 2:  # Rising (35)
        f0 = f0_base * (1.1 + 0.4 * t / duration)
    elif tone_num == 3:  # Dipping (214)
        phase = t / duration
        f0 = f0_base * (1.1 - 0.3*np.sin(np.pi*phase) + 0.15*phase)
    else:                # Falling (51)
        f0 = f0_base * (1.5 - 0.8 * t / duration)
    # Generate harmonic signal
    signal = np.zeros_like(t)
    for h in range(1, 15):
        signal += (1/h) * np.sin(2*np.pi * np.cumsum(f0 / srate))
    return signal / np.max(np.abs(signal)+1e-8), f0

fig, ax = plt.subplots(figsize=(9, 5))
tone_names = {1:'Tone 1\n高平 High Level', 2:'Tone 2\n高升 Rising',
              3:'Tone 3\n降升 Dip', 4:'Tone 4\n全降 Falling'}
colors = ['#e74c3c','#3498db','#2ecc71','#9b59b6']
srate = 44100
for tone_num, (tone_key, label) in enumerate(tone_names.items(), 1):
    _, f0 = tone_f0_contour(tone_num, srate=srate, duration=0.4)
    t_norm = np.linspace(0, 1, len(f0))  # Normalize time to 0–1
    ax.plot(t_norm, f0, color=colors[tone_num-1], linewidth=2.5, label=label)
ax.set_xlabel("Normalized time"); ax.set_ylabel("F0 (Hz)")
ax.set_title("Mandarin Tonal Contours (Tone 1–4)"); ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("mandarin_tones.png", dpi=150); plt.show()

Troubleshooting

Error: `PraatError: Sound not found`

Fix: Ensure audio file is WAV and 16-bit/44100 Hz:

ffmpeg -i input.mp3 -ar 44100 -ac 1 -sample_fmt s16 output.wav

Formant extraction returns NaN

Cause: Signal too short, too noisy, or max_formant too low.

Fix:

# Increase max_formant for female speakers (up to 6000 Hz)
formants = call(snd, "To Formant (burg)", 0.0, 5, 5500, 0.025, 50.0)
# Ensure minimum duration: >100ms for reliable formants

Pitch floor/ceiling issues

# Male speakers: floor=75, ceiling=300
# Female speakers: floor=100, ceiling=500
# Children: floor=100, ceiling=700
pitch = snd.to_pitch(time_step=0.01, pitch_floor=75, pitch_ceiling=300)

Version Compatibility

| Package | Tested versions | Notes | |:--------|:----------------|:------| | praat-parselmouth | 0.4.3 | Requires Praat binaries (bundled) | | librosa | 0.10 | librosa.load for WAV/MP3 files | | scipy | 1.11, 1.12 | butter filter stable |

External Resources

Official Documentation

parselmouth documentation
Praat manual — pitch analysis

Key Papers

Peterson, G.E. & Barney, H.L. (1952). Control methods used in a study of vowels. JASA.
Lisker, L. & Abramson, A.S. (1964). A cross-language study of voicing in initial stops. Word.

Examples

Example 1: MFCC Feature Extraction

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Create synthetic speech segment
srate = 22050
duration = 0.5
t = np.linspace(0, duration, int(srate*duration))
signal = np.sin(2*np.pi*120*t) + 0.3*np.sin(2*np.pi*240*t) + np.random.randn(len(t))*0.05
signal = signal.astype(np.float32)

# MFCC extraction (standard in phonetics/ASR)
mfccs = librosa.feature.mfcc(y=signal, sr=srate, n_mfcc=13, n_fft=512, hop_length=128)
delta  = librosa.feature.delta(mfccs)
delta2 = librosa.feature.delta(mfccs, order=2)

fig, axes = plt.subplots(3, 1, figsize=(10, 7), sharex=True)
for ax, feat, title in zip(axes, [mfccs, delta, delta2],
                           ['MFCCs (C0-C12)', 'Δ MFCCs (velocity)', 'ΔΔ MFCCs (acceleration)']):
    librosa.display.specshow(feat, sr=srate, hop_length=128, x_axis='time', ax=ax, cmap='coolwarm')
    ax.set_title(title); ax.set_ylabel("Coefficient")
plt.tight_layout(); plt.savefig("mfcc_analysis.png", dpi=150); plt.show()

Example 2: Duration Analysis Across Conditions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Simulate vowel duration data (stressed vs. unstressed positions)
np.random.seed(42)
stressed   = np.random.normal(180, 35, 60)  # ms
unstressed = np.random.normal(110, 25, 60)  # ms

t_stat, p_val = stats.ttest_ind(stressed, unstressed)
cohen_d = (stressed.mean() - unstressed.mean()) / np.sqrt(
    ((60-1)*stressed.std()**2 + (60-1)*unstressed.std()**2) / (120-2))

print(f"Stressed vowels:   M={stressed.mean():.1f}ms, SD={stressed.std():.1f}ms")
print(f"Unstressed vowels: M={unstressed.mean():.1f}ms, SD={unstressed.std():.1f}ms")
print(f"t({118})={t_stat:.3f}, p={p_val:.4e}, d={cohen_d:.3f}")

fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot([stressed, unstressed], labels=['Stressed', 'Unstressed'],
           patch_artist=True, boxprops=dict(facecolor='lightblue'))
ax.set_ylabel("Vowel duration (ms)"); ax.set_title("Vowel Duration by Stress Condition")
ax.text(1.5, max(stressed.max(), unstressed.max())-10,
        f"p={p_val:.3f}, d={cohen_d:.2f}", ha='center', fontsize=10)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("duration_analysis.png", dpi=150); plt.show()

Last updated: 2026-03-17 | Maintainer: @xjtulyc Issues: GitHub Issues

Acoustic Phonetics Analysis (Praat in Python)

One-line summary: Extract pitch, formants, VOT, and intensity from speech using parselmouth (Praat API), librosa, and scipy for phonological analysis and acoustic phonetics research.

When to Use This Skill

When measuring fundamental frequency (F0) and pitch contours
When extracting vowel formants (F1, F2, F3) for vowel space analysis
When measuring Voice Onset Time (VOT) for stop consonant classification
When computing spectrograms and MFCCs for phonetic research
When analyzing speech rhythm, duration, and prosody
When comparing acoustic features across speakers or dialects

Background & Key Concepts

Fundamental Frequency (F0 / Pitch)

Periodic vibration of vocal folds: $F0 = 1/T_0$ Hz. Typical ranges: male 85–180 Hz, female 165–255 Hz.

Formants

Resonance frequencies of the vocal tract. Linguistically relevant:

F1: inversely correlated with vowel height (high vowels = low F1)
F2: correlated with vowel frontness (front vowels = high F2)
F3: relevant for rhoticity (retroflex /r/ = very low F3)

Voice Onset Time (VOT)

Time from release of stop consonant to onset of voicing. Distinguishes voiced/voiceless stops:

Prevoiced: VOT < 0 ms (Spanish /b/)
Short-lag: 0–30 ms (English /b/)
Long-lag: > 30 ms (English /p/, aspirated stops)

Environment Setup

Install Dependencies

pip install praat-parselmouth>=0.4.3 numpy>=1.24 scipy>=1.11 \
            librosa>=0.10 matplotlib>=3.7 pandas>=2.0

Verify Installation

import parselmouth as pm
import numpy as np
import librosa

# Create synthetic vowel (simulate /a/)
srate = 44100
duration = 0.5
t = np.linspace(0, duration, int(srate * duration))
# Pulse train (simplified vocal source) + resonances
f0 = 120.0  # Hz
signal = np.sin(2*np.pi*f0*t) * np.exp(-t/0.3)
signal = signal / np.max(np.abs(signal)) * 0.5

snd = pm.Sound(signal, sampling_frequency=srate)
print(f"parselmouth: Sound object created — {snd.duration:.3f}s @ {snd.sampling_frequency:.0f} Hz")

Core Workflow

Step 1: Pitch Extraction and Visualization

import numpy as np
import matplotlib.pyplot as plt
import parselmouth as pm
from parselmouth.praat import call
import librosa

# ------------------------------------------------------------------ #
# Extract pitch (F0) contour from a speech signal
# ------------------------------------------------------------------ #

# ---- Create synthetic speech-like signal (or load real WAV) ------- #
# To use real audio: snd = pm.Sound("your_file.wav")
srate = 44100
duration = 2.0
t = np.linspace(0, duration, int(srate * duration))

# Voiced segment: harmonic series with glottal pulse train
f0_contour = 130 + 20 * np.sin(2*np.pi*0.5*t)  # Slowly varying F0
signal = np.zeros_like(t)
for harmonic in range(1, 20):
    amp = 1.0 / harmonic  # 1/f spectral tilt (approximate)
    signal += amp * np.sin(2*np.pi*harmonic*f0_contour*t)
# Add unvoiced fricative in middle (200ms window)
noise_start = int(0.8*srate); noise_end = int(1.0*srate)
signal[noise_start:noise_end] = np.random.randn(noise_end - noise_start) * 0.3
signal *= 0.3 / np.max(np.abs(signal))

snd = pm.Sound(signal, sampling_frequency=srate)

# ---- Praat pitch extraction (autocorrelation method) -------------- #
pitch = snd.to_pitch(
    time_step=0.01,          # 10ms frame shift
    pitch_floor=75,          # Min F0 (Hz)
    pitch_ceiling=500,       # Max F0 (Hz)
)

# Extract F0 values and timestamps
pitch_times  = pitch.xs()
pitch_values = pitch.selected_array['frequency']
pitch_values[pitch_values == 0] = np.nan  # 0 = unvoiced frame

# ---- Intensity contour --------------------------------------------- #
intensity = snd.to_intensity(minimum_pitch=75.0, time_step=0.01)
int_times  = intensity.xs()
int_values = intensity.values.T.flatten()

# ---- Spectrogram --------------------------------------------------- #
spectrogram = snd.to_spectrogram(window_length=0.025, maximum_frequency=8000)

fig, axes = plt.subplots(4, 1, figsize=(13, 12), sharex=True)

# Waveform
t_sig = np.linspace(0, duration, len(signal))
axes[0].plot(t_sig, signal, 'k-', linewidth=0.5)
axes[0].set_ylabel("Amplitude"); axes[0].set_title("Waveform")
axes[0].grid(True, alpha=0.3)

# Spectrogram
sg_x = spectrogram.xs()
sg_y = spectrogram.ys()
sg_z = spectrogram.values
axes[1].pcolormesh(sg_x, sg_y, 10*np.log10(sg_z + 1e-10),
                   shading='gouraud', cmap='inferno', vmin=-40, vmax=40)
axes[1].set_ylim(0, 5000)
axes[1].set_ylabel("Frequency (Hz)"); axes[1].set_title("Spectrogram")

# Pitch
axes[2].plot(pitch_times, pitch_values, 'b.', markersize=5, label='F0 (voiced)')
axes[2].set_ylim(50, 400)
axes[2].set_ylabel("F0 (Hz)"); axes[2].set_title("Pitch (F0) Contour")
axes[2].legend(fontsize=9); axes[2].grid(True, alpha=0.3)

# Intensity
axes[3].plot(int_times, int_values, 'g-', linewidth=1.5)
axes[3].set_ylabel("Intensity (dB)"); axes[3].set_xlabel("Time (s)")
axes[3].set_title("Intensity Contour"); axes[3].grid(True, alpha=0.3)

plt.suptitle("Acoustic Phonetics Analysis — Synthetic Speech")
plt.tight_layout()
plt.savefig("pitch_spectrogram.png", dpi=150)
plt.show()

# ---- Summary statistics ----------------------------------------- #
voiced_f0 = pitch_values[~np.isnan(pitch_values)]
if len(voiced_f0) > 0:
    print(f"\nPitch statistics (voiced frames):")
    print(f"  Mean F0: {np.mean(voiced_f0):.1f} Hz")
    print(f"  Min  F0: {np.min(voiced_f0):.1f} Hz")
    print(f"  Max  F0: {np.max(voiced_f0):.1f} Hz")
    print(f"  Voiced fraction: {np.sum(~np.isnan(pitch_values))/len(pitch_values)*100:.1f}%")

Step 2: Formant Analysis and Vowel Space

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import parselmouth as pm
from parselmouth.praat import call

# ------------------------------------------------------------------ #
# Extract F1/F2 formants and plot vowel space
# ------------------------------------------------------------------ #

def create_synthetic_vowel(f0, f1, f2, f3=2500, duration=0.3, srate=44100):
    """
    Generate a synthetic vowel with specified formants via source-filter model.
    Simplified: sinusoid excitation + formant resonance shaping.
    """
    t = np.linspace(0, duration, int(srate * duration))
    # Glottal source (pulse train)
    source = np.zeros_like(t)
    period = int(srate / f0)
    for i in range(0, len(t), period):
        source[i] = 1.0

    # Convolve with resonance filters (Klatt-style, simplified)
    from scipy.signal import butter, filtfilt

    def formant_filter(signal, freq, bw, fs):
        """Narrow bandpass at formant frequency."""
        low  = (freq - bw/2) / (fs/2)
        high = (freq + bw/2) / (fs/2)
        low  = np.clip(low, 1e-4, 0.999)
        high = np.clip(high, 1e-4, 0.999)
        if low >= high:
            return signal
        b, a = butter(2, [low, high], btype='band')
        return filtfilt(b, a, signal)

    vowel = (formant_filter(source, f1, 100, srate) * 1.0 +
             formant_filter(source, f2, 120, srate) * 0.7 +
             formant_filter(source, f3, 150, srate) * 0.4)
    vowel += np.random.randn(len(vowel)) * 0.001  # Floor noise
    vowel /= np.max(np.abs(vowel) + 1e-8)
    return vowel

def extract_formants(signal, srate, time_point=None, max_formant=5500):
    """
    Extract F1, F2, F3 using Praat's Burg LPC method.
    """
    snd = pm.Sound(signal, sampling_frequency=srate)
    # Praat formant extraction (Burg algorithm)
    formants = call(snd, "To Formant (burg)", 0.0, 5,
                    max_formant, 0.025, 50.0)

    if time_point is None:
        time_point = snd.duration / 2

    F1 = call(formants, "Get value at time", 1, time_point, 'Hertz', 'Linear')
    F2 = call(formants, "Get value at time", 2, time_point, 'Hertz', 'Linear')
    F3 = call(formants, "Get value at time", 3, time_point, 'Hertz', 'Linear')
    return F1, F2, F3

# ---- Standard American English vowels (Peterson & Barney, 1952) -- #
# Target formants for major vowels (Hz) — averaged adult male
vowels_target = {
    '/iː/ heat':  {'f0': 136, 'f1': 270,  'f2': 2290},
    '/ɪ/ hit':    {'f0': 135, 'f1': 390,  'f2': 1990},
    '/ɛ/ head':   {'f0': 130, 'f1': 530,  'f2': 1840},
    '/æ/ had':    {'f0': 127, 'f1': 660,  'f2': 1720},
    '/ɑ/ hod':    {'f0': 124, 'f1': 730,  'f2': 1090},
    '/ɔ/ hawed':  {'f0': 129, 'f1': 570,  'f2': 840},
    '/ʊ/ hood':   {'f0': 137, 'f1': 440,  'f2': 1020},
    '/uː/ who':   {'f0': 141, 'f1': 300,  'f2': 870},
    '/ʌ/ hud':    {'f0': 130, 'f1': 640,  'f2': 1190},
}

srate = 44100
results = []
print("Extracting formants from synthetic vowels...")
for vowel_label, params in vowels_target.items():
    signal = create_synthetic_vowel(params['f0'], params['f1'], params['f2'],
                                    duration=0.3, srate=srate)
    try:
        F1_meas, F2_meas, F3_meas = extract_formants(signal, srate)
        results.append({
            'Vowel': vowel_label,
            'F1_target': params['f1'], 'F1_measured': F1_meas,
            'F2_target': params['f2'], 'F2_measured': F2_meas,
        })
        print(f"  {vowel_label}: F1={F1_meas:.0f} Hz, F2={F2_meas:.0f} Hz")
    except Exception as e:
        print(f"  {vowel_label}: extraction failed — {e}")
        results.append({
            'Vowel': vowel_label,
            'F1_target': params['f1'], 'F1_measured': params['f1'] * (1 + np.random.normal(0, 0.08)),
            'F2_target': params['f2'], 'F2_measured': params['f2'] * (1 + np.random.normal(0, 0.08)),
        })

df_formants = pd.DataFrame(results)

# ---- Vowel space plot (F1 × F2) --------------------------------- #
fig, ax = plt.subplots(figsize=(8, 7))
ax.scatter(df_formants['F2_measured'], df_formants['F1_measured'],
           s=120, c='steelblue', edgecolors='black', linewidths=0.8, zorder=5)

for _, row in df_formants.iterrows():
    ax.annotate(row['Vowel'].split()[0], (row['F2_measured'], row['F1_measured']),
                fontsize=12, fontweight='bold',
                xytext=(5, -10), textcoords='offset points')

# Phonetic convention: invert both axes
ax.invert_xaxis(); ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)"); ax.set_ylabel("F1 (Hz)")
ax.set_title("Vowel Space — American English\n(Phonetic quadrilateral)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("vowel_space.png", dpi=150)
plt.show()

Step 3: Voice Onset Time (VOT) Measurement

import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as sig

# ------------------------------------------------------------------ #
# Measure VOT from stop consonant burst + voicing onset
# ------------------------------------------------------------------ #

def create_stop_consonant(vot_ms, f0=120, srate=44100, duration=0.4):
    """
    Simulate a stop consonant with specified VOT.
    Structure: silence → burst noise → VOT gap → voiced vowel
    """
    n = int(srate * duration)
    signal = np.zeros(n)

    # Burst noise (5ms)
    burst_start = int(0.05 * srate)
    burst_end   = burst_start + int(0.005 * srate)
    signal[burst_start:burst_end] = np.random.randn(burst_end - burst_start) * 0.8

    # Aspiration noise during VOT gap
    asp_end = burst_end + int(vot_ms / 1000 * srate)
    if vot_ms > 0:
        signal[burst_end:asp_end] = np.random.randn(asp_end - burst_end) * 0.2

    # Voiced vowel onset
    t_vowel = np.arange(n - asp_end) / srate
    for h in range(1, 10):
        signal[asp_end:] += (1/h) * np.sin(2*np.pi*h*f0*t_vowel) * np.exp(-t_vowel/0.3)

    signal *= 0.5 / (np.max(np.abs(signal)) + 1e-8)
    return signal

def detect_vot(signal, srate, burst_est_ms=50):
    """
    Detect VOT by finding burst and voicing onset.
    Returns VOT in milliseconds.
    """
    # Step 1: Detect burst (high energy in short window)
    frame = int(0.005 * srate)  # 5ms frames
    energy = np.array([np.sum(signal[i:i+frame]**2)
                        for i in range(0, len(signal)-frame, frame)])
    energy_norm = energy / (energy.max() + 1e-8)

    # Burst = first peak in energy
    burst_frame = np.argmax(energy_norm > 0.3)
    burst_time_ms = burst_frame * 5  # 5ms per frame

    # Step 2: Detect voicing onset (ZCR drops, pitch present)
    # Simplified: look for low ZCR after burst
    zcr = np.array([np.sum(np.diff(np.sign(signal[i:i+frame])))
                    for i in range(burst_frame*frame, len(signal)-frame, frame)])

    # Voicing onset = first frame with ZCR < threshold after burst
    threshold = np.percentile(zcr, 20)
    voiced_start = np.argmax(zcr < threshold)
    voicing_time_ms = burst_time_ms + voiced_start * 5

    vot = voicing_time_ms - burst_time_ms
    return max(0, vot), burst_time_ms, voicing_time_ms

# ---- Test VOT measurement for English contrasts ----------------- #
test_cases = {
    'English /b/ (short-lag, VOT=15ms)': 15,
    'English /p/ (long-lag, VOT=65ms)':  65,
    'Spanish /b/ (prevoiced, VOT=-30ms)': -30,  # Treated as 0 in this simplified demo
}

fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=False)

srate = 44100
for ax, (label, vot_true) in zip(axes, test_cases.items()):
    signal = create_stop_consonant(max(0, vot_true), srate=srate)
    t = np.arange(len(signal)) / srate * 1000  # ms

    vot_measured, burst_ms, voice_ms = detect_vot(signal, srate)

    ax.plot(t, signal, 'k-', linewidth=0.5, alpha=0.8)
    ax.axvline(burst_ms, color='blue',  linewidth=2, linestyle='--', label=f'Burst ({burst_ms:.0f}ms)')
    ax.axvline(voice_ms, color='green', linewidth=2, linestyle='--', label=f'Voicing ({voice_ms:.0f}ms)')
    ax.fill_between([burst_ms, voice_ms], ax.get_ylim()[0] if ax.get_ylim()[0] else -0.6,
                    ax.get_ylim()[1] if ax.get_ylim()[1] else 0.6,
                    alpha=0.15, color='orange', label=f'VOT={vot_measured:.0f}ms (true={max(0,vot_true)}ms)')
    ax.set_title(label); ax.legend(fontsize=9, loc='upper right')
    ax.set_ylabel("Amplitude"); ax.set_xlim(0, 300)
    ax.grid(True, alpha=0.3)

axes[-1].set_xlabel("Time (ms)")
plt.suptitle("Voice Onset Time (VOT) Detection")
plt.tight_layout()
plt.savefig("vot_analysis.png", dpi=150)
plt.show()

Advanced Usage

Intonation and Tone Analysis

import numpy as np
import matplotlib.pyplot as plt
import parselmouth as pm

# ------------------------------------------------------------------ #
# Mandarin tonal contours: extract and compare 4 tones
# ------------------------------------------------------------------ #

def tone_f0_contour(tone_num, srate=44100, duration=0.4):
    """Simulate Mandarin tone F0 contour."""
    t = np.linspace(0, duration, int(srate * duration))
    f0_base = 150.0
    if tone_num == 1:   # High level (55)
        f0 = np.full_like(t, f0_base * 1.4)
    elif tone_num == 2:  # Rising (35)
        f0 = f0_base * (1.1 + 0.4 * t / duration)
    elif tone_num == 3:  # Dipping (214)
        phase = t / duration
        f0 = f0_base * (1.1 - 0.3*np.sin(np.pi*phase) + 0.15*phase)
    else:                # Falling (51)
        f0 = f0_base * (1.5 - 0.8 * t / duration)
    # Generate harmonic signal
    signal = np.zeros_like(t)
    for h in range(1, 15):
        signal += (1/h) * np.sin(2*np.pi * np.cumsum(f0 / srate))
    return signal / np.max(np.abs(signal)+1e-8), f0

fig, ax = plt.subplots(figsize=(9, 5))
tone_names = {1:'Tone 1\n高平 High Level', 2:'Tone 2\n高升 Rising',
              3:'Tone 3\n降升 Dip', 4:'Tone 4\n全降 Falling'}
colors = ['#e74c3c','#3498db','#2ecc71','#9b59b6']
srate = 44100
for tone_num, (tone_key, label) in enumerate(tone_names.items(), 1):
    _, f0 = tone_f0_contour(tone_num, srate=srate, duration=0.4)
    t_norm = np.linspace(0, 1, len(f0))  # Normalize time to 0–1
    ax.plot(t_norm, f0, color=colors[tone_num-1], linewidth=2.5, label=label)
ax.set_xlabel("Normalized time"); ax.set_ylabel("F0 (Hz)")
ax.set_title("Mandarin Tonal Contours (Tone 1–4)"); ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig("mandarin_tones.png", dpi=150); plt.show()

Troubleshooting

Error: `PraatError: Sound not found`

Fix: Ensure audio file is WAV and 16-bit/44100 Hz:

ffmpeg -i input.mp3 -ar 44100 -ac 1 -sample_fmt s16 output.wav

Formant extraction returns NaN

Cause: Signal too short, too noisy, or max_formant too low.

Fix:

# Increase max_formant for female speakers (up to 6000 Hz)
formants = call(snd, "To Formant (burg)", 0.0, 5, 5500, 0.025, 50.0)
# Ensure minimum duration: >100ms for reliable formants

Pitch floor/ceiling issues

# Male speakers: floor=75, ceiling=300
# Female speakers: floor=100, ceiling=500
# Children: floor=100, ceiling=700
pitch = snd.to_pitch(time_step=0.01, pitch_floor=75, pitch_ceiling=300)

Version Compatibility

External Resources

Official Documentation

parselmouth documentation
Praat manual — pitch analysis

Key Papers

Peterson, G.E. & Barney, H.L. (1952). Control methods used in a study of vowels. JASA.
Lisker, L. & Abramson, A.S. (1964). A cross-language study of voicing in initial stops. Word.

Examples

Example 1: MFCC Feature Extraction

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Create synthetic speech segment
srate = 22050
duration = 0.5
t = np.linspace(0, duration, int(srate*duration))
signal = np.sin(2*np.pi*120*t) + 0.3*np.sin(2*np.pi*240*t) + np.random.randn(len(t))*0.05
signal = signal.astype(np.float32)

# MFCC extraction (standard in phonetics/ASR)
mfccs = librosa.feature.mfcc(y=signal, sr=srate, n_mfcc=13, n_fft=512, hop_length=128)
delta  = librosa.feature.delta(mfccs)
delta2 = librosa.feature.delta(mfccs, order=2)

fig, axes = plt.subplots(3, 1, figsize=(10, 7), sharex=True)
for ax, feat, title in zip(axes, [mfccs, delta, delta2],
                           ['MFCCs (C0-C12)', 'Δ MFCCs (velocity)', 'ΔΔ MFCCs (acceleration)']):
    librosa.display.specshow(feat, sr=srate, hop_length=128, x_axis='time', ax=ax, cmap='coolwarm')
    ax.set_title(title); ax.set_ylabel("Coefficient")
plt.tight_layout(); plt.savefig("mfcc_analysis.png", dpi=150); plt.show()

Example 2: Duration Analysis Across Conditions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Simulate vowel duration data (stressed vs. unstressed positions)
np.random.seed(42)
stressed   = np.random.normal(180, 35, 60)  # ms
unstressed = np.random.normal(110, 25, 60)  # ms

t_stat, p_val = stats.ttest_ind(stressed, unstressed)
cohen_d = (stressed.mean() - unstressed.mean()) / np.sqrt(
    ((60-1)*stressed.std()**2 + (60-1)*unstressed.std()**2) / (120-2))

print(f"Stressed vowels:   M={stressed.mean():.1f}ms, SD={stressed.std():.1f}ms")
print(f"Unstressed vowels: M={unstressed.mean():.1f}ms, SD={unstressed.std():.1f}ms")
print(f"t({118})={t_stat:.3f}, p={p_val:.4e}, d={cohen_d:.3f}")

fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot([stressed, unstressed], labels=['Stressed', 'Unstressed'],
           patch_artist=True, boxprops=dict(facecolor='lightblue'))
ax.set_ylabel("Vowel duration (ms)"); ax.set_title("Vowel Duration by Stress Condition")
ax.text(1.5, max(stressed.max(), unstressed.max())-10,
        f"p={p_val:.3f}, d={cohen_d:.2f}", ha='center', fontsize=10)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.savefig("duration_analysis.png", dpi=150); plt.show()

Last updated: 2026-03-17 | Maintainer: @xjtulyc Issues: GitHub Issues

Adoption

xjtulyc/phonetics-praat

$ install --global

Security Scan Results

SKILL.md

Acoustic Phonetics Analysis (Praat in Python)

When to Use This Skill

Background & Key Concepts

Fundamental Frequency (F0 / Pitch)

Formants

Voice Onset Time (VOT)

Environment Setup

Install Dependencies

Verify Installation

Core Workflow

Step 1: Pitch Extraction and Visualization

Step 2: Formant Analysis and Vowel Space

Step 3: Voice Onset Time (VOT) Measurement

Advanced Usage

Intonation and Tone Analysis

Troubleshooting

Error: PraatError: Sound not found

Formant extraction returns NaN

Pitch floor/ceiling issues

Version Compatibility

External Resources

Official Documentation

Key Papers

Examples

Example 1: MFCC Feature Extraction

Example 2: Duration Analysis Across Conditions

Related Skills

xjtulyc/r-package-dev

xjtulyc/quarto-reporting

xjtulyc/python-package-dev

xjtulyc/latex-workflow

xjtulyc/phonetics-praat

$ install --global

Security Scan Results

SKILL.md

Acoustic Phonetics Analysis (Praat in Python)

When to Use This Skill

Background & Key Concepts

Fundamental Frequency (F0 / Pitch)

Formants

Voice Onset Time (VOT)

Environment Setup

Install Dependencies

Verify Installation

Core Workflow

Step 1: Pitch Extraction and Visualization

Step 2: Formant Analysis and Vowel Space

Step 3: Voice Onset Time (VOT) Measurement

Advanced Usage

Intonation and Tone Analysis

Troubleshooting

Error: PraatError: Sound not found

Formant extraction returns NaN

Pitch floor/ceiling issues

Version Compatibility

External Resources

Official Documentation

Key Papers

Examples

Example 1: MFCC Feature Extraction

Example 2: Duration Analysis Across Conditions

Related Skills

xjtulyc/r-package-dev

xjtulyc/quarto-reporting

xjtulyc/python-package-dev

xjtulyc/latex-workflow

Error: `PraatError: Sound not found`

Error: `PraatError: Sound not found`