First pronunciation-aware TTS engine

Generate speech.
Know it's right.

Spelly Studio is the world's first text-to-speech engine that measures its own pronunciation accuracy at the phoneme level. Generate multilingual speech and get a detailed quality score — not just "does it sound good?", but "did it pronounce every sound correctly?"

Live Pronunciation Report
Accuracy: 92%
Completeness: 88%
Fluency: 85%
Intelligibility: 90%
Reference: Hello /həˈloʊ/
Detected: /h/ /ə/ /l/ /u/
Legend: Correct, Mispronunciation, Insertion, Omission

The Pronunciation-Aware Pipeline

Most TTS engines generate audio and hope it sounds right. Spelly Studio verifies it by comparing the phonemes the synthesizer should have produced with the phonemes an acoustic model actually detects in the audio.

Pipeline 1: Text-to-Speech Generation
1. Source: Input Text ("Hello world")
2. Reference: Phonemization (IPA phoneme sequence)
3. Generation: TTS Synthesis (7 languages, CPU-fast, 16 kHz WAV output)
4. Output: Audio Output (WAV ready for playback)
Pipeline 2: Pronunciation Quality Assessment
1. Input: TTS Audio (16 kHz PCM WAV)
2. ML inference: Acoustic Model (multilingual detector)
3. Decoding: Detected Phonemes (CTC decoder, IPA sequence)
4. Scoring: Forced Alignment (CTC alignment, per-word scoring, 4 metrics + phonemes, reference vs. detected)
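
The decoding stage of Pipeline 2 names a CTC decoder. As a rough illustration only, greedy CTC decoding can be sketched as follows, assuming the acoustic model emits one label per audio frame. The blank symbol, the frame labels, and the function name `ctc_greedy_decode` are assumptions made for this example and do not describe Spelly Studio's actual implementation.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; real models usually reserve an index for it


def ctc_greedy_decode(frame_labels):
    """Collapse runs of repeated frame labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != BLANK]


# Hypothetical per-frame best labels for the word "hello"
frames = ["_", "h", "h", "_", "ə", "ə", "_", "l", "l", "_", "_", "oʊ", "oʊ", "_"]
print(ctc_greedy_decode(frames))  # ['h', 'ə', 'l', 'oʊ']
```

A blank between two identical labels keeps them apart, which is how a genuinely repeated phoneme survives the collapse step.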

What Spelly Studio Checks

Traditional TTS evaluation relies on human listening (MOS) or signal-level metrics (PESQ, STOI). Spelly Studio evaluates at the phoneme level, the actual building blocks of speech.

Phoneme Accuracy

Did the TTS produce the correct phoneme at every position? We compare the reference phoneme sequence (from the text) against the detected phonemes (from the audio) and score per-phoneme correctness.

Completeness

Did the TTS drop any sounds? We measure insertions, deletions, and substitutions in the phoneme alignment. A missing consonant cluster or swallowed vowel is caught and scored.
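
The comparison of reference and detected phonemes can be illustrated with a plain edit-distance alignment. This is a minimal sketch under stated assumptions: the functions `align` and `score` and both formulas (accuracy as correct phonemes over aligned positions, completeness as the share of reference phonemes not omitted) are hypothetical choices for illustration, since the page does not publish the exact scoring formulas.

```python
def align(reference, detected):
    """Levenshtein alignment; returns a list of (op, ref_phoneme, det_phoneme)."""
    n, m = len(reference), len(detected)
    # cost[i][j] = edit distance between reference[:i] and detected[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != detected[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and cost[i][j] == (
            cost[i - 1][j - 1] + (reference[i - 1] != detected[j - 1])
        )
        if diag:
            same = reference[i - 1] == detected[j - 1]
            op = "correct" if same else "mispronunciation"
            ops.append((op, reference[i - 1], detected[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("omission", reference[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, detected[j - 1]))
            j -= 1
    return ops[::-1]


def score(reference, detected):
    """Assumed formulas; the reference sequence must be non-empty."""
    ops = align(reference, detected)
    counts = {"correct": 0, "mispronunciation": 0, "insertion": 0, "omission": 0}
    for op, _, _ in ops:
        counts[op] += 1
    accuracy = counts["correct"] / len(ops)
    completeness = 1 - counts["omission"] / len(reference)
    return accuracy, completeness, counts


# "Hello": reference /h ə l oʊ/, detected /h ə l u/
print(score(["h", "ə", "l", "oʊ"], ["h", "ə", "l", "u"]))
```

In this sketch a wrong final vowel lowers accuracy but not completeness, while a dropped consonant lowers both.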

Fluency & Intelligibility

Beyond individual phonemes, we measure the overall intelligibility of the utterance and the naturalness of the speech rate. Is the output actually understandable?

Multilingual Coverage

We currently support English (US/GB), German, Spanish, French, Italian, and Catalan, each with its own phoneme inventory, diphthong rules, and respelling system. We will continue to expand our language support in the future.

CPU-Fast Inference

No GPU required. The entire pipeline — TTS synthesis + acoustic model inference + phoneme alignment + scoring — runs on CPU in milliseconds. Ideal for high-throughput batch processing.

Per-Word Breakdown

Not just a global score. Every word gets its own accuracy rating, phoneme-level error analysis, and confidence plot. Identify exactly which words your TTS struggles with.
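
A per-word breakdown could look roughly like the sketch below, which ranks words from worst to best. It assumes the phonemes have already been grouped by word, and it uses Python's `difflib.SequenceMatcher` as a stand-in matcher rather than the CTC forced alignment described elsewhere on this page. The function name `word_accuracy` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, detected):
    """Share of reference phonemes that are matched, in order, in the detected list."""
    matcher = SequenceMatcher(None, reference, detected, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


# Hypothetical per-word sequences: (word, reference phonemes, detected phonemes)
words = [
    ("world", ["w", "ɜː", "l", "d"], ["w", "ɜː", "l", "d"]),
    ("hello", ["h", "ə", "l", "oʊ"], ["h", "ə", "l", "u"]),
]

# Sort ascending so the words the TTS struggles with come first
report = sorted(
    ((word, word_accuracy(ref, det)) for word, ref, det in words),
    key=lambda item: item[1],
)
for word, acc in report:
    print(f"{word}: {acc:.0%}")
```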

Use Cases

Spelly Studio is designed for teams who need objective, repeatable, automated pronunciation quality assessment.

TTS Vendor Testing

You ship a TTS model. You need to know if version 2.1 pronounces "colonel" better than 2.0. Run both through Spelly Studio, get a phoneme-level diff, and quantify the improvement.
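
One simple way to quantify such a version comparison is to subtract per-word accuracies. The sketch below assumes each version's results are available as a word-to-accuracy mapping; `compare_versions` and the numbers are made up for illustration and are not measured results.

```python
def compare_versions(old_scores, new_scores):
    """Per-word accuracy change (new minus old), largest regression first."""
    shared = old_scores.keys() & new_scores.keys()
    deltas = {word: new_scores[word] - old_scores[word] for word in shared}
    return sorted(deltas.items(), key=lambda item: item[1])


# Hypothetical per-word accuracies for two model versions
v2_0 = {"colonel": 0.50, "hello": 1.00, "world": 0.75}
v2_1 = {"colonel": 1.00, "hello": 1.00, "world": 0.50}

for word, delta in compare_versions(v2_0, v2_1):
    print(f"{word}: {delta:+.2f}")
```

Listing regressions first makes it easy to see whether an improvement on one word came at the cost of another.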

E-Learning Content Production

Generate thousands of pronunciation examples for language learning apps. Use Spelly Studio to filter out low-quality syntheses before they reach students. Only ship audio with >85% phoneme accuracy.
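
The filtering step can be sketched as a threshold split. The 0.85 cutoff comes from the rule above; the clip identifiers, scores, and the function name `split_by_quality` are hypothetical.

```python
THRESHOLD = 0.85  # only clips strictly above this accuracy are kept


def split_by_quality(clips, threshold=THRESHOLD):
    """Partition (clip_id, accuracy) pairs into kept and rejected clip ids."""
    keep = [clip_id for clip_id, acc in clips if acc > threshold]
    reject = [clip_id for clip_id, acc in clips if acc <= threshold]
    return keep, reject


# Hypothetical batch of synthesized clips with their phoneme accuracy
batch = [("clip_001.wav", 0.92), ("clip_002.wav", 0.85), ("clip_003.wav", 0.60)]
keep, reject = split_by_quality(batch)
print("ship:", keep)
print("review or regenerate:", reject)
```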

Audiobook Production

Produce AI-narrated audiobooks at scale. Automatically flag chapters with pronunciation errors (e.g., foreign names, technical terms) for human review. Catch "Hermione" mispronounced before it ships.

Voice Assistant Tuning

Your voice assistant speaks street names, restaurant names, and medical terms. Spelly Studio tells you which terms it butchers so you can add phonetic respellings or switch to a better voice model.

Speech Research

Benchmarking new TTS architectures? Spelly Studio gives you a phoneme-level objective metric to compare against baselines. Report accuracy, completeness, and fluency scores alongside MOS scores in your paper.

Localization Testing

Your app supports more than one language. Did your TTS vendor actually train on German umlauts, the Spanish trilled R, and French nasal vowels? Spelly Studio evaluates each language independently with language-specific phoneme scoring.

TTS Evaluation: The Old Way vs. Spelly Studio

Subjective listening tests don't scale. Signal-level metrics don't understand speech. Spelly Studio bridges the gap with linguistically-aware, objective scoring.

Metric | MOS (Human Listening) | PESQ / STOI | Spelly Studio
Scalable | ✗ Expensive, slow | ✓ Automated | ✓ Fully automated
Phoneme-level | ✗ Listener impression only | ✗ Signal-level | ✓ Per-phoneme accuracy
Per-word scoring | ✗ Not feasible | ✗ Not supported | ✓ Every word scored
Detects omissions | ✗ Inconsistent | ✗ Signal may still match | ✓ Catches deletions
Multilingual | ✗ Needs native speakers | ✗ Not aware of the language nuances | ✓ Multiple languages via IPA approach
Repeatable | ✗ Inter-rater variance | ✓ Deterministic | ✓ Deterministic
Cost | $$$ per hour of audio | $ CPU compute | $ CPU compute

Technical Architecture

Spelly Studio is built on the same production-grade infrastructure that powers spelly.online — a pronunciation learning platform with 500+ lessons and real-time speech assessment.

Acoustic Model

A multilingual model fine-tuned for phoneme recognition, with an architecture optimized for fast CPU inference.

CTC Alignment

Forced alignment between reference phonemes (from text) and detected phonemes (from audio). Handles insertions, deletions, and substitutions with per-phoneme confidence scores.

IPA Lingua Franca

Everything is represented in the International Phonetic Alphabet (IPA). No language-specific encoding is needed, so the same pipeline works for multiple languages.
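
To compare phonemes, an IPA transcription first has to be split into units. The following is a minimal sketch, assuming a small hand-picked set of two-character units (diphthongs and affricates). The `MULTI_CHAR` list and the function `tokenize_ipa` are illustrative assumptions, and a real inventory would be language-specific.

```python
# Assumed multi-character units; a real inventory would differ per language
MULTI_CHAR = ("oʊ", "aɪ", "aʊ", "eɪ", "ɔɪ", "tʃ", "dʒ")
STRESS_MARKS = "ˈˌ"
LENGTH_MARK = "ː"


def tokenize_ipa(transcription):
    """Split an IPA string into phonemes, preferring known multi-character units."""
    text = transcription.strip("/")
    phonemes, i = [], 0
    while i < len(text):
        if text[i] in STRESS_MARKS or text[i].isspace():
            i += 1  # stress marks and spaces are skipped, not emitted
            continue
        unit = next((u for u in MULTI_CHAR if text.startswith(u, i)), text[i])
        i += len(unit)
        if i < len(text) and text[i] == LENGTH_MARK:
            unit += LENGTH_MARK  # attach the length mark to the preceding unit
            i += 1
        phonemes.append(unit)
    return phonemes


print(tokenize_ipa("/həˈloʊ/"))  # ['h', 'ə', 'l', 'oʊ']
```

This sketch does not handle combining diacritics, such as the tilde often used for nasal vowels, which a production tokenizer would need to attach to the preceding symbol.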

Swiss Engineering

Developed in collaboration with ZHAW (Zurich University of Applied Sciences). Swiss precision in research, engineering, and quality assurance.

Interested in Spelly Studio?

We're currently exploring partnerships with TTS vendors, e-learning platforms, and content studios. If you need pronunciation-aware quality assessment at scale, let's talk.

vernaraglobal@gmail.com spelly.online