Spelly Studio is the world's first text-to-speech engine that measures its own pronunciation accuracy at the phoneme level. Generate multilingual speech and get a detailed quality score — not just "does it sound good?", but "did it pronounce every sound correctly?"
Most TTS engines generate audio and hope it sounds right. Spelly Studio proves it is right by comparing what the synthesizer should have said with what the acoustic model hears it actually say.
Traditional TTS evaluation relies on human listening (MOS) or signal-level metrics (PESQ, STOI). Spelly Studio evaluates at the level of phonemes, the actual building blocks of speech.
Did the TTS produce the correct phoneme for each grapheme? We compare the reference phoneme sequence (from the text) against the detected phonemes (from the audio) and score per-phoneme correctness.
Did the TTS drop any sounds? We measure insertions, deletions, and substitutions in the phoneme alignment. A missing consonant cluster or swallowed vowel is caught and scored.
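To make the idea concrete, here is a minimal sketch of how insertions, deletions, and substitutions can be counted between a reference phoneme sequence and a detected one. It uses a plain edit-distance (Levenshtein) alignment as a stand-in; the function name `align_phonemes`, the unit edit costs, and the accuracy formula (matches divided by reference length) are illustrative assumptions, not Spelly Studio's actual scoring.

```python
from typing import List, Tuple


def align_phonemes(reference: List[str], detected: List[str]) -> Tuple[int, int, int, int]:
    """Return (matches, substitutions, deletions, insertions).

    Hypothetical helper: a unit-cost edit-distance alignment, used here
    only to illustrate the idea of phoneme-level error counting.
    """
    n, m = len(reference), len(detected)
    # cost[i][j] = minimum edits to turn reference[:i] into detected[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference phoneme
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every detected phoneme
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != detected[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Walk back from the end of both sequences and classify each step.
    matches = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(reference[i - 1] != detected[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                if mismatch:
                    subs += 1
                else:
                    matches += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return matches, subs, dels, ins


# Example: the TTS swallows the /n/ in a five-phoneme reference.
reference = ["k", "ɜ", "n", "ə", "l"]
detected = ["k", "ɜ", "ə", "l"]
matches, subs, dels, ins = align_phonemes(reference, detected)
accuracy = matches / len(reference)  # assumed formula, for illustration
print(matches, subs, dels, ins, accuracy)  # 4 0 1 0 0.8
```

In this example the dropped /n/ shows up as one deletion, so four of the five reference phonemes count as correct.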
Beyond individual phonemes, we measure the overall intelligibility of the utterance and the naturalness of the speech rate. Is the output actually understandable?
We currently support English (US/GB), German, Spanish, French, Italian, and Catalan, each with its own phoneme inventory, diphthong rules, and respelling system. We will continue to expand our language support in the future.
No GPU required. The entire pipeline — TTS synthesis + acoustic model inference + phoneme alignment + scoring — runs on CPU in milliseconds. Ideal for high-throughput batch processing.
Not just a global score. Every word gets its own accuracy rating, phoneme-level error analysis, and confidence plot. Identify exactly which words your TTS struggles with.
Spelly Studio is designed for teams who need objective, repeatable, automated pronunciation quality assessment.
You ship a TTS model. You need to know if version 2.1 pronounces "colonel" better than 2.0. Run both through Spelly Studio, get a phoneme-level diff, and quantify the improvement.
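A version comparison like this could be sketched as follows, assuming you already have a per-word accuracy score (0.0 to 1.0) for each model version. The function `diff_word_accuracy` and the input dictionaries are hypothetical and only illustrate the workflow; they are not part of a published Spelly Studio API.

```python
from typing import Dict, List, Tuple


def diff_word_accuracy(
    old: Dict[str, float], new: Dict[str, float]
) -> List[Tuple[str, float, float, float]]:
    """Return (word, old_score, new_score, delta) rows, worst regressions first.

    Hypothetical helper: only words scored in both versions are compared.
    """
    rows = []
    for word in sorted(old.keys() & new.keys()):
        delta = round(new[word] - old[word], 3)
        rows.append((word, old[word], new[word], delta))
    rows.sort(key=lambda row: row[3])  # most negative delta first
    return rows


# Made-up scores for two model versions, for illustration only.
v2_0 = {"colonel": 0.6, "hello": 1.0}
v2_1 = {"colonel": 1.0, "hello": 0.8}
for word, before, after, delta in diff_word_accuracy(v2_0, v2_1):
    print(f"{word}: {before:.2f} -> {after:.2f} ({delta:+.2f})")
```

Sorting regressions to the top makes it easy to spot words that got worse even when the average score improved.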
Generate thousands of pronunciation examples for language learning apps. Use Spelly Studio to filter out low-quality syntheses before they reach students. Only ship audio with >85% phoneme accuracy.
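As a rough sketch, the filtering step might look like the snippet below. The `ScoredClip` record, its field names, and the `filter_clips` function are assumptions made for illustration; only the 85% threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScoredClip:
    """Hypothetical record for one synthesized example and its score."""
    text: str
    audio_path: str
    phoneme_accuracy: float  # fraction between 0.0 and 1.0


def filter_clips(clips: Iterable[ScoredClip], min_accuracy: float = 0.85) -> List[ScoredClip]:
    """Keep only clips whose phoneme accuracy is strictly above the threshold."""
    return [clip for clip in clips if clip.phoneme_accuracy > min_accuracy]


# Made-up clips for illustration only.
clips = [
    ScoredClip("colonel", "clips/colonel.wav", 0.80),
    ScoredClip("hello", "clips/hello.wav", 0.97),
]
shippable = filter_clips(clips)
print([clip.text for clip in shippable])  # ['hello']
```

Clips that fall below the threshold can be regenerated or routed to human review instead of being discarded outright.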
Produce AI-narrated audiobooks at scale. Automatically flag chapters with pronunciation errors (e.g., foreign names, technical terms) for human review. Catch "Hermione" mispronounced before it ships.
Your voice assistant speaks street names, restaurant names, and medical terms. Spelly Studio tells you which terms it butchers so you can add phonetic respellings or switch to a better voice model.
Benchmarking new TTS architectures? Spelly Studio gives you a phoneme-level objective metric to compare against baselines. Report accuracy, completeness, and fluency scores alongside MOS scores in your paper.
Your app supports more than one language. Did your TTS vendor actually train on German umlauts, the Spanish trilled R, and French nasal vowels? Spelly Studio evaluates each language independently with language-specific phoneme scoring.
Subjective listening tests don't scale. Signal-level metrics don't understand speech. Spelly Studio bridges the gap with linguistically aware, objective scoring.
| Criterion | MOS (Human Listening) | PESQ / STOI | Spelly Studio |
|---|---|---|---|
| Scalable | ✗ Expensive, slow | ✓ Automated | ✓ Fully automated |
| Phoneme-level | ✗ Listener impression only | ✗ Signal-level | ✓ Per-phoneme accuracy |
| Per-word scoring | ✗ Not feasible | ✗ Not supported | ✓ Every word scored |
| Detects omissions | ✗ Inconsistent | ✗ Signal may still match | ✓ Catches deletions |
| Multilingual | ✗ Needs native speakers | ✗ Unaware of language-specific nuances | ✓ Multiple languages via IPA |
| Repeatable | ✗ Inter-rater variance | ✓ Deterministic | ✓ Deterministic |
| Cost | $$$ per hour of audio | $ CPU compute | $ CPU compute |
Spelly Studio is built on the same production-grade infrastructure that powers spelly.online — a pronunciation learning platform with 500+ lessons and real-time speech assessment.
A multilingual acoustic model fine-tuned for phoneme recognition and architecturally optimized for fast CPU inference.
Forced alignment between reference phonemes (from text) and detected phonemes (from audio). Handles insertions, deletions, and substitutions with per-phoneme confidence scores.
Everything is represented in the International Phonetic Alphabet (IPA), so no language-specific encoding is needed and the same pipeline works across languages.
Developed in collaboration with ZHAW (Zurich University of Applied Sciences). Swiss precision in research, engineering, and quality assurance.
We're currently exploring partnerships with vendors, e-learning platforms, and content studios. If you need pronunciation-aware quality assessment at scale, let's talk.
vernaraglobal@gmail.com spelly.online