Spelly Studio — Pronunciation-Aware TTS Quality Engine

How It Works

The Pronunciation-Aware Pipeline

Most TTS engines generate audio and hope it sounds right. Spelly Studio checks that it is right by comparing what the synthesizer should have said with what an acoustic model actually hears in the generated audio.

Pipeline 1: Text-to-Speech Generation

Pipeline 2: Pronunciation Quality Assessment

What It Checks

What Spelly Studio Checks

Traditional TTS evaluation relies on human listening (MOS) or signal-level metrics (PESQ, STOI). Spelly Studio evaluates at the level of phonemes, the actual building blocks of speech.

Phoneme Accuracy

Did the TTS produce the correct phoneme for each grapheme? We compare the reference phoneme sequence (from the text) against the detected phonemes (from the audio) and score per-phoneme correctness.
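As a rough illustration of per-phoneme scoring, the sketch below marks each reference phoneme as matched or not and reports the matched fraction. It uses Python's standard difflib matcher as a stand-in for the real alignment; the function name phoneme_accuracy and the example transcriptions are assumptions made for illustration, not Spelly Studio's actual API or output.

```python
from difflib import SequenceMatcher


def phoneme_accuracy(reference, detected):
    """Return (accuracy, flags); flags[i] is True when reference[i] was matched.

    Assumption: difflib's longest-matching-block heuristic stands in for the
    product's alignment step, which is not described in detail here.
    """
    matcher = SequenceMatcher(a=reference, b=detected, autojunk=False)
    flags = [False] * len(reference)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    accuracy = sum(flags) / len(reference) if reference else 1.0
    return accuracy, flags


# "colonel" with a hypothetical vowel error in the detected sequence
reference = ["k", "ɜː", "n", "ə", "l"]
detected = ["k", "ɒ", "n", "ə", "l"]
accuracy, flags = phoneme_accuracy(reference, detected)
print(accuracy, flags)
```

Four of the five reference phonemes are matched here, so the wrong vowel is the only one flagged.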

Completeness

Did the TTS drop any sounds? We measure insertions, deletions, and substitutions in the phoneme alignment. A missing consonant cluster or swallowed vowel is caught and scored.
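Counting insertions, deletions, and substitutions can be sketched with a classic minimum-edit-distance alignment. This is a generic textbook technique under stated assumptions, not necessarily the scoring Spelly Studio uses; the function name edit_operations and the completeness formula at the end are hypothetical.

```python
def edit_operations(reference, detected):
    """Count substitutions, deletions, and insertions in a minimum-edit alignment."""
    n, m = len(reference), len(detected)
    # cost[i][j] = edit distance between reference[:i] and detected[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != detected[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion (sound dropped)
                cost[i][j - 1] + 1,             # insertion (extra sound)
            )
    # Walk back from the end to recover which operations were used.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and reference[i - 1] != detected[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            counts["substitutions"] += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


# Hypothetical example: the "θ" is swallowed in the detected sequence
reference = ["s", "t", "ɹ", "ɛ", "ŋ", "θ", "s"]
detected = ["s", "t", "ɹ", "ɛ", "ŋ", "s"]
counts = edit_operations(reference, detected)
# Assumed definition: share of reference phonemes that were not dropped
completeness = 1 - counts["deletions"] / len(reference)
print(counts, completeness)
```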

Fluency & Intelligibility

Beyond individual phonemes, we measure the overall intelligibility of the utterance and the naturalness of the speech rate. Is the output actually understandable?

Multilingual Coverage

We currently support English (US/GB), German, Spanish, French, Italian, and Catalan, each with language-specific phoneme inventories, diphthong rules, and respelling systems. We will continue to expand language support in the future.

CPU-Fast Inference

No GPU required. The entire pipeline — TTS synthesis + acoustic model inference + phoneme alignment + scoring — runs on CPU in milliseconds. Ideal for high-throughput batch processing.

Per-Word Breakdown

Not just a global score. Every word gets its own accuracy rating, phoneme-level error analysis, and confidence plot. Identify exactly which words your TTS struggles with.
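A per-word breakdown can be pictured as grouping phoneme-level results by the word they belong to. The sketch below assumes the alignment step yields one (reference, detected) pair per reference phoneme, with None for a dropped sound; the WordScore type, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WordScore:
    word: str
    accuracy: float
    errors: list = field(default_factory=list)  # (reference, detected or None)


def per_word_breakdown(words, aligned):
    """Group phoneme-level results by word.

    words:   list of (word, number_of_reference_phonemes)
    aligned: one (reference, detected) pair per reference phoneme, in order;
             detected is None when the phoneme was dropped (assumed format).
    """
    scores = []
    position = 0
    for word, phoneme_count in words:
        pairs = aligned[position:position + phoneme_count]
        position += phoneme_count
        errors = [(ref, det) for ref, det in pairs if ref != det]
        accuracy = (len(pairs) - len(errors)) / len(pairs) if pairs else 1.0
        scores.append(WordScore(word, accuracy, errors))
    return scores


words = [("the", 2), ("colonel", 5)]
aligned = [
    ("ð", "ð"), ("ə", "ə"),
    ("k", "k"), ("ɜː", "ɒ"), ("n", "n"), ("ə", None), ("l", "l"),
]
scores = per_word_breakdown(words, aligned)
for score in scores:
    print(score.word, score.accuracy, score.errors)
```

In this made-up example "the" scores 1.0 while "colonel" scores 0.6, with one substituted vowel and one dropped vowel listed as its errors.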

Who It's For

Use Cases

Spelly Studio is designed for teams who need objective, repeatable, automated pronunciation quality assessment.

TTS Vendor Testing

You ship a TTS model. You need to know if version 2.1 pronounces "colonel" better than 2.0. Run both through Spelly Studio, get a phoneme-level diff, and quantify the improvement.

E-Learning Content Production

Generate thousands of pronunciation examples for language learning apps. Use Spelly Studio to filter out low-quality syntheses before they reach students. Only ship audio with >85% phoneme accuracy.
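The filtering step could look roughly like the sketch below. It assumes each clip has already been scored and carries a phoneme_accuracy value between 0 and 1; the field names and the sample scores are invented for illustration.

```python
def split_by_accuracy(clips, threshold=0.85):
    """Split scored clips into (kept, rejected).

    The comparison is strict, mirroring ">85%": a clip at exactly 0.85 is rejected.
    """
    kept, rejected = [], []
    for clip in clips:
        target = kept if clip["phoneme_accuracy"] > threshold else rejected
        target.append(clip)
    return kept, rejected


# Hypothetical scores for three synthesized words
clips = [
    {"text": "thorough", "phoneme_accuracy": 0.92},
    {"text": "colonel", "phoneme_accuracy": 0.60},
    {"text": "squirrel", "phoneme_accuracy": 0.85},
]
kept, rejected = split_by_accuracy(clips)
print([clip["text"] for clip in kept])
print([clip["text"] for clip in rejected])
```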

Audiobook Production

Produce AI-narrated audiobooks at scale. Automatically flag chapters with pronunciation errors (e.g., foreign names, technical terms) for human review. Catch "Hermione" mispronounced before it ships.

Voice Assistant Tuning

Your voice assistant speaks street names, restaurant names, and medical terms. Spelly Studio tells you which terms it butchers so you can add phonetic respellings or switch to a better voice model.

Speech Research

Benchmarking new TTS architectures? Spelly Studio gives you a phoneme-level objective metric to compare against baselines. Report accuracy, completeness, and fluency scores alongside MOS scores in your paper.

Localization Testing

Your app supports more than one language. Did your TTS vendor actually train on German umlauts, the Spanish trilled R, and French nasal vowels? Spelly Studio evaluates each language independently with language-specific phoneme scoring.

Why This Matters

TTS Evaluation: The Old Way vs. Spelly Studio

Subjective listening tests don't scale. Signal-level metrics don't understand speech. Spelly Studio bridges the gap with linguistically aware, objective scoring.

Criterion	MOS (Human Listening)	PESQ / STOI	Spelly Studio
Scalable	✗ Expensive, slow	✓ Automated	✓ Fully automated
Phoneme-level	✗ Listener impression only	✗ Signal-level	✓ Per-phoneme accuracy
Per-word scoring	✗ Not feasible	✗ Not supported	✓ Every word scored
Detects omissions	✗ Inconsistent	✗ Signal may still match	✓ Catches deletions
Multilingual	✗ Needs native speakers	✗ Not aware of language nuances	✓ Multiple languages via an IPA-based approach
Repeatable	✗ Inter-rater variance	✓ Deterministic	✓ Deterministic
Cost	$$$ per hour of audio	$ CPU compute	$ CPU compute

Under the Hood

Technical Architecture

Spelly Studio is built on the same production-grade infrastructure that powers spelly.online — a pronunciation learning platform with 500+ lessons and real-time speech assessment.

Acoustic Model

A multilingual model fine-tuned for phoneme recognition, with an architecture optimized for fast CPU inference.

CTC Alignment

Forced alignment between reference phonemes (from text) and detected phonemes (from audio). Handles insertions, deletions, and substitutions with per-phoneme confidence scores.
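For readers unfamiliar with CTC forced alignment, here is a minimal pure-Python sketch of the textbook Viterbi version: it finds the best frame-by-frame path for a known phoneme sequence and averages the frame probabilities per phoneme as a simple confidence. The vocabulary, frame values, and the confidence definition are assumptions for illustration; the production aligner is not described in enough detail here to reproduce it.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi forced alignment of `targets` to frames under the CTC topology.

    log_probs: list of frames (at least one), each a list of per-token log-probabilities.
    targets:   token ids of the reference phonemes (no blanks).
    Returns (path, confidences): path[t] is the token emitted at frame t,
    confidences[k] is the mean frame probability of targets[k] (assumed definition).
    """
    ext = [blank]  # targets interleaved with blanks: b t1 b t2 b ...
    for token in targets:
        ext += [token, blank]
    num_frames, num_states = len(log_probs), len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][blank]
    if num_states > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, num_frames):
        for s in range(num_states):
            candidates = [(score[t - 1][s], s)]  # stay in the same state
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))  # advance by one
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends on the last label or on the trailing blank.
    s = num_states - 1
    if num_states > 1 and score[-1][s - 1] > score[-1][s]:
        s -= 1
    if score[-1][s] == neg_inf:
        raise ValueError("not enough frames to emit every target")
    states = [0] * num_frames
    for t in range(num_frames - 1, -1, -1):
        states[t] = s
        s = back[t][s]
    path = [ext[state] for state in states]
    confidences = []
    for k, token in enumerate(targets):
        probs = [math.exp(log_probs[t][token])
                 for t in range(num_frames) if states[t] == 2 * k + 1]
        confidences.append(sum(probs) / len(probs))
    return path, confidences


# Hypothetical vocabulary: 0 = blank, 1 = "k", 2 = "æ", 3 = "t"
frames = [
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.20],
    [0.10, 0.05, 0.05, 0.80],
]
log_probs = [[math.log(p) for p in frame] for frame in frames]
path, confidences = ctc_forced_align(log_probs, targets=[1, 2, 3])
print(path, confidences)
```

A low confidence on one target in such an alignment points at the frames where the audio disagrees with the reference phoneme.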

IPA Lingua Franca

Everything is represented in the International Phonetic Alphabet (IPA). No language-specific encoding is needed, so the same pipeline works across multiple languages.
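One practical consequence of an IPA-only representation is that a single tokenizer can serve every language. The sketch below splits an IPA string into phoneme units using only Unicode properties: stress marks are dropped, length marks and diacritics attach to the preceding symbol, and a tie bar joins two symbols into one unit. These rules and the function name tokenize_ipa are simplifying assumptions, not Spelly Studio's actual preprocessing.

```python
import unicodedata

SKIPPED = {"ˈ", "ˌ", "."}          # stress marks and syllable breaks
TIE_BARS = {"\u0361", "\u035c"}    # combining double inverted breve / breve below


def tokenize_ipa(transcription):
    """Split an IPA string into phoneme units, independent of language."""
    tokens = []
    join_next = False  # True right after a tie bar
    for ch in unicodedata.normalize("NFD", transcription):
        if ch in SKIPPED or ch.isspace():
            continue
        is_tie = ch in TIE_BARS
        # Combining marks (e.g. nasalization) and modifier letters (e.g. "ː", "ʰ")
        is_modifier = unicodedata.combining(ch) > 0 or unicodedata.category(ch) == "Lm"
        if tokens and (is_tie or is_modifier or join_next):
            tokens[-1] += ch
        else:
            tokens.append(ch)
        join_next = is_tie
    return [unicodedata.normalize("NFC", token) for token in tokens]


print(tokenize_ipa("ˈkɜːnəl"))           # English "colonel"
print(tokenize_ipa("bɔ\u0303ʒuʁ"))       # French "bonjour", nasal vowel kept as one unit
print(tokenize_ipa("t\u0361ʃiːz"))       # English "cheese", affricate kept as one unit
```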

Swiss Engineering

Developed in collaboration with ZHAW (Zurich University of Applied Sciences). Swiss precision in research, engineering, and quality assurance.

Interested in Spelly Studio?

We're currently exploring partnerships with TTS vendors, e-learning platforms, and content studios. If you need pronunciation-aware quality assessment at scale, let's talk.

vernaraglobal@gmail.com spelly.online

Generate speech.
Know it's right.

