TTS Architecture

A technical overview of the Bishnupriya Manipuri text-to-speech pipeline

This page explains the technical architecture of the Bishnupriya Manipuri text-to-speech system, showing how dictionary entries are converted into pronunciation data, diphone filenames, and final browser playback.

About this page. The research archive contains article-level explanations and toolkit pages for specific parts of the system. This page brings everything together into one unified technical view of the full TTS pipeline.

1. Full Pipeline Overview

Dictionary word
   ↓
Orthographic normalization
   ↓
BPM to IPA conversion
   ↓
Phoneme tokenization
   ↓
Diphone generation
   ↓
Safe filename mapping
   ↓
Audio file lookup
   ↓
Browser playback
  

Each stage depends on the previous one being stable. If one layer changes unexpectedly, later stages may fail even if their own code is correct.

2. System Goal

The goal of the TTS system is to turn a Bishnupriya Manipuri word into playable speech by combining rule-based pronunciation generation with a prerecorded diphone audio library.

Core design principle:

One shared pronunciation engine should feed every page, validator, and playback path.

3. Layer 1: Dictionary Input

The pipeline begins with a dictionary word record. The dictionary is the lexical foundation of the system and supplies the source form that needs pronunciation and speech output.

Field                      Use in TTS
Word / BPM form            Primary text input
ID                         Stable lookup key for API and word pages
IPA field (if stored)      Can support validation or display
Part of speech / metadata  May support future linguistic refinement

4. Layer 2: Orthographic Normalization

Before pronunciation logic runs, the input should be normalized. This reduces errors caused by inconsistent encoding or unexpected character forms.

Raw text
   ↓
Unicode normalization
   ↓
clean internal form
  

This step is especially important for Eastern Nagari text processing.
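
As a minimal sketch, normalization can be done with Python's standard library. The choice of NFC (canonical composition) here is an assumption; the project may use a different normal form:

```python
import unicodedata

def normalize_bpm(text: str) -> str:
    """Normalize raw dictionary text into a clean internal form.

    NFC composes canonically decomposed sequences into single code
    points (e.g. the Eastern Nagari vowel sign O can arrive as two
    separate signs), so later rules see one canonical spelling.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())  # collapse stray whitespace

# A decomposed vowel sign (E-sign + AA-sign) composes into O-sign:
print(normalize_bpm("ক\u09c7\u09be") == "ক\u09cb")  # True
```

The same function also trims accidental leading, trailing, and internal whitespace, which is a common source of silent lookup failures.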

5. Layer 3: BPM to IPA Conversion

The rule-based converter transforms the written word into an IPA representation. This stage is one of the most important in the whole architecture.

Example:
অক্ষর → ɔkʰʃɔr

The converter must handle:
  • Eastern Nagari vowel signs and conjunct consonants (as in অক্ষর above)
  • schwa and inherent-vowel behavior
  • consistent output for every page that calls it

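A minimal sketch of a rule-based converter is shown below. The rule table is illustrative only, covering just the দিশা example used later on this page; the project's real rules cover conjuncts, schwa deletion, and the full character inventory:

```python
# Illustrative character-to-IPA rule table (a tiny subset, not the
# project's actual rule set).
RULES = {
    "দ": "d",
    "ি": "i",
    "শ": "ʃ",
    "া": "a",
}

def bpm_to_ipa(word: str) -> str:
    """Convert a normalized BPM word to IPA, character by character."""
    ipa = []
    for ch in word:
        if ch not in RULES:
            raise ValueError(f"no conversion rule for {ch!r}")
        ipa.append(RULES[ch])
    return "".join(ipa)

print(bpm_to_ipa("দিশা"))  # diʃa
```

A real converter needs context-sensitive rules rather than a flat character map, but the interface (word in, IPA string out) stays the same.
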
6. Layer 4: Phoneme Tokenization

Once IPA is generated, the next step is to split it into phoneme units.

Example:
IPA: diʃa
Phonemes: d i ʃ a

This stage must use the same phoneme rules everywhere. If tokenization differs across pages, the diphone sequence will also differ.
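
One way to share tokenization is a greedy longest-match split over a single phoneme inventory. The inventory below is an illustrative subset; the important property is that every page imports the same one:

```python
# Shared phoneme inventory (illustrative subset). Multi-character
# phonemes such as aspirated kʰ must be listed so they are not split.
INVENTORY = {"d", "i", "ʃ", "a", "ɔ", "k", "kʰ", "r"}

def tokenize(ipa: str) -> list[str]:
    """Split an IPA string into phonemes by greedy longest match."""
    longest = max(len(p) for p in INVENTORY)
    tokens, i = [], 0
    while i < len(ipa):
        for size in range(min(longest, len(ipa) - i), 0, -1):
            candidate = ipa[i:i + size]
            if candidate in INVENTORY:
                tokens.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"unknown phoneme at {ipa[i:]!r}")
    return tokens

print(tokenize("diʃa"))   # ['d', 'i', 'ʃ', 'a']
print(tokenize("ɔkʰʃɔr"))  # ['ɔ', 'kʰ', 'ʃ', 'ɔ', 'r']
```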

7. Layer 5: Diphone Generation

The phoneme sequence is transformed into diphone transitions. These are the units that connect the linguistic layer to the audio layer.

Phonemes: d i ʃ a
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#
  

Boundary diphones are included so the word has a natural entry and exit in playback.
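
The pairing above can be sketched in a few lines, padding the phoneme list with a boundary marker on each side:

```python
def to_diphones(phonemes: list[str], boundary: str = "#") -> list[str]:
    """Pair each phoneme with its neighbour, padding with the word
    boundary so playback has a natural entry and exit."""
    padded = [boundary] + phonemes + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["d", "i", "ʃ", "a"]))
# ['#-d', 'd-i', 'i-ʃ', 'ʃ-a', 'a-#']
```

A word of n phonemes always yields n + 1 diphones, which is a cheap invariant for validators to check.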

8. Layer 6: Safe Filename Mapping

Diphone strings are then converted into filesystem-safe names. This lets the audio library use predictable WAV filenames instead of raw IPA symbols.

IPA Form  Safe Form  Filename
#-d       sil-d      sil-d.wav
i-ʃ       i-sh       i-sh.wav
ʃ-a       sh-a       sh-a.wav
a-#       a-sil      a-sil.wav

If safe filename rules change, validators and playback logic must change with them. This is why the safe mapping layer must be frozen before rebuilding the audio library.
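
A sketch of the mapping, using only the substitutions from the table above (the real frozen table is larger):

```python
# IPA symbols that are not filesystem-safe and their replacements.
# Illustrative subset of the frozen mapping table.
SAFE = {"ʃ": "sh", "#": "sil"}

def safe_filename(diphone: str) -> str:
    """Turn an IPA diphone like 'i-ʃ' into a safe WAV filename."""
    left, right = diphone.split("-")
    return f"{SAFE.get(left, left)}-{SAFE.get(right, right)}.wav"

print(safe_filename("#-d"))  # sil-d.wav
print(safe_filename("i-ʃ"))  # i-sh.wav
```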

9. Layer 7: Audio File Lookup

Once safe filenames are produced, the system checks whether the expected WAV files exist.

sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav
  

If one or more expected files are missing, playback becomes partial or fails. This is where validator tools become essential.
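
A coverage check can be as simple as the sketch below; the audio directory argument is a placeholder for wherever the library lives:

```python
from pathlib import Path

def missing_files(filenames: list[str], audio_dir: str) -> list[str]:
    """Return the expected WAV files that are absent from the library,
    so partial playback can be flagged before the user hears it."""
    root = Path(audio_dir)
    return [name for name in filenames if not (root / name).is_file()]
```

Running this against every word in the dictionary gives a full coverage report for the audio library in one pass.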

10. Layer 8: Browser Playback

On the client side, JavaScript or browser audio logic loads the diphone files and plays them in sequence.

load filenames
   ↓
request WAV files
   ↓
play in order
   ↓
heard as one synthesized word
  

This stage is the user-facing end of the TTS pipeline, but it depends entirely on the earlier linguistic and file-generation stages being correct.
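
The browser side typically sequences the files with the Web Audio API or chained audio elements. As a server-side analogue of the same idea, the diphones can be joined into one playable WAV with Python's standard `wave` module, assuming every file shares the same channel count, sample width, and sample rate:

```python
import io
import wave

def concat_wavs(paths: list[str]) -> bytes:
    """Join diphone WAV files end to end into one WAV byte string.
    Assumes all inputs share identical audio parameters."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        first = True
        for path in paths:
            with wave.open(path, "rb") as src:
                if first:
                    out.setparams(src.getparams())
                    first = False
                out.writeframes(src.readframes(src.getnframes()))
    return buf.getvalue()
```

Whether sequencing happens client-side or server-side, the contract is the same: the filename list produced by the mapping layer is the only input playback needs.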

11. Architecture Diagram by System Layer

Lexical Layer

Dictionary entries and source words.

Phonological Layer

IPA conversion, schwa rules, phoneme inventory, and tokenization.

Transition Layer

Diphone generation and boundary handling.

Mapping Layer

Safe filename conversion and deployment naming standards.

Audio Layer

Diphone WAV inventory, segmentation outputs, and file coverage.

Playback Layer

Browser-side loading, sequencing, and speech playback.

12. Validation Layer Across the Whole System

Validation is not an afterthought. It cuts across the whole architecture.

Dictionary word
   ↓
IPA check
   ↓
Phoneme check
   ↓
Diphone check
   ↓
Filename check
   ↓
Audio file existence check
   ↓
Playback check
  

A good validator can show exactly where the pipeline breaks.
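
One way to structure such a validator is to inject the shared stage functions and stop at the first check that fails. The checks below (non-empty IPA, tokens that rejoin to the IPA string, the n + 1 diphone count, a safe-character filename pattern, and library coverage) are illustrative invariants, not the project's exact rule set:

```python
import re

def validate_word(word, to_ipa, tokenize, to_diphones, to_filename,
                  audio_files):
    """Run the pipeline stage by stage and name the first failing check.
    Returns (stage, detail) on failure, or None when everything passes.
    Stage functions are injected so every page shares one core."""
    ipa = to_ipa(word)
    if not ipa:
        return ("IPA check", "empty IPA output")
    phonemes = tokenize(ipa)
    if "".join(phonemes) != ipa:
        return ("Phoneme check", "tokens do not rejoin to the IPA string")
    diphones = to_diphones(phonemes)
    if len(diphones) != len(phonemes) + 1:
        return ("Diphone check", "unexpected diphone count")
    names = [to_filename(d) for d in diphones]
    bad = [n for n in names
           if not re.fullmatch(r"[a-z0-9]+-[a-z0-9]+\.wav", n)]
    if bad:
        return ("Filename check", f"unsafe names: {bad}")
    missing = [n for n in names if n not in audio_files]
    if missing:
        return ("Audio file existence check", f"missing: {missing}")
    return None
```

Because the failure carries a stage name, a validator page can display exactly which layer of the architecture broke for a given word.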

13. Common Failure Points

Rule Drift

IPA output changes in one page but not another.

Tokenization Drift

One tokenizer splits sounds differently than the stable shared one.

Diphone Drift

The expected diphone sequence no longer matches the audio library.

Filename Drift

Safe filename rules have changed but audio files still use the old form.

Old File Pollution

Old diphone files remain mixed with rebuilt files.

Playback Mismatch

Browser-side code asks for filenames that do not exist.

14. Recommended Architectural Rule

One shared conversion core should feed:
  • dictionary word pages
  • batch tools
  • validator pages
  • diphone tools
  • TTS playback pages

This reduces mismatch and makes the system much easier to maintain.

15. Practical End-to-End Example

Word: দিশা
Input word: দিশা

IPA:
diʃa

Phonemes:
d i ʃ a

Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#

Safe filenames:
sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

Playback:
load and play these 5 files in sequence
  
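The worked example above can be reproduced end to end in a few lines. The rule and mapping tables are illustrative subsets covering only this word, not the project's full data:

```python
# End-to-end sketch for দিশা, tying the stages together.
RULES = {"দ": "d", "ি": "i", "শ": "ʃ", "া": "a"}  # illustrative subset
SAFE = {"ʃ": "sh", "#": "sil"}                      # illustrative subset

word = "দিশা"
ipa = "".join(RULES[ch] for ch in word)             # diʃa
phonemes = list(ipa)                                # d i ʃ a
padded = ["#"] + phonemes + ["#"]
diphones = [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
filenames = ["-".join(SAFE.get(p, p) for p in d.split("-")) + ".wav"
             for d in diphones]
print(filenames)
# ['sil-d.wav', 'd-i.wav', 'i-sh.wav', 'sh-a.wav', 'a-sil.wav']
```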

16. Related Archive Pages

Architecture note. This page should serve as the central technical overview of the project. As the system evolves, it can be expanded with diagrams, API references, code-layer notes, and audio flow documentation.