TTS Architecture

A technical overview of the Bishnupriya Manipuri text-to-speech pipeline

This page explains the technical architecture of the Bishnupriya Manipuri text-to-speech system, showing how dictionary entries are converted into pronunciation data, diphone filenames, and final browser playback.

About this page. The research archive contains article-level explanations and toolkit pages for specific parts of the system. This page brings everything together into one unified technical view of the full TTS pipeline.

1. Full Pipeline Overview

Dictionary word
   ↓
Orthographic normalization
   ↓
BPM to IPA conversion
   ↓
Phoneme tokenization
   ↓
Diphone generation
   ↓
Safe filename mapping
   ↓
Audio file lookup
   ↓
Browser playback
  

Each stage depends on the previous one being stable. If one layer changes unexpectedly, later stages may fail even if their own code is correct.

2. System Goal

The goal of the TTS system is to turn a Bishnupriya Manipuri word into playable speech by combining rule-based pronunciation generation with a prerecorded diphone audio library.

Core design principle:

One shared pronunciation engine should feed every page, validator, and playback path.

3. Layer 1: Dictionary Input

The pipeline begins with a dictionary word record. The dictionary is the lexical foundation of the system and supplies the source form that needs pronunciation and speech output.

Field                      Use in TTS
Word / BPM form            Primary text input
ID                         Stable lookup key for API and word pages
IPA field (if stored)      Can support validation or display
Part of speech / metadata  May support future linguistic refinement

4. Layer 2: Orthographic Normalization

Before pronunciation logic runs, the input should be normalized. This reduces errors caused by inconsistent encoding or unexpected character forms.

Raw text
   ↓
Unicode normalization
   ↓
clean internal form
  

This step is especially important for Eastern Nagari text processing.
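
As a minimal sketch, normalization can be done with Python's standard library. The choice of NFC (canonical composition) here is an assumption; the project may use a different normal form:

```python
import unicodedata

def normalize_bpm(text: str) -> str:
    """Normalize raw dictionary text into a clean internal form.

    NFC composes canonically decomposed sequences into single code
    points (e.g. the Eastern Nagari vowel sign O can arrive as two
    separate signs), so later rules see one canonical spelling.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())  # collapse stray whitespace

# A decomposed vowel sign (E-sign + AA-sign) composes into O-sign:
print(normalize_bpm("ক\u09c7\u09be") == "ক\u09cb")  # True
```

The same function also trims accidental leading, trailing, and internal whitespace, which is a common source of silent lookup failures.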

5. Layer 3: BPM to IPA Conversion

The rule-based converter transforms the written word into an IPA representation. This stage is one of the most important in the whole architecture.

Example:
অক্ষর → ɔkʰʃɔr

The converter must handle:
  • Eastern Nagari vowel signs and conjunct consonants (as in অক্ষর above)
  • schwa and inherent-vowel behavior
  • consistent output for every page that calls it

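A minimal sketch of a rule-based converter is shown below. The rule table is illustrative only, covering just the দিশা example used later on this page; the project's real rules cover conjuncts, schwa deletion, and the full character inventory:

```python
# Illustrative character-to-IPA rule table (a tiny subset, not the
# project's actual rule set).
RULES = {
    "দ": "d",
    "ি": "i",
    "শ": "ʃ",
    "া": "a",
}

def bpm_to_ipa(word: str) -> str:
    """Convert a normalized BPM word to IPA, character by character."""
    ipa = []
    for ch in word:
        if ch not in RULES:
            raise ValueError(f"no conversion rule for {ch!r}")
        ipa.append(RULES[ch])
    return "".join(ipa)

print(bpm_to_ipa("দিশা"))  # diʃa
```

A real converter needs context-sensitive rules rather than a flat character map, but the interface (word in, IPA string out) stays the same.
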
6. Layer 4: Phoneme Tokenization

Once IPA is generated, the next step is to split it into phoneme units.

Example:
IPA: diʃa
Phonemes: d i ʃ a

This stage must use the same phoneme rules everywhere. If tokenization differs across pages, the diphone sequence will also differ.
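
One way to share tokenization is a greedy longest-match split over a single phoneme inventory. The inventory below is an illustrative subset; the important property is that every page imports the same one:

```python
# Shared phoneme inventory (illustrative subset). Multi-character
# phonemes such as aspirated kʰ must be listed so they are not split.
INVENTORY = {"d", "i", "ʃ", "a", "ɔ", "k", "kʰ", "r"}

def tokenize(ipa: str) -> list[str]:
    """Split an IPA string into phonemes by greedy longest match."""
    longest = max(len(p) for p in INVENTORY)
    tokens, i = [], 0
    while i < len(ipa):
        for size in range(min(longest, len(ipa) - i), 0, -1):
            candidate = ipa[i:i + size]
            if candidate in INVENTORY:
                tokens.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"unknown phoneme at {ipa[i:]!r}")
    return tokens

print(tokenize("diʃa"))   # ['d', 'i', 'ʃ', 'a']
print(tokenize("ɔkʰʃɔr"))  # ['ɔ', 'kʰ', 'ʃ', 'ɔ', 'r']
```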

7. Layer 5: Diphone Generation

The phoneme sequence is transformed into diphone transitions. These are the units that connect the linguistic layer to the audio layer.

Phonemes: d i ʃ a
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#
  

Boundary diphones are included so the word has a natural entry and exit in playback.
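
The pairing above can be sketched in a few lines, padding the phoneme list with a boundary marker on each side:

```python
def to_diphones(phonemes: list[str], boundary: str = "#") -> list[str]:
    """Pair each phoneme with its neighbour, padding with the word
    boundary so playback has a natural entry and exit."""
    padded = [boundary] + phonemes + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["d", "i", "ʃ", "a"]))
# ['#-d', 'd-i', 'i-ʃ', 'ʃ-a', 'a-#']
```

A word of n phonemes always yields n + 1 diphones, which is a cheap invariant for validators to check.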

8. Layer 6: Safe Filename Mapping

Diphone strings are then converted into filesystem-safe names. This lets the audio library use predictable WAV filenames instead of raw IPA symbols.

IPA Form  Safe Form  Filename
#-d       sil-d      sil-d.wav
i-ʃ       i-sh       i-sh.wav
ʃ-a       sh-a       sh-a.wav
a-#       a-sil      a-sil.wav

If safe filename rules change, validators and playback logic must change with them. This is why the safe mapping layer must be frozen before rebuilding the audio library.
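
A sketch of the mapping, using only the substitutions from the table above (the real frozen table is larger):

```python
# IPA symbols that are not filesystem-safe and their replacements.
# Illustrative subset of the frozen mapping table.
SAFE = {"ʃ": "sh", "#": "sil"}

def safe_filename(diphone: str) -> str:
    """Turn an IPA diphone like 'i-ʃ' into a safe WAV filename."""
    left, right = diphone.split("-")
    return f"{SAFE.get(left, left)}-{SAFE.get(right, right)}.wav"

print(safe_filename("#-d"))  # sil-d.wav
print(safe_filename("i-ʃ"))  # i-sh.wav
```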

9. Layer 7: Audio File Lookup

Once safe filenames are produced, the system checks whether the expected WAV files exist.

sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav
  

If one or more expected files are missing, playback becomes partial or fails. This is where validator tools become essential.
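
A coverage check can be as simple as the sketch below; the audio directory argument is a placeholder for wherever the library lives:

```python
from pathlib import Path

def missing_files(filenames: list[str], audio_dir: str) -> list[str]:
    """Return the expected WAV files that are absent from the library,
    so partial playback can be flagged before the user hears it."""
    root = Path(audio_dir)
    return [name for name in filenames if not (root / name).is_file()]
```

Running this against every word in the dictionary gives a full coverage report for the audio library in one pass.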

10. Layer 8: Browser Playback

On the client side, JavaScript or browser audio logic loads the diphone files and plays them in sequence.

load filenames
   ↓
request WAV files
   ↓
play in order
   ↓
heard as one synthesized word
  

This stage is the user-facing end of the TTS pipeline, but it depends entirely on the earlier linguistic and file-generation stages being correct.
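
The browser side typically sequences the files with the Web Audio API or chained audio elements. As a server-side analogue of the same idea, the diphones can be joined into one playable WAV with Python's standard `wave` module, assuming every file shares the same channel count, sample width, and sample rate:

```python
import io
import wave

def concat_wavs(paths: list[str]) -> bytes:
    """Join diphone WAV files end to end into one WAV byte string.
    Assumes all inputs share identical audio parameters."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        first = True
        for path in paths:
            with wave.open(path, "rb") as src:
                if first:
                    out.setparams(src.getparams())
                    first = False
                out.writeframes(src.readframes(src.getnframes()))
    return buf.getvalue()
```

Whether sequencing happens client-side or server-side, the contract is the same: the filename list produced by the mapping layer is the only input playback needs.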

11. Architecture Diagram by System Layer

Lexical Layer

Dictionary entries and source words.

Phonological Layer

IPA conversion, schwa rules, phoneme inventory, and tokenization.

Transition Layer

Diphone generation and boundary handling.

Mapping Layer

Safe filename conversion and deployment naming standards.

Audio Layer

Diphone WAV inventory, segmentation outputs, and file coverage.

Playback Layer

Browser-side loading, sequencing, and speech playback.

12. Validation Layer Across the Whole System

Validation is not an afterthought. It cuts across the whole architecture.

Dictionary word
   ↓
IPA check
   ↓
Phoneme check
   ↓
Diphone check
   ↓
Filename check
   ↓
Audio file existence check
   ↓
Playback check
  

A good validator can show exactly where the pipeline breaks.
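
One way to structure such a validator is to inject the shared stage functions and stop at the first check that fails. The checks below (non-empty IPA, tokens that rejoin to the IPA string, the n + 1 diphone count, a safe-character filename pattern, and library coverage) are illustrative invariants, not the project's exact rule set:

```python
import re

def validate_word(word, to_ipa, tokenize, to_diphones, to_filename,
                  audio_files):
    """Run the pipeline stage by stage and name the first failing check.
    Returns (stage, detail) on failure, or None when everything passes.
    Stage functions are injected so every page shares one core."""
    ipa = to_ipa(word)
    if not ipa:
        return ("IPA check", "empty IPA output")
    phonemes = tokenize(ipa)
    if "".join(phonemes) != ipa:
        return ("Phoneme check", "tokens do not rejoin to the IPA string")
    diphones = to_diphones(phonemes)
    if len(diphones) != len(phonemes) + 1:
        return ("Diphone check", "unexpected diphone count")
    names = [to_filename(d) for d in diphones]
    bad = [n for n in names
           if not re.fullmatch(r"[a-z0-9]+-[a-z0-9]+\.wav", n)]
    if bad:
        return ("Filename check", f"unsafe names: {bad}")
    missing = [n for n in names if n not in audio_files]
    if missing:
        return ("Audio file existence check", f"missing: {missing}")
    return None
```

Because the failure carries a stage name, a validator page can display exactly which layer of the architecture broke for a given word.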

13. Common Failure Points

Rule Drift

IPA output changes in one page but not another.

Tokenization Drift

One tokenizer splits sounds differently than the stable shared one.

Diphone Drift

The expected diphone sequence no longer matches the audio library.

Filename Drift

Safe filename rules have changed but audio files still use the old form.

Old File Pollution

Old diphone files remain mixed with rebuilt files.

Playback Mismatch

Browser-side code asks for filenames that do not exist.

14. Recommended Architectural Rule

One shared conversion core should feed:
  • dictionary word pages
  • batch tools
  • validator pages
  • diphone tools
  • TTS playback pages

This reduces mismatch and makes the system much easier to maintain.

15. Practical End-to-End Example

Word: দিশা
Input word: দিশা

IPA:
diʃa

Phonemes:
d i ʃ a

Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#

Safe filenames:
sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

Playback:
load and play these 5 files in sequence
  
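The worked example above can be reproduced end to end in a few lines. The rule and mapping tables are illustrative subsets covering only this word, not the project's full data:

```python
# End-to-end sketch for দিশা, tying the stages together.
RULES = {"দ": "d", "ি": "i", "শ": "ʃ", "া": "a"}  # illustrative subset
SAFE = {"ʃ": "sh", "#": "sil"}                      # illustrative subset

word = "দিশা"
ipa = "".join(RULES[ch] for ch in word)             # diʃa
phonemes = list(ipa)                                # d i ʃ a
padded = ["#"] + phonemes + ["#"]
diphones = [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
filenames = ["-".join(SAFE.get(p, p) for p in d.split("-")) + ".wav"
             for d in diphones]
print(filenames)
# ['sil-d.wav', 'd-i.wav', 'i-sh.wav', 'sh-a.wav', 'a-sil.wav']
```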

16. Related Archive Pages

Architecture note. This page should serve as the central technical overview of the project. As the system evolves, it can be expanded with diagrams, API references, code-layer notes, and audio flow documentation.