Designing a Diphone Inventory for Bishnupriya Manipuri
Core and extended diphone design for a low-resource TTS system
Publication Information
| Status | First Edition |
|---|---|
| Edition | Web Edition |
| Version | 1.0 |
| Published | 2026-03-10 |
| Last Revised | 2026-03-10 |
| Citation Note | Cite by chapter title, archive title, edition, and year. |
| License / Use | Academic and non-commercial use with attribution. |
1. Introduction
A diphone is a speech unit that spans the transition from the middle of one phoneme to the middle of the next phoneme.
In a diphone-based TTS system, words are not normally stored as complete recordings. Instead, words are assembled from smaller reusable audio units.
Text
↓
IPA
↓
Phoneme sequence
↓
Diphone sequence
↓
Audio concatenation
↓
Speech output
This method is especially useful for Bishnupriya Manipuri because:
- the language has limited speech technology resources
- a full word-level synthesis library would be too large
- a diphone inventory can cover many words efficiently
- the system can be built incrementally
2. What Is a Diphone?
A diphone represents the transition between two adjacent phonemes.
IPA: kɔtʰa
Phonemes:
k ɔ tʰ a
Diphones:
#-k k-ɔ ɔ-tʰ tʰ-a a-#
The boundary symbol # represents the beginning or end of a word.
In filenames, this boundary is often converted into a safe symbol such as sil.
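The conversion from an IPA string to a diphone sequence can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the function names `split_phonemes` and `to_diphones` are hypothetical, and the small phoneme set passed in is only large enough for the example word in this section.

```python
def split_phonemes(ipa, inventory):
    """Segment an IPA string by greedy longest match against a phoneme set.

    Longest match matters for symbols such as 'tʰ', which must not be
    split into 't' followed by a stray aspiration mark.
    """
    symbols = sorted(inventory, key=len, reverse=True)
    phonemes = []
    i = 0
    while i < len(ipa):
        for sym in symbols:
            if ipa.startswith(sym, i):
                phonemes.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(f"Unknown symbol at position {i}: {ipa[i]!r}")
    return phonemes


def to_diphones(phonemes, boundary="#"):
    """Pad the phoneme list with word boundaries and pair up neighbours."""
    if not phonemes:
        return []
    padded = [boundary, *phonemes, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Illustrative phoneme subset (assumption: not the full inventory).
demo_inventory = {"k", "ɔ", "t", "tʰ", "a"}
print(to_diphones(split_phonemes("kɔtʰa", demo_inventory)))
```

A word with n phonemes always yields n + 1 diphones, because both word edges contribute a boundary unit.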
3. Why Use Diphones?
The main advantages of a diphone system are:
- small audio database compared with full-word synthesis
- reuse of the same transitions across many words
- good balance between quality and engineering simplicity
- practical for low-resource language projects
A diphone system usually sounds more natural than isolated phoneme concatenation because it preserves the transition between adjacent sounds.
4. From Phoneme Inventory to Diphone Inventory
A diphone inventory is derived from the phoneme inventory of the language.
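If a language contains N phonemes, then the theoretical maximum number of phoneme-to-phoneme diphones is:
N × N
Counting the word boundary # as an additional context adds up to 2N boundary diphones (#-X and X-#).
However, many of these combinations do not occur in actual words. Therefore, a practical diphone inventory must be built from real lexical data.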
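To see how quickly the theoretical ceiling grows, the arithmetic can be written out as a small function. The phoneme count of 35 used here is purely illustrative and is not a claim about the actual Bishnupriya Manipuri inventory; the function name `theoretical_max` is likewise hypothetical.

```python
def theoretical_max(n_phonemes, with_boundary=True):
    """Upper bound on the number of distinct diphones.

    Without boundaries: N * N phoneme-to-phoneme pairs.
    With boundaries: add N word-initial (#-X) and N word-final (X-#) units.
    """
    internal = n_phonemes * n_phonemes
    if with_boundary:
        return internal + 2 * n_phonemes
    return internal


n = 35  # assumption: placeholder value for illustration only
print(theoretical_max(n, with_boundary=False))  # 1225
print(theoretical_max(n))                       # 1295
```

Even with a placeholder value, the ceiling is several times larger than an inventory of a few hundred units, which is why the inventory should be driven by attested combinations rather than by the full grid.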
5. Types of Diphones
A well-designed diphone inventory should include several structural categories.
5.1 Boundary-to-phoneme
#-k #-g #-a #-i
These represent word-initial sounds.
5.2 Phoneme-to-boundary
a-# n-# r-#
These represent word-final sounds.
5.3 Consonant-to-vowel
k-a g-i t-u
These are among the most frequent and important diphones.
5.4 Vowel-to-consonant
a-k i-n ɔ-r
These are also essential because syllables often close with consonants.
5.5 Consonant-to-consonant
g-n k-s n-t r-k
These are required for clusters and Sanskritic forms.
5.6 Vowel-to-vowel
a-i a-u i-o
These are needed for words containing vowel sequences, diphthong-like structures, or morpheme junctions.
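The six categories above can be assigned automatically once a vowel set is defined. The sketch below assumes a hypothetical `VOWELS` set; the real vowel inventory should come from the project's frozen phoneme list, and every non-vowel, non-boundary symbol is treated as a consonant.

```python
# Assumption: illustrative vowel set, not the confirmed inventory.
VOWELS = {"a", "aː", "i", "iː", "u", "uː", "e", "o", "ɔ", "ə"}
BOUNDARY = "#"


def classify(diphone):
    """Return the structural category of a diphone written as 'left-right'."""
    left, right = diphone.split("-", 1)
    if left == BOUNDARY:
        return "boundary-to-phoneme"
    if right == BOUNDARY:
        return "phoneme-to-boundary"
    left_kind = "vowel" if left in VOWELS else "consonant"
    right_kind = "vowel" if right in VOWELS else "consonant"
    return f"{left_kind}-to-{right_kind}"


for d in ["#-k", "a-#", "k-a", "ɔ-r", "g-n", "a-i"]:
    print(d, "->", classify(d))
```

Tagging each unit this way makes it easy to check that no category is missing from a draft inventory.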
6. Core vs Extended Diphone Inventory
For practical TTS development, the diphone inventory should be built in phases.
6.1 Core inventory
This layer covers the most common words and phonotactic structures.
- boundary diphones
- common consonant-vowel combinations
- common vowel-consonant combinations
- high-frequency vowel-vowel transitions
Target size:
180–220 diphones
6.2 Extended inventory
This layer adds:
- rare clusters
- learned Sanskrit forms
- borrowed words
- less frequent consonant transitions
Target size:
250–320 diphones
6.3 Rare and exceptional inventory
This layer should be added only after the core system is stable.
- highly marked learned forms
- very uncommon lexical items
- dictionary-only rare transitions
7. Designing Safe Filenames
A diphone inventory must be linked to actual audio files. Therefore, each diphone needs a stable and filesystem-safe filename.
A practical mapping system is:
| IPA | Safe form |
|---|---|
| # | sil |
| aː | aa |
| iː | ii |
| uː | uu |
| ʃ | sh |
| ŋ | ng |
| ɽ | rr |
| ɔ | aw |
| ə | schwa |
| tʰ | th |
| tʃ | ch |
Examples:
#-d → sil-d.wav
ʃ-aː → sh-aa.wav
aː-# → aa-sil.wav
ɔ-r → aw-r.wav
Once this mapping is fixed, it should never be changed during a rebuild. Otherwise old audio files become incompatible.
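A minimal sketch of the mapping as code follows. The `SAFE` table reproduces the mapping in this section; the `th` and `ch` entries are inferred from the filenames used in Section 10. The function names are hypothetical, and the decision to reject unmapped non-ASCII symbols instead of guessing a spelling is an assumption about how strict the build should be.

```python
SAFE = {
    "#": "sil",
    "aː": "aa",
    "iː": "ii",
    "uː": "uu",
    "ʃ": "sh",
    "ŋ": "ng",
    "ɽ": "rr",
    "ɔ": "aw",
    "ə": "schwa",
    "tʰ": "th",  # inferred from aw-th.wav in Section 10
    "tʃ": "ch",  # inferred from i-ch.wav in Section 10
}


def safe_symbol(phoneme):
    """Map one phoneme to its filesystem-safe spelling."""
    if phoneme in SAFE:
        return SAFE[phoneme]
    if phoneme.isascii() and phoneme.isalnum():
        return phoneme  # plain letters such as 'k' or 'd' pass through
    raise KeyError(f"No safe form defined for {phoneme!r}")


def diphone_filename(left, right):
    """Build the WAV filename for the diphone left-right."""
    return f"{safe_symbol(left)}-{safe_symbol(right)}.wav"


print(diphone_filename("#", "d"))   # sil-d.wav
print(diphone_filename("ʃ", "aː"))  # sh-aa.wav
```

Raising an error for an unmapped symbol forces every new phoneme to be added to the table deliberately, which helps keep the naming scheme frozen.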
8. Choosing Source Words for Recording
Diphones should not usually be recorded as isolated syllables. A better method is to record carefully selected whole words, then segment diphones out of those recordings.
A good source word list should:
- cover common phoneme transitions
- include short and long vowels
- include nasal environments
- include common consonant clusters
- include word-initial and word-final contrasts
Example source words:
দিশা মানু কথা অক্ষর অগ্নি অনুরোধ অপমান আকাশ ইচ্ছা উজ্জ্বল একান্ত ঔষধ
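One way to shorten the recording list is a greedy selection that repeatedly picks the word covering the most still-missing diphones. This is a common set-cover heuristic offered here as a suggestion, not a method stated by the source; the function name `pick_seed_words` and the placeholder words `w1` to `w3` are hypothetical.

```python
def pick_seed_words(word_diphones, targets):
    """Greedily choose words until the target diphones are covered.

    word_diphones: dict mapping each candidate word to its diphone list.
    targets: iterable of diphones that must be recorded.
    Returns (chosen_words, uncovered_diphones).
    """
    remaining = set(targets)
    chosen = []
    while remaining and word_diphones:
        # Pick the word that covers the most diphones still missing.
        best = max(word_diphones,
                   key=lambda w: len(remaining & set(word_diphones[w])))
        gained = remaining & set(word_diphones[best])
        if not gained:
            break  # no candidate word helps any further
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


candidates = {
    "w1": ["#-k", "k-a", "a-#"],
    "w2": ["#-k", "k-a"],
    "w3": ["#-g", "g-i", "i-#"],
}
words, uncovered = pick_seed_words(candidates, {"#-k", "k-a", "a-#", "#-g", "n-t"})
print(words, uncovered)
```

Any diphones left in the second return value have no source word yet and need new entries in the seed list.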
9. From Dictionary to Diphone Inventory
The most reliable method of building a diphone inventory is to derive it from the dictionary.
Dictionary word list
↓
IPA conversion
↓
Phoneme extraction
↓
Diphone generation
↓
Unique diphone inventory
This approach ensures that the inventory reflects actual lexical usage, not just theoretical combinations.
It also supports:
- coverage analysis
- missing diphone tracking
- priority-based recording
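The pipeline above, together with coverage analysis and missing-diphone tracking, might look like the following sketch. It assumes the lexicon has already been converted to phoneme lists (the IPA conversion step is out of scope here); the function names and the two-entry toy lexicon are illustrative only.

```python
from collections import Counter


def to_diphones(phonemes, boundary="#"):
    """Pad a phoneme list with word boundaries and pair up neighbours."""
    if not phonemes:
        return []
    padded = [boundary, *phonemes, boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def build_inventory(lexicon):
    """Count how often each diphone occurs across all dictionary entries."""
    counts = Counter()
    for phonemes in lexicon.values():
        counts.update(to_diphones(phonemes))
    return counts


def coverage_report(inventory, recorded):
    """Summarise how much of the inventory the recorded set covers."""
    total_tokens = sum(inventory.values())
    covered_tokens = sum(n for d, n in inventory.items() if d in recorded)
    covered_types = sum(1 for d in inventory if d in recorded)
    # Most frequent missing diphones first, as a recording priority list.
    missing = [d for d, _ in inventory.most_common() if d not in recorded]
    return {
        "type_coverage": covered_types / len(inventory) if inventory else 0.0,
        "token_coverage": covered_tokens / total_tokens if total_tokens else 0.0,
        "missing": missing,
    }


# Toy lexicon (assumption: entries are illustrative, not dictionary data).
lexicon = {"kɔtʰa": ["k", "ɔ", "tʰ", "a"], "kɔr": ["k", "ɔ", "r"]}
inventory = build_inventory(lexicon)
print(coverage_report(inventory, recorded={"#-k", "k-ɔ"}))
```

Token coverage weights each diphone by how often it occurs in the lexicon, so it usually rises faster than type coverage once the frequent units are recorded.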
10. Practical Inventory Example
A practical core diphone inventory for Bishnupriya Manipuri may include entries like:
| Diphone | Safe filename | Priority | Example word |
|---|---|---|---|
| #-k | sil-k.wav | Core | কর |
| k-ɔ | k-aw.wav | Core | কথা |
| ɔ-tʰ | aw-th.wav | Core | কথা |
| tʰ-a | th-a.wav | Core | কথা |
| a-# | a-sil.wav | Core | কথা |
| g-n | g-n.wav | Extended | অগ্নি |
| k-ʃ | k-sh.wav | Extended | অক্ষর |
| ʃ-aː | sh-aa.wav | Core | দিশা |
| i-tʃ | i-ch.wav | Extended | ইচ্ছা |
11. Recording Strategy
A practical recording workflow should follow these stages:
- freeze the IPA and filename rules
- prepare a clean seed word list
- record words in a quiet environment
- normalize all audio to one technical format
- segment diphones from word recordings
- store files in a clean /audio/diphone/ folder
- validate coverage
Recommended normalized format:
44.1 kHz mono 16-bit WAV
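Whether a file already matches this format can be checked with Python's standard `wave` module. The sketch below only inspects the header of uncompressed PCM WAV files and does not perform any conversion; the function name `check_wav_format` and the `EXPECTED` dictionary are hypothetical.

```python
import wave

# Target format from this section: 44.1 kHz, mono, 16-bit (2 bytes per sample).
EXPECTED = {"framerate": 44100, "channels": 1, "sampwidth": 2}


def check_wav_format(path):
    """Return a dict of mismatches as {field: (expected, actual)}.

    An empty dict means the file matches the target format.
    """
    with wave.open(str(path), "rb") as wav:
        actual = {
            "framerate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sampwidth": wav.getsampwidth(),
        }
    return {
        field: (EXPECTED[field], value)
        for field, value in actual.items()
        if value != EXPECTED[field]
    }


# Usage (path is an assumption):
# problems = check_wav_format("audio/diphone/sil-k.wav")
# if problems:
#     print("needs normalizing:", problems)
```

Running a check like this over the whole folder before segmentation catches files that were exported with the wrong settings.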
12. Validation and Rebuild Management
A diphone inventory should always be validated against the actual TTS system. A validator page should check:
- expected diphone sequence
- safe filename sequence
- whether each WAV file exists
- whether old mismatched files remain in the folder
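The last two checks can be automated by comparing the expected filenames with the contents of the audio folder. This is a sketch under the assumption that all diphone files sit directly in one folder with a `.wav` extension; `validate_folder` is a hypothetical name, not part of an existing validator.

```python
from pathlib import Path


def validate_folder(expected_files, folder):
    """Compare expected diphone filenames with the WAV files on disk.

    Returns the files that are missing and the stray files that no
    longer correspond to any expected diphone.
    """
    present = {p.name for p in Path(folder).glob("*.wav")}
    expected = set(expected_files)
    return {
        "missing": sorted(expected - present),
        "stray": sorted(present - expected),
    }


# Usage (folder path and filenames are assumptions):
# report = validate_folder(["sil-k.wav", "k-aw.wav"], "audio/diphone")
# print(report["missing"], report["stray"])
```

Stray files are worth reviewing before deletion, since they may be leftovers from an earlier naming scheme that still need to be re-segmented under the current rules.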
A diphone tracker spreadsheet is also useful for monitoring:
- recorded
- segmented
- uploaded
- validator passed
13. Common Design Mistakes
Several mistakes can damage a diphone inventory rebuild:
- changing IPA rules after audio has already been generated
- changing safe filename rules in the middle of the project
- mixing old and new diphone files in the same folder
- recording directly at diphone level without stable word-level context
- failing to validate with real dictionary words
14. Conclusion
A well-designed diphone inventory is the central resource of a diphone-based Bishnupriya Manipuri TTS system.
It must be:
- derived from a stable phoneme inventory
- built from real lexical data
- recorded through carefully selected source words
- named using a permanent safe filename system
- validated against actual playback
When designed correctly, a relatively small inventory can synthesize a large portion of the language while remaining practical to build and maintain.
Next Article
Article 6: Recording and Normalizing Diphone Audio for Bishnupriya Manipuri TTS
That article will explain how to record source words, normalize audio, remove silence, and prepare the recordings for diphone segmentation.
Suggested Citation
Designing a Diphone Inventory for Bishnupriya Manipuri. Web Edition. Version 1.0. 2026-03-10. Bishnupriya Manipuri Research Archive.