Designing a Diphone Inventory for Bishnupriya Manipuri

Core and extended diphone design for a low-resource TTS system

Part of the Bishnupriya Manipuri speech technology series

Author: Uttam Singha

Publication Information

Status: First Edition
Edition: Web Edition
Version: 1.0
Published: 2026-03-10
Last Revised: 2026-03-10
Citation Note: Cite by chapter title, archive title, edition, and year.
License / Use: Academic and non-commercial use with attribution.

Abstract. A diphone inventory is the core audio resource of a diphone-based text-to-speech system. Instead of recording every word in a language, a diphone system stores transitions between adjacent phonemes and reconstructs speech by concatenating those transitions. This article explains how to design a practical diphone inventory for Bishnupriya Manipuri, how to reduce unnecessary combinations, how to create stable, filesystem-safe filenames, and how to build a reusable recording strategy for a TTS system in a low-resource language.

1. Introduction

A diphone is a speech unit that spans the transition from the middle of one phoneme to the middle of the next phoneme.

In a diphone-based TTS system, words are not normally stored as complete recordings. Instead, words are assembled from smaller reusable audio units.

Text
  ↓
IPA
  ↓
Phoneme sequence
  ↓
Diphone sequence
  ↓
Audio concatenation
  ↓
Speech output

This method is especially useful for Bishnupriya Manipuri because:

2. What Is a Diphone?

A diphone represents the transition between two adjacent phonemes.

Example word: কথা
IPA: kɔtʰa
Phonemes:
k ɔ tʰ a
Diphones:
#-k
k-ɔ
ɔ-tʰ
tʰ-a
a-#

The boundary symbol # represents the beginning or end of a word. In filenames, this boundary is often converted into a safe symbol such as sil.
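
The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not the series' actual implementation: the function name to_diphones and the constant BOUNDARY are hypothetical, and the input is assumed to be a list of phonemes that has already been segmented.

```python
BOUNDARY = "#"  # word-boundary symbol used in this article


def to_diphones(phonemes):
    """Return the diphone labels for one word, padded with boundaries."""
    if not phonemes:
        return []
    padded = [BOUNDARY] + list(phonemes) + [BOUNDARY]
    # Pair every symbol with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["k", "ɔ", "tʰ", "a"]))
# ['#-k', 'k-ɔ', 'ɔ-tʰ', 'tʰ-a', 'a-#']
```

A word with n phonemes always yields n + 1 diphones, because the two boundaries add one transition at each end.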

3. Why Use Diphones?

The main advantages of a diphone system are:

A diphone system usually sounds more natural than isolated phoneme concatenation because it preserves the transition between adjacent sounds.

4. From Phoneme Inventory to Diphone Inventory

A diphone inventory is derived from the phoneme inventory of the language.

If a language contains N phonemes, then the theoretical maximum number of diphones is:

N × N

However, many of these combinations do not occur in actual words. Therefore, a practical diphone inventory must be built from real lexical data.

For example, if a practical Bishnupriya Manipuri inventory uses around 30 phonemes, the theoretical maximum is 30 × 30 = 900 diphones (slightly more once the word boundary # is counted as a unit), but a usable inventory may require only about 200–300.
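
The arithmetic can be checked with a short sketch. The function theoretical_max and its include_boundary flag are hypothetical names; the boundary variant assumes that # is treated as one extra symbol and that the empty pair #-# is excluded.

```python
def theoretical_max(n_phonemes, include_boundary=False):
    """Upper bound on the number of diphones for a phoneme inventory."""
    if include_boundary:
        symbols = n_phonemes + 1          # phonemes plus the boundary #
        return symbols * symbols - 1      # drop the meaningless #-# pair
    return n_phonemes * n_phonemes


n = 30
print(theoretical_max(n))                         # 900
print(theoretical_max(n, include_boundary=True))  # 960

# Share of the N x N space that a practical inventory would occupy.
for practical in (200, 300):
    share = practical / theoretical_max(n)
    print(f"{practical} diphones is {share:.0%} of the theoretical maximum")
```

Under these figures, a practical inventory covers roughly a quarter to a third of the theoretical space, which is why the inventory should be derived from real words and not enumerated mechanically.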

5. Types of Diphones

A well-designed diphone inventory should include several structural categories.

5.1 Boundary-to-phoneme

#-k
#-g
#-a
#-i

These represent word-initial sounds.

5.2 Phoneme-to-boundary

a-#
n-#
r-#

These represent word-final sounds.

5.3 Consonant-to-vowel

k-a
g-i
t-u

These are among the most frequent and important diphones.

5.4 Vowel-to-consonant

a-k
i-n
ɔ-r

These are also essential because syllables often close with consonants.

5.5 Consonant-to-consonant

g-n
k-s
n-t
r-k

These are required for clusters and Sanskritic forms.

5.6 Vowel-to-vowel

a-i
a-u
i-o

These are needed for words containing vowel sequences, diphthong-like structures, or morpheme junctions.
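
The six categories can be assigned automatically once the vowels are known. The sketch below is illustrative only: the VOWELS set is an assumed, partial list that should be replaced with the project's real vowel inventory, and classify is a hypothetical helper name.

```python
BOUNDARY = "#"
# Assumed vowel set for illustration; adjust to the real phoneme inventory.
VOWELS = {"a", "aː", "i", "u", "e", "o", "ɔ", "ə"}

CATEGORY_NAMES = {
    ("C", "V"): "consonant-to-vowel",
    ("V", "C"): "vowel-to-consonant",
    ("C", "C"): "consonant-to-consonant",
    ("V", "V"): "vowel-to-vowel",
}


def classify(diphone):
    """Return the structural category of a diphone label such as 'k-a'."""
    left, right = diphone.split("-")
    if left == BOUNDARY:
        return "boundary-to-phoneme"
    if right == BOUNDARY:
        return "phoneme-to-boundary"
    kinds = tuple("V" if p in VOWELS else "C" for p in (left, right))
    return CATEGORY_NAMES[kinds]


for label in ["#-k", "a-#", "k-a", "ɔ-r", "g-n", "a-i"]:
    print(label, "->", classify(label))
```

Counting diphones per category in this way gives a quick check that no structural type has been overlooked in the inventory.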

6. Core vs Extended Diphone Inventory

For practical TTS development, the diphone inventory should be built in phases.

6.1 Core inventory

This layer covers the most common words and phonotactic structures.

Target size:

180–220 diphones

6.2 Extended inventory

This layer adds:

Target size:

250–320 diphones

6.3 Rare and exceptional inventory

This layer should be added only after the core system is stable.
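
One possible way to assign diphones to these layers is to rank them by how often they occur in the word list. The sketch below rests on two assumptions that the article does not state: that frequency in the lexicon is an acceptable proxy for "most common", and that the extended target size is cumulative (core plus extensions). The names rank_diphones and split_tiers and the default cut-offs are hypothetical.

```python
from collections import Counter


def rank_diphones(words):
    """words: iterable of phoneme lists. Returns diphones, most frequent first."""
    counts = Counter()
    for phonemes in words:
        padded = ["#"] + list(phonemes) + ["#"]
        counts.update(f"{a}-{b}" for a, b in zip(padded, padded[1:]))
    return [diphone for diphone, _ in counts.most_common()]


def split_tiers(ranked, core_size=200, extended_size=300):
    """Cut a ranked list into core, extended, and rare layers."""
    return {
        "core": ranked[:core_size],
        "extended": ranked[core_size:extended_size],
        "rare": ranked[extended_size:],
    }


# Tiny illustrative input; the phoneme lists are placeholders, not real words.
sample = [["k", "a"], ["k", "a", "n"], ["k", "i"]]
tiers = split_tiers(rank_diphones(sample), core_size=2, extended_size=3)
print(tiers["core"])  # ['#-k', 'k-a']
```

A purely frequency-based cut may still need manual adjustment, for example to keep all diphones of a very common word in the same layer.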

7. Designing Safe Filenames

A diphone inventory must be linked to actual audio files. Therefore, each diphone needs a stable and filesystem-safe filename.

A practical mapping system is:

IPA    Safe form
#      sil
a      a
i      i
u      u
ʃ      sh
ŋ      ng
ɽ      rr
ɔ      aw
ə      schwa

Examples:
#-d   → sil-d.wav
ʃ-aː  → sh-aa.wav
aː-#  → aa-sil.wav
ɔ-r   → aw-r.wav

Once this mapping is fixed, it should never be changed during a rebuild. Otherwise old audio files become incompatible.
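
A minimal sketch of such a mapping in Python follows. The entries are taken from the table and examples in this article (aː, tʰ, and tʃ come from the worked examples and the inventory example in Section 10); the names SAFE and safe_filename are hypothetical, and rejecting unmapped non-ASCII symbols is an assumed policy.

```python
# Frozen IPA-to-safe-form mapping; symbols not listed are kept unchanged.
SAFE = {
    "#": "sil",
    "ʃ": "sh",
    "ŋ": "ng",
    "ɽ": "rr",
    "ɔ": "aw",
    "ə": "schwa",
    "aː": "aa",
    "tʰ": "th",
    "tʃ": "ch",
}


def safe_filename(diphone):
    """Convert a diphone label such as 'ʃ-aː' into a filename such as 'sh-aa.wav'."""
    left, right = diphone.split("-")
    name = f"{SAFE.get(left, left)}-{SAFE.get(right, right)}.wav"
    if not name.isascii():
        # An unmapped IPA symbol slipped through; extend SAFE before recording.
        raise ValueError(f"no safe form defined for a symbol in {diphone!r}")
    return name


for label in ["#-d", "ʃ-aː", "aː-#", "ɔ-r"]:
    print(label, "->", safe_filename(label))
```

Keeping the mapping in one dictionary makes it easy to freeze: any change to SAFE is visible in version control and can be reviewed before it invalidates existing audio files.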

8. Choosing Source Words for Recording

Diphones should not usually be recorded as isolated syllables. A better method is to record carefully selected whole words, then segment diphones out of those recordings.

A good source word list should:

Examples of useful seed words:
দিশা
মানু
কথা
অক্ষর
অগ্নি
অনুরোধ
অপমান
আকাশ
ইচ্ছা
উজ্জ্বল
একান্ত
ঔষধ
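
Choosing a compact set of source words can be treated as a coverage problem. The sketch below uses a simple greedy heuristic, which is a swapped-in general technique and not a method described in this article: it repeatedly picks the word that adds the most diphones not yet covered. The function names are hypothetical, and the lexicon entries are placeholders, not real Bishnupriya Manipuri transcriptions.

```python
def diphones_of(phonemes):
    """Diphone labels for one word, with # at both edges."""
    padded = ["#"] + list(phonemes) + ["#"]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def select_words(lexicon, targets):
    """Greedy selection. lexicon maps word -> phoneme list; targets is a set of diphones.

    Returns (chosen words in order, diphones that no word could cover).
    """
    candidates = {word: diphones_of(ph) for word, ph in lexicon.items()}
    remaining = set(targets)
    chosen = []
    while remaining and candidates:
        best = max(candidates, key=lambda w: len(candidates[w] & remaining))
        if not candidates[best] & remaining:
            break  # no remaining word adds anything new
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen, remaining


# Placeholder lexicon for illustration only.
lexicon = {
    "word_a": ["k", "a"],
    "word_b": ["k", "a", "n"],
    "word_c": ["n", "a"],
}
targets = set().union(*(diphones_of(p) for p in lexicon.values()))
print(select_words(lexicon, targets))  # (['word_b', 'word_c'], set())
```

In this toy case word_a is never selected because the other two words already contain all of its diphones, which is the kind of redundancy a seed list should avoid.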

9. From Dictionary to Diphone Inventory

The most reliable method of building a diphone inventory is to derive it from the dictionary.

Dictionary word list
        ↓
IPA conversion
        ↓
Phoneme extraction
        ↓
Diphone generation
        ↓
Unique diphone inventory

This approach ensures that the inventory reflects actual lexical usage, not just theoretical combinations.
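
The pipeline can be sketched as follows, assuming the IPA conversion step has already produced one IPA string per dictionary word. The PHONEMES list is an assumed, partial inventory used only for illustration, and tokenize and build_inventory are hypothetical names. The tokenizer uses longest-match-first so that multi-character phonemes such as tʰ are not split.

```python
# Assumed partial phoneme inventory; replace with the project's real list.
PHONEMES = ["tʰ", "tʃ", "aː", "k", "g", "t", "d", "n", "r", "ʃ", "a", "i", "u", "ɔ"]


def tokenize(ipa, phonemes=PHONEMES):
    """Split an IPA string into phonemes, preferring the longest match."""
    ordered = sorted(phonemes, key=len, reverse=True)
    result, pos = [], 0
    while pos < len(ipa):
        for phoneme in ordered:
            if ipa.startswith(phoneme, pos):
                result.append(phoneme)
                pos += len(phoneme)
                break
        else:
            raise ValueError(f"unknown symbol at position {pos} in {ipa!r}")
    return result


def build_inventory(ipa_by_word):
    """Map each unique diphone to the first dictionary word that contains it."""
    inventory = {}
    for word, ipa in ipa_by_word.items():
        padded = ["#"] + tokenize(ipa) + ["#"]
        for left, right in zip(padded, padded[1:]):
            inventory.setdefault(f"{left}-{right}", word)
    return inventory


inventory = build_inventory({"কথা": "kɔtʰa"})
print(list(inventory))  # ['#-k', 'k-ɔ', 'ɔ-tʰ', 'tʰ-a', 'a-#']
```

Storing an example word with each diphone also yields a first draft of the recording list. Note that longest-match tokenization cannot distinguish an affricate from a genuine two-consonant sequence written with the same symbols, so ambiguous cases still need manual review.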

It also supports:

10. Practical Inventory Example

A practical core diphone inventory for Bishnupriya Manipuri may include entries like:

Diphone   Safe filename   Priority   Example word
#-k       sil-k.wav       Core       কর
k-ɔ       k-aw.wav        Core       কথা
ɔ-tʰ      aw-th.wav       Core       কথা
tʰ-a      th-a.wav        Core       কথা
a-#       a-sil.wav       Core       কথা
g-n       g-n.wav         Extended   অগ্নি
k-ʃ       k-sh.wav        Extended   অক্ষর
ʃ-aː      sh-aa.wav       Core       দিশা
i-tʃ      i-ch.wav        Extended   ইচ্ছা

11. Recording Strategy

A practical recording workflow should follow these stages:

  1. freeze the IPA and filename rules
  2. prepare a clean seed word list
  3. record words in a quiet environment
  4. normalize all audio to one technical format
  5. segment diphones from word recordings
  6. store files in a clean /audio/diphone/ folder
  7. validate coverage

Recommended normalized format:

44.1 kHz
mono
16-bit WAV
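
Whether a file already matches this format can be checked with Python's standard wave module. The sketch below only inspects the header of an uncompressed WAV file and does not convert anything; TARGET and format_problems are hypothetical names.

```python
import wave

# 44.1 kHz, mono, 16-bit (2 bytes per sample)
TARGET = {"framerate": 44100, "channels": 1, "sampwidth": 2}


def format_problems(path):
    """Return a list of mismatches between a WAV file and the target format."""
    with wave.open(str(path), "rb") as wav:
        actual = {
            "framerate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sampwidth": wav.getsampwidth(),
        }
    return [
        f"{key}: expected {expected}, found {actual[key]}"
        for key, expected in TARGET.items()
        if actual[key] != expected
    ]
```

An empty list means the file matches all three settings. Calling format_problems on every file in the diphone folder before segmentation would catch recordings that were exported with a different sample rate or in stereo.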

12. Validation and Rebuild Management

A diphone inventory should always be validated against the actual TTS system. A validator page should check:

A diphone tracker spreadsheet is also useful for monitoring:
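
As a minimal sketch of one such check, the function below compares the filenames the TTS system expects with the .wav files actually present in the diphone folder. The name check_coverage, the report keys, and the decision to report unexpected files are assumptions for illustration.

```python
from pathlib import Path


def check_coverage(required_filenames, folder):
    """Compare expected diphone filenames with the .wav files in a folder."""
    present = {path.name for path in Path(folder).glob("*.wav")}
    required = set(required_filenames)
    return {
        "missing": sorted(required - present),      # expected but not recorded
        "unexpected": sorted(present - required),   # on disk but not in the inventory
    }
```

The "missing" list could feed the tracker spreadsheet directly, while the "unexpected" list tends to reveal files left over from an older filename mapping.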

13. Common Design Mistakes

Several mistakes can damage a diphone inventory rebuild:

A clean rebuild should begin by backing up the old diphone folder, creating a new empty folder, and populating it only with files generated under the current stable rule system.
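
The first two steps of a clean rebuild can be sketched with the standard library. The function name start_clean_rebuild and the timestamped backup naming scheme are assumptions; the sketch moves the old folder aside and does not delete anything.

```python
import shutil
import time
from pathlib import Path


def start_clean_rebuild(diphone_dir):
    """Move the old diphone folder to a timestamped backup and create an empty one.

    Returns the backup path, or None if there was no old folder to back up.
    """
    diphone_dir = Path(diphone_dir)
    backup = None
    if diphone_dir.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = diphone_dir.with_name(f"{diphone_dir.name}-backup-{stamp}")
        if backup.exists():
            raise FileExistsError(f"backup already exists: {backup}")
        shutil.move(str(diphone_dir), str(backup))
    diphone_dir.mkdir(parents=True)
    return backup
```

After this call the diphone folder is empty, so every file that later appears in it must have been generated under the current rules, and the backup remains available for comparison.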

14. Conclusion

A well-designed diphone inventory is the central resource of a diphone-based Bishnupriya Manipuri TTS system.

It must be:

When designed correctly, a relatively small inventory can synthesize a large portion of the language while remaining practical to build and maintain.

Next Article

Article 6
Recording and Normalizing Diphone Audio for Bishnupriya Manipuri TTS

That article will explain how to record source words, normalize audio, remove silence, and prepare the recordings for diphone segmentation.

Suggested Citation

Designing a Diphone Inventory for Bishnupriya Manipuri. Web Edition. Version 1.0. 2026-03-10. Bishnupriya Manipuri Research Archive.