Designing a Diphone Inventory for Bishnupriya Manipuri

Core and extended diphone design for a low-resource TTS system

Part of the Bishnupriya Manipuri speech technology series

Author: Uttam Singha

Publication Information


Status	First Edition
Edition	Web Edition
Version	1.0
Published	2026-03-10
Last Revised	2026-03-10
Citation Note	Cite by chapter title, archive title, edition, and year.
License / Use	Academic and non-commercial use with attribution.

Abstract. A diphone inventory is the core audio resource of a diphone-based text-to-speech system. Instead of recording every word in a language, a diphone system stores the transitions between adjacent phonemes and reconstructs speech by concatenating those transitions. This article explains how to design a practical diphone inventory for Bishnupriya Manipuri, how to eliminate combinations that never occur in real words, how to assign stable, filesystem-safe filenames, and how to build a reusable recording strategy for a low-resource TTS system.

1. Introduction

A diphone is a speech unit that spans the transition from the middle of one phoneme to the middle of the next phoneme.

In a diphone-based TTS system, words are not normally stored as complete recordings. Instead, words are assembled from smaller reusable audio units.

Text
  ↓
IPA
  ↓
Phoneme sequence
  ↓
Diphone sequence
  ↓
Audio concatenation
  ↓
Speech output

This method is especially useful for Bishnupriya Manipuri because:

the language has limited speech technology resources
a full word-level synthesis library would be too large
a diphone inventory can cover many words efficiently
the system can be built incrementally

2. What Is a Diphone?

A diphone represents the transition between two adjacent phonemes.

Example word: কথা
IPA: kɔtʰa
Phonemes:

k ɔ tʰ a

Diphones:

#-k
k-ɔ
ɔ-tʰ
tʰ-a
a-#

The boundary symbol # represents the beginning or end of a word. In filenames, this boundary is often converted into a safe symbol such as sil.
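The decomposition above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual implementation; the function name to_diphones and the BOUNDARY constant are names introduced here for the example.

```python
BOUNDARY = "#"  # word-boundary symbol used in this article


def to_diphones(phonemes):
    """Return the diphones of one word, including both boundary diphones."""
    if not phonemes:
        return []
    padded = [BOUNDARY, *phonemes, BOUNDARY]
    # Pair every symbol with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]
```

Called with the phoneme list k, ɔ, tʰ, a, the function returns the five diphones listed above, from #-k through a-#.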

3. Why Use Diphones?

The main advantages of a diphone system are:

small audio database compared with full-word synthesis
reuse of the same transitions across many words
good balance between quality and engineering simplicity
practical for low-resource language projects

A diphone system usually sounds more natural than concatenating isolated phonemes, because each unit is cut in the relatively stable middle of a phoneme and keeps the transition between adjacent sounds intact.

4. From Phoneme Inventory to Diphone Inventory

A diphone inventory is derived from the phoneme inventory of the language.

If a language contains N phonemes, then the theoretical maximum number of diphones is:

N × N

However, many of these combinations do not occur in actual words. Therefore, a practical diphone inventory must be built from real lexical data.

For example, if a practical Bishnupriya Manipuri inventory uses around 30 phonemes, the theoretical maximum is 900 phoneme-to-phoneme diphones (plus up to 60 boundary diphones of the forms #-X and X-#), but the usable inventory may require only about 200–300.
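The arithmetic can be checked with a short sketch. The function names are illustrative, and counting boundary diphones separately is an optional refinement assumed here, not part of the N × N formula itself.

```python
def theoretical_max(n_phonemes, include_boundaries=False):
    """Upper bound on the number of distinct diphones for n_phonemes."""
    total = n_phonemes * n_phonemes      # every ordered phoneme pair
    if include_boundaries:
        total += 2 * n_phonemes          # one #-X and one X-# per phoneme
    return total


def share_of_maximum(inventory_size, n_phonemes):
    """Fraction of the N x N pair space that an inventory actually uses."""
    return inventory_size / theoretical_max(n_phonemes)
```

With 30 phonemes, an inventory of 200 to 300 diphones uses roughly 22% to 33% of the 900 possible phoneme pairs.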

5. Types of Diphones

A well-designed diphone inventory should include several structural categories.

5.1 Boundary-to-phoneme

#-k
#-g
#-a
#-i

These represent word-initial sounds.

5.2 Phoneme-to-boundary

a-#
n-#
r-#

These represent word-final sounds.

5.3 Consonant-to-vowel

k-a
g-i
t-u

These are among the most frequent and important diphones.

5.4 Vowel-to-consonant

a-k
i-n
ɔ-r

These are also essential because syllables often close with consonants.

5.5 Consonant-to-consonant

g-n
k-s
n-t
r-k

These are required for clusters and Sanskritic forms.

5.6 Vowel-to-vowel

a-i
a-u
i-o

These are needed for words containing vowel sequences, diphthong-like structures, or morpheme junctions.
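The six categories can be assigned automatically once the vowels are known. The sketch below assumes a vowel set made up only of the vowel symbols that appear in this article; a real project would substitute the full vowel inventory of the language. The names VOWELS and classify are introduced for illustration.

```python
BOUNDARY = "#"

# Assumed vowel set: only the vowel symbols used in this article.
VOWELS = {"a", "aː", "i", "iː", "u", "uː", "o", "ɔ", "ə"}


def classify(diphone):
    """Return the structural category of a diphone such as 'k-a'."""
    left, right = diphone.split("-")
    if left == BOUNDARY:
        return "boundary-to-phoneme"
    if right == BOUNDARY:
        return "phoneme-to-boundary"
    # Anything that is not a known vowel is treated as a consonant.
    left_kind = "vowel" if left in VOWELS else "consonant"
    right_kind = "vowel" if right in VOWELS else "consonant"
    return f"{left_kind}-to-{right_kind}"
```

Grouping an inventory by these labels makes it easy to see whether one category, for example vowel-to-vowel, is underrepresented.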

6. Core vs Extended Diphone Inventory

For practical TTS development, the diphone inventory should be built in phases.

6.1 Core inventory

This layer covers the most common words and phonotactic structures.

boundary diphones
common consonant-vowel combinations
common vowel-consonant combinations
high-frequency vowel-vowel transitions

Target size:

180–220 diphones

6.2 Extended inventory

This layer adds:

rare clusters
learned Sanskrit forms
borrowed words
less frequent consonant transitions

Target size (cumulative, including the core layer):

250–320 diphones

6.3 Rare and exceptional inventory

This layer should be added only after the core system is stable.

highly marked learned forms
very uncommon lexical items
dictionary-only rare transitions
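This article does not prescribe how individual diphones are assigned to a layer. One possible approach, shown here purely as an assumption, is to rank diphones by how often they occur in the dictionary and cut the ranking at the upper ends of the target sizes given above (220 and 320, treating the extended figure as cumulative). The function name assign_tiers and the tie-breaking rule are hypothetical.

```python
def assign_tiers(counts, core_size=220, extended_size=320):
    """Map each diphone to 'Core', 'Extended', or 'Rare' by frequency rank.

    counts maps a diphone to its number of occurrences in the lexicon.
    """
    # Most frequent first; ties are broken alphabetically for stability.
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    tiers = {}
    for rank, diphone in enumerate(ranked):
        if rank < core_size:
            tiers[diphone] = "Core"
        elif rank < extended_size:
            tiers[diphone] = "Extended"
        else:
            tiers[diphone] = "Rare"
    return tiers
```

A purely frequency-based split is only a starting point; entries can still be promoted or demoted by hand, for example to keep all boundary diphones in the core layer.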

7. Designing Safe Filenames

A diphone inventory must be linked to actual audio files. Therefore, each diphone needs a stable and filesystem-safe filename.

A practical mapping system is:

IPA	Safe form
#	sil
aː	aa
iː	ii
uː	uu
tʰ	th
tʃ	ch
ʃ	sh
ŋ	ng
ɽ	rr
ɔ	aw
ə	schwa

Examples:

#-d   → sil-d.wav
ʃ-aː  → sh-aa.wav
aː-#  → aa-sil.wav
ɔ-r   → aw-r.wav

Once this mapping is fixed, it should not be changed partway through a build. Otherwise existing audio files no longer match the filenames the synthesizer looks up.
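The mapping can be expressed as a small lookup. This is a sketch under stated assumptions: the names SAFE_FORMS, safe_symbol, and safe_filename are introduced here, and rejecting any symbol without an ASCII safe form is a design choice of the example, not a rule stated by the article.

```python
SAFE_FORMS = {
    "#": "sil", "aː": "aa", "iː": "ii", "uː": "uu",
    "ʃ": "sh", "ŋ": "ng", "ɽ": "rr", "ɔ": "aw", "ə": "schwa",
    # th and ch follow the filenames used in the section 10 examples.
    "tʰ": "th", "tʃ": "ch",
}


def safe_symbol(symbol):
    """Return the ASCII form of one phoneme or boundary symbol."""
    safe = SAFE_FORMS.get(symbol, symbol)
    # Fail loudly instead of writing a non-ASCII filename to disk.
    if not (safe.isascii() and safe.isalnum()):
        raise ValueError(f"no safe form defined for {symbol!r}")
    return safe


def safe_filename(diphone):
    """Convert a diphone such as 'ʃ-aː' into a filename such as 'sh-aa.wav'."""
    left, right = diphone.split("-")
    return f"{safe_symbol(left)}-{safe_symbol(right)}.wav"
```

Keeping the table in one place means that every tool in the pipeline (segmentation, validation, playback) derives filenames from the same rule.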

8. Choosing Source Words for Recording

Diphones should usually not be recorded as isolated syllables. A better method is to record carefully selected whole words and then segment the diphones out of those recordings, so that each transition is produced in natural word-level context.

A good source word list should:

cover common phoneme transitions
include short and long vowels
include nasal environments
include common consonant clusters
include word-initial and word-final contrasts

Examples of useful seed words:

দিশা
মানু
কথা
অক্ষর
অগ্নি
অনুরোধ
অপমান
আকাশ
ইচ্ছা
উজ্জ্বল
একান্ত
ঔষধ
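One common way to keep a seed list short is greedy selection: repeatedly pick the word that adds the most diphones not yet covered. The article does not state that its seed words were chosen this way; the sketch below is an illustration of the general technique, and the function names are hypothetical. The lexicon argument maps each candidate word to its phoneme list.

```python
def to_diphones(phonemes):
    """Return the set of diphones in one word, boundaries included."""
    padded = ["#", *phonemes, "#"]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def select_words(lexicon):
    """Greedily choose words until every diphone in the lexicon is covered."""
    coverage = {word: to_diphones(ph) for word, ph in lexicon.items()}
    remaining = set().union(*coverage.values())
    chosen = []
    while remaining:
        # Pick the word that covers the most still-missing diphones.
        best = max(coverage, key=lambda w: len(coverage[w] & remaining))
        gain = coverage[best] & remaining
        if not gain:
            break
        chosen.append(best)
        remaining -= gain
    return chosen
```

Greedy selection does not guarantee the smallest possible list, but it usually gives a compact recording script, and the result can still be edited by hand to prefer familiar, easily pronounced words.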

9. From Dictionary to Diphone Inventory

The most reliable method of building a diphone inventory is to derive it from the dictionary.

Dictionary word list
        ↓
IPA conversion
        ↓
Phoneme extraction
        ↓
Diphone generation
        ↓
Unique diphone inventory

This approach ensures that the inventory reflects actual lexical usage, not just theoretical combinations.

It also supports:

coverage analysis
missing diphone tracking
priority-based recording

10. Practical Inventory Example

A practical core diphone inventory for Bishnupriya Manipuri may include entries like:

Diphone	Safe filename	Priority	Example word
#-k	sil-k.wav	Core	কর
k-ɔ	k-aw.wav	Core	কথা
ɔ-tʰ	aw-th.wav	Core	কথা
tʰ-a	th-a.wav	Core	কথা
a-#	a-sil.wav	Core	কথা
g-n	g-n.wav	Extended	অগ্নি
k-ʃ	k-sh.wav	Extended	অক্ষর
ʃ-aː	sh-aa.wav	Core	দিশা
i-tʃ	i-ch.wav	Extended	ইচ্ছা

11. Recording Strategy

A practical recording workflow should follow these stages:

freeze the IPA and filename rules
prepare a clean seed word list
record words in a quiet environment
normalize all audio to one technical format
segment diphones from word recordings
store files in a clean /audio/diphone/ folder
validate coverage

Recommended normalized format:

44.1 kHz
mono
16-bit WAV
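Whether a file already matches this format can be checked with Python's standard wave module. The sketch below only inspects the header; it does not resample or convert anything. The function name check_wav_format is introduced for the example.

```python
import wave


def check_wav_format(path, rate=44100, channels=1, sample_width=2):
    """Return a list of format problems; an empty list means the file is OK.

    sample_width is in bytes, so 2 corresponds to 16-bit audio.
    """
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels")
        if wav.getsampwidth() != sample_width:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples")
    return problems
```

Running this over every file before segmentation catches recordings that were exported with the wrong settings.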

12. Validation and Rebuild Management

A diphone inventory should always be validated against the actual TTS system. A validator page should check:

expected diphone sequence
safe filename sequence
whether each WAV file exists
whether old mismatched files remain in the folder
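The last two checks amount to comparing two sets of filenames. A minimal sketch, assuming the expected filenames have already been generated from the inventory and that all diphone files sit directly in one folder; the function name validate_folder is hypothetical.

```python
from pathlib import Path


def validate_folder(expected_files, folder):
    """Compare the expected WAV filenames with what is actually on disk."""
    expected = set(expected_files)
    present = {p.name for p in Path(folder).glob("*.wav")}
    return {
        "missing": sorted(expected - present),  # needed but not on disk
        "stale": sorted(present - expected),    # on disk but not in inventory
    }
```

The expected list should cover the whole inventory, not a single word; otherwise every file that the test word does not use would be reported as stale.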

A diphone tracker spreadsheet is also useful for monitoring:

recorded
segmented
uploaded
validator passed

13. Common Design Mistakes

Several mistakes can damage a diphone inventory rebuild:

changing IPA rules after audio has already been generated
changing safe filename rules in the middle of the project
mixing old and new diphone files in the same folder
recording directly at diphone level without stable word-level context
failing to validate with real dictionary words

A clean rebuild should begin by backing up the old diphone folder, creating a new empty folder, and populating it only with files generated under the current stable rule system.
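The first two steps of a clean rebuild can be scripted so that they are never skipped. This is a sketch under assumptions: the timestamped backup name and the function name start_clean_rebuild are choices made for the example.

```python
import shutil
from datetime import datetime
from pathlib import Path


def start_clean_rebuild(diphone_dir):
    """Move the old diphone folder aside and create a new empty one.

    Returns the backup path, or None if there was no old folder.
    """
    diphone_dir = Path(diphone_dir)
    backup = None
    if diphone_dir.exists():
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup = diphone_dir.with_name(f"{diphone_dir.name}-backup-{stamp}")
        if backup.exists():
            raise FileExistsError(f"backup already exists: {backup}")
        # Keep the old files out of the way without deleting them.
        shutil.move(str(diphone_dir), str(backup))
    diphone_dir.mkdir(parents=True)
    return backup
```

Because the old files are moved rather than deleted, they remain available for comparison while the new folder contains only files produced under the current rules.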

14. Conclusion

A well-designed diphone inventory is the central resource of a diphone-based Bishnupriya Manipuri TTS system.

It must be:

derived from a stable phoneme inventory
built from real lexical data
recorded through carefully selected source words
named using a permanent safe filename system
validated against actual playback

When designed correctly, a relatively small inventory can synthesize a large portion of the language while remaining practical to build and maintain.

Next in the series: Article 6
Recording and Normalizing Diphone Audio for Bishnupriya Manipuri TTS

The next article explains how to record source words, normalize audio, remove silence, and prepare the recordings for diphone segmentation.

Suggested Citation

Designing a Diphone Inventory for Bishnupriya Manipuri. Web Edition. Version 1.0. 2026-03-10. Bishnupriya Manipuri Research Archive.