Automatic Diphone Segmentation from Dictionary Audio

Abstract. Automatic diphone segmentation is one of the most important engineering steps in a diphone-based text-to-speech system. Instead of recording each diphone separately, the system records complete dictionary words, converts those words into IPA and phoneme sequences, estimates diphone boundaries, and extracts reusable audio segments. This article explains the theory and workflow of automatic diphone segmentation for Bishnupriya Manipuri dictionary audio.

1. Introduction

A diphone-based TTS system requires many short audio segments representing transitions between adjacent phonemes. Recording every diphone individually is slow and unnatural. A more efficient method is to record complete words and automatically segment the audio into diphones.

Dictionary word
      ↓
Recorded audio
      ↓
IPA conversion
      ↓
Phoneme sequence
      ↓
Diphone list
      ↓
Automatic segmentation
      ↓
Reusable diphone WAV files

This method is especially useful for Bishnupriya Manipuri because dictionary audio already provides many of the required speech units.

2. Why Segment from Whole Words?

Whole-word recordings preserve natural articulation and coarticulation. If diphones are recorded directly in isolation, the transitions may sound artificial.

By recording words first, the system captures natural coarticulation and realistic phoneme-to-phoneme transitions in a single take.

Example:
Recorded word: দিশা
IPA: diʃa
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#

3. Input Requirements

Automatic diphone segmentation requires at least three inputs:

  1. The recorded audio file
  2. The correct IPA transcription
  3. The phoneme sequence derived from the IPA

For example:

Input type   Value
Word         উপকার
Audio file   upokar.wav
IPA          upokar
Phonemes     u p o k a r

4. From IPA to Diphone List

Once IPA is available, the phoneme sequence is extracted. The diphone list is then created by pairing adjacent phonemes, including word boundaries.

Example:
IPA:
u p o k a r
Diphone sequence:
#-u
u-p
p-o
o-k
k-a
a-r
r-#

These diphones are the target units that must be extracted from the recorded waveform.
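
The pairing step can be sketched in a few lines of Python (the function name is my own):

```python
def build_diphones(phonemes):
    """Pair adjacent phonemes, using '#' to mark the word boundaries."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(build_diphones(["u", "p", "o", "k", "a", "r"]))
# → ['#-u', 'u-p', 'p-o', 'o-k', 'k-a', 'a-r', 'r-#']
```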

5. Segmentation Principle

In automatic segmentation, the waveform is divided into approximate phoneme intervals. Diphone boundaries are then placed at the phoneme midpoints, so each diphone runs from the middle of one phoneme to the middle of the next.

phoneme 1      phoneme 2      phoneme 3
|------+------|------+------|------+------|
    mid1           mid2           mid3

diphone 1 = start → mid1
diphone 2 = mid1 → mid2
diphone 3 = mid2 → mid3
diphone 4 = mid3 → end

This midpoint method is a practical approximation when exact forced alignment is not available.

6. Midpoint-Based Diphone Segmentation

A simple automatic algorithm can work as follows:

  1. measure the total duration of the normalized word recording
  2. assign weighted durations to phonemes
  3. estimate phoneme boundaries
  4. calculate phoneme midpoints
  5. extract each diphone from one midpoint to the next

Example word: কথা
IPA: kɔtʰa
Phonemes:
k ɔ tʰ a
Estimated diphones:
#-k
k-ɔ
ɔ-tʰ
tʰ-a
a-#
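
The five steps above can be sketched as follows (a minimal illustration; the function name and the 0.56 s duration are my own assumptions, with weights per phoneme class as suggested in Section 7):

```python
def segment_diphones(phonemes, weights, total_duration):
    """Estimate (diphone, start, end) spans with the midpoint method."""
    unit = total_duration / sum(weights)
    # Steps 2-4: cumulative phoneme boundaries, then phoneme midpoints.
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w * unit)
    mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
    # Step 5: diphone cut points run start → mid1 → ... → midN → end.
    cuts = [0.0] + mids + [total_duration]
    labels = ["#"] + list(phonemes) + ["#"]
    return [(f"{labels[i]}-{labels[i + 1]}", cuts[i], cuts[i + 1])
            for i in range(len(cuts) - 1)]

# Hypothetical 0.56 s recording of কথা:
for diphone, start, end in segment_diphones(
        ["k", "ɔ", "tʰ", "a"], [0.85, 1.8, 1.15, 1.8], 0.56):
    print(f"{diphone}\t{start:.3f}\t{end:.3f}")
```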

7. Weighted Phoneme Duration

Not all phonemes have equal duration in real speech. Vowels are usually longer than stops, and fricatives often last longer than plosives.

A practical weighting system can assign:

Phoneme class        Suggested relative weight
vowels               1.8
fricatives           1.35
nasals               1.25
aspirated stops      1.15
plain stops          0.85
liquids and glides   1.05

This produces more realistic approximate boundaries than equal segmentation.
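
These class weights might be stored as a simple lookup. The classification below is illustrative and deliberately partial, covering only phonemes that appear in this article's examples:

```python
CLASS_WEIGHTS = {
    "vowel": 1.8, "fricative": 1.35, "nasal": 1.25,
    "aspirated_stop": 1.15, "plain_stop": 0.85, "liquid_glide": 1.05,
}

# Partial, illustrative classification; a real system would cover the
# full Bishnupriya Manipuri phoneme inventory.
PHONEME_CLASS = {
    "a": "vowel", "i": "vowel", "u": "vowel", "o": "vowel", "ɔ": "vowel",
    "ʃ": "fricative", "m": "nasal", "n": "nasal",
    "tʰ": "aspirated_stop",
    "p": "plain_stop", "t": "plain_stop", "k": "plain_stop", "d": "plain_stop",
    "r": "liquid_glide", "l": "liquid_glide",
}

def phoneme_weight(phoneme):
    return CLASS_WEIGHTS[PHONEME_CLASS[phoneme]]

print([phoneme_weight(p) for p in ["d", "i", "ʃ", "a"]])
# → [0.85, 1.8, 1.35, 1.8]
```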

8. Example Segmentation Workflow

Input word: দিশা
Audio duration: 0.62 s
IPA: diʃa
Phonemes:
d i ʃ a

Step 1: assign weights

d = 0.85
i = 1.8
ʃ = 1.35
a = 1.8

Step 2: estimate relative intervals

total weight = 0.85 + 1.8 + 1.35 + 1.8 = 5.8
d ≈ 0.091 s
i ≈ 0.192 s
ʃ ≈ 0.144 s
a ≈ 0.192 s

Step 3: compute phoneme midpoints

d ≈ 0.045 s
i ≈ 0.187 s
ʃ ≈ 0.355 s
a ≈ 0.524 s

Step 4: extract diphones

#-d   0.000 – 0.045 s
d-i   0.045 – 0.187 s
i-ʃ   0.187 – 0.355 s
ʃ-a   0.355 – 0.524 s
a-#   0.524 – 0.620 s

Each extracted segment is then saved as an independent WAV file.
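
The arithmetic in this workflow can be reproduced with a short script (duration and weights as given in Step 1):

```python
duration = 0.62                     # total word duration in seconds
weights = [0.85, 1.8, 1.35, 1.8]   # d, i, ʃ, a
unit = duration / sum(weights)     # 0.62 / 5.8 seconds per weight unit

# Step 2: phoneme boundaries from weighted intervals
bounds = [0.0]
for w in weights:
    bounds.append(bounds[-1] + w * unit)

# Step 3: phoneme midpoints become the diphone cut points
mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

print([round(b, 3) for b in bounds])  # → [0.0, 0.091, 0.283, 0.428, 0.62]
print([round(m, 3) for m in mids])    # → [0.045, 0.187, 0.355, 0.524]
```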

9. Safe Filename Generation

Each diphone must be converted into a stable filename for storage and lookup.

Diphone   Safe filename
#-d       sil-d.wav
d-i       d-i.wav
i-ʃ       i-sh.wav
ʃ-aː      sh-aa.wav
aː-#      aa-sil.wav

This safe filename system is essential because the TTS engine, validator, and segment generator must all agree on the same naming rules.
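
A sketch of such a naming rule, covering only the substitutions shown in the table above:

```python
# Only the IPA→ASCII substitutions from the table above; a full system
# would define one entry per phoneme in the inventory.
SAFE_NAMES = {"#": "sil", "ʃ": "sh", "aː": "aa"}

def safe_filename(diphone):
    """Turn a diphone label such as 'ʃ-aː' into a stable WAV filename."""
    left, right = diphone.split("-")
    return f"{SAFE_NAMES.get(left, left)}-{SAFE_NAMES.get(right, right)}.wav"

print(safe_filename("#-d"), safe_filename("ʃ-aː"))
# → sil-d.wav sh-aa.wav
```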

10. Audio Extraction

Once start and end times are estimated, the corresponding diphone audio can be sliced from the normalized word recording.

This extraction may be performed using standard command-line audio tools such as SoX or FFmpeg, or by a small script that slices the samples between the estimated start and end times.

A diphone file may be extracted as:
start = 0.182 s
end   = 0.315 s
file  = i-sh.wav
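
A minimal slicing sketch using only Python's standard-library wave module (the source filename is hypothetical):

```python
import wave

def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the samples between start_s and end_s into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))
        frames = src.readframes(int((end_s - start_s) * rate))
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count is corrected when the file closes
        dst.writeframes(frames)

# Hypothetical usage with the timings above:
# extract_segment("disha_normalized.wav", "i-sh.wav", 0.182, 0.315)
```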

11. Why Some Segments Fail

Automatic segmentation is useful, but not perfect. Several things may cause bad diphone extraction: inaccurate duration estimates, background noise, leading or trailing silence in the recording, heavy coarticulation that blurs phoneme boundaries, and mismatches between the IPA transcription and what was actually spoken.

Segmentation quality improves significantly when recordings are first normalized for silence, loudness, sample rate, and channel consistency.

12. Practical Engineering Workflow

A practical Bishnupriya Manipuri diphone segmentation workflow may follow these steps:

1. Record dictionary word audio
2. Normalize all recordings
3. Convert words to IPA
4. Extract phoneme sequence
5. Build diphone sequence
6. Estimate phoneme timing
7. Slice audio into diphone files
8. Save using stable safe filenames
9. Validate against expected diphone inventory

This process allows large numbers of diphones to be generated from a manageable set of word recordings.

13. Validator Integration

After segmentation, a validator should compare the diphone filenames expected from each word's IPA against the files actually present in the diphone audio folder.

This reveals missing diphones, unexpected extra files, and naming mismatches.

Validator result:
Word: দিশা
Expected:
sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

Missing:
i-sh.wav
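
Such a validator can be sketched with a simple set comparison (the function name is my own):

```python
import os

def validate_diphones(expected_files, audio_dir):
    """Report which expected diphone files are missing and which are extra."""
    present = set(os.listdir(audio_dir))
    missing = [f for f in expected_files if f not in present]
    extra = sorted(present - set(expected_files))
    return missing, extra

# Hypothetical usage for দিশা:
# expected = ["sil-d.wav", "d-i.wav", "i-sh.wav", "sh-a.wav", "a-sil.wav"]
# missing, extra = validate_diphones(expected, "diphones/")
```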

14. Manual Review and Correction

Even with automatic segmentation, a small amount of manual review is often necessary. This is especially true for very short plain stops, aspirated consonants, and word-initial or word-final diphones, where estimated boundaries are least reliable.

Manual review may involve listening to each extracted segment, adjusting boundary times by hand, and re-recording words whose audio cannot be segmented cleanly.

15. Advantages of Automatic Segmentation

Despite its imperfections, automatic diphone segmentation offers major advantages: it produces large numbers of diphones from a manageable set of word recordings, keeps filenames consistent across the whole pipeline, and reuses dictionary audio that already exists.

It is particularly effective in under-resourced language projects where manual phonetic annotation is limited.

16. Conclusion

Automatic diphone segmentation from dictionary audio provides a practical path toward building a reusable speech database for Bishnupriya Manipuri.

The pipeline combines:

word audio
+ IPA conversion
+ phoneme extraction
+ timing estimation
+ safe filename generation
= diphone library

Once the segmented diphones are validated and stored in a clean audio folder, they can be used directly by a diphone-based TTS engine.

Next Article

Article 8
Implementing a Bishnupriya Manipuri TTS Engine in PHP and JavaScript

The next article explains how diphone files are loaded, sequenced, and played in a web-based TTS system.