Automatic Diphone Segmentation from Dictionary Audio

Abstract. Automatic diphone segmentation is one of the most important engineering steps in a diphone-based text-to-speech system. Instead of recording each diphone separately, the system records complete dictionary words, converts those words into IPA and phoneme sequences, estimates diphone boundaries, and extracts reusable audio segments. This article explains the theory and workflow of automatic diphone segmentation for Bishnupriya Manipuri dictionary audio.

1. Introduction

A diphone-based TTS system requires many short audio segments representing transitions between adjacent phonemes. Recording every diphone individually is slow and unnatural. A more efficient method is to record complete words and automatically segment the audio into diphones.

Dictionary word
      ↓
Recorded audio
      ↓
IPA conversion
      ↓
Phoneme sequence
      ↓
Diphone list
      ↓
Automatic segmentation
      ↓
Reusable diphone WAV files

This method is especially useful for Bishnupriya Manipuri because dictionary audio already provides many of the required speech units.

2. Why Segment from Whole Words?

Whole-word recordings preserve natural articulation and coarticulation. If diphones are recorded directly in isolation, the transitions may sound artificial.

By recording words first, the system captures natural coarticulation and realistic phoneme-to-phoneme transitions in a single take.

Example:
Recorded word: দিশা
IPA: diʃa
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#

3. Input Requirements

Automatic diphone segmentation requires at least three inputs:

  1. The recorded audio file
  2. The correct IPA transcription
  3. The phoneme sequence derived from the IPA

For example:

Input type   Value
Word         উপকার
Audio file   upokar.wav
IPA          upokar
Phonemes     u p o k a r

4. From IPA to Diphone List

Once IPA is available, the phoneme sequence is extracted. The diphone list is then created by pairing adjacent phonemes, including word boundaries.

Example:
IPA:
u p o k a r
Diphone sequence:
#-u
u-p
p-o
o-k
k-a
a-r
r-#

These diphones are the target units that must be extracted from the recorded waveform.
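
The pairing step can be sketched in a few lines of Python (the function name is my own):

```python
def build_diphones(phonemes):
    """Pair adjacent phonemes, using '#' to mark the word boundaries."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(build_diphones(["u", "p", "o", "k", "a", "r"]))
# → ['#-u', 'u-p', 'p-o', 'o-k', 'k-a', 'a-r', 'r-#']
```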

5. Segmentation Principle

In automatic segmentation, the waveform is divided into approximate phoneme intervals. Diphone boundaries are then placed at the phoneme midpoints, so each diphone runs from the middle of one phoneme to the middle of the next.

phoneme 1      phoneme 2      phoneme 3
|------+------|------+------|------+------|
    mid1           mid2           mid3

diphone 1 = start → mid1
diphone 2 = mid1 → mid2
diphone 3 = mid2 → mid3
diphone 4 = mid3 → end

This midpoint method is a practical approximation when exact forced alignment is not available.

6. Midpoint-Based Diphone Segmentation

A simple automatic algorithm can work as follows:

  1. measure the total duration of the normalized word recording
  2. assign weighted durations to phonemes
  3. estimate phoneme boundaries
  4. calculate phoneme midpoints
  5. extract each diphone from one midpoint to the next

Example word: কথা
IPA: kɔtʰa
Phonemes:
k ɔ tʰ a
Estimated diphones:
#-k
k-ɔ
ɔ-tʰ
tʰ-a
a-#
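
The five steps above can be sketched as follows (a minimal illustration; the function name and the 0.56 s duration are my own assumptions, with weights per phoneme class as suggested in Section 7):

```python
def segment_diphones(phonemes, weights, total_duration):
    """Estimate (diphone, start, end) spans with the midpoint method."""
    unit = total_duration / sum(weights)
    # Steps 2-4: cumulative phoneme boundaries, then phoneme midpoints.
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w * unit)
    mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
    # Step 5: diphone cut points run start → mid1 → ... → midN → end.
    cuts = [0.0] + mids + [total_duration]
    labels = ["#"] + list(phonemes) + ["#"]
    return [(f"{labels[i]}-{labels[i + 1]}", cuts[i], cuts[i + 1])
            for i in range(len(cuts) - 1)]

# Hypothetical 0.56 s recording of কথা:
for diphone, start, end in segment_diphones(
        ["k", "ɔ", "tʰ", "a"], [0.85, 1.8, 1.15, 1.8], 0.56):
    print(f"{diphone}\t{start:.3f}\t{end:.3f}")
```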

7. Weighted Phoneme Duration

Not all phonemes have equal duration in real speech. Vowels are usually longer than stops, and fricatives often last longer than plosives.

A practical weighting system can assign:

Phoneme class        Suggested relative weight
vowels               1.8
fricatives           1.35
nasals               1.25
aspirated stops      1.15
plain stops          0.85
liquids and glides   1.05

This produces more realistic approximate boundaries than equal segmentation.
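
These class weights might be stored as a simple lookup. The classification below is illustrative and deliberately partial, covering only phonemes that appear in this article's examples:

```python
CLASS_WEIGHTS = {
    "vowel": 1.8, "fricative": 1.35, "nasal": 1.25,
    "aspirated_stop": 1.15, "plain_stop": 0.85, "liquid_glide": 1.05,
}

# Partial, illustrative classification; a real system would cover the
# full Bishnupriya Manipuri phoneme inventory.
PHONEME_CLASS = {
    "a": "vowel", "i": "vowel", "u": "vowel", "o": "vowel", "ɔ": "vowel",
    "ʃ": "fricative", "m": "nasal", "n": "nasal",
    "tʰ": "aspirated_stop",
    "p": "plain_stop", "t": "plain_stop", "k": "plain_stop", "d": "plain_stop",
    "r": "liquid_glide", "l": "liquid_glide",
}

def phoneme_weight(phoneme):
    return CLASS_WEIGHTS[PHONEME_CLASS[phoneme]]

print([phoneme_weight(p) for p in ["d", "i", "ʃ", "a"]])
# → [0.85, 1.8, 1.35, 1.8]
```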

8. Example Segmentation Workflow

Input word: দিশা
Audio duration: 0.62 s
IPA: diʃa
Phonemes:
d i ʃ a

Step 1: assign weights

d = 0.85
i = 1.8
ʃ = 1.35
a = 1.8

Step 2: estimate relative intervals

total weight = 0.85 + 1.8 + 1.35 + 1.8 = 5.8
d ≈ 0.091 s
i ≈ 0.192 s
ʃ ≈ 0.144 s
a ≈ 0.192 s

Step 3: compute phoneme midpoints

d ≈ 0.045 s
i ≈ 0.187 s
ʃ ≈ 0.355 s
a ≈ 0.524 s

Step 4: extract diphones

#-d   0.000 – 0.045 s
d-i   0.045 – 0.187 s
i-ʃ   0.187 – 0.355 s
ʃ-a   0.355 – 0.524 s
a-#   0.524 – 0.620 s

Each extracted segment is then saved as an independent WAV file.
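
The arithmetic in this workflow can be reproduced with a short script (duration and weights as given in Step 1):

```python
duration = 0.62                     # total word duration in seconds
weights = [0.85, 1.8, 1.35, 1.8]   # d, i, ʃ, a
unit = duration / sum(weights)     # 0.62 / 5.8 seconds per weight unit

# Step 2: phoneme boundaries from weighted intervals
bounds = [0.0]
for w in weights:
    bounds.append(bounds[-1] + w * unit)

# Step 3: phoneme midpoints become the diphone cut points
mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

print([round(b, 3) for b in bounds])  # → [0.0, 0.091, 0.283, 0.428, 0.62]
print([round(m, 3) for m in mids])    # → [0.045, 0.187, 0.355, 0.524]
```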

9. Safe Filename Generation

Each diphone must be converted into a stable filename for storage and lookup.

Diphone   Safe filename
#-d       sil-d.wav
d-i       d-i.wav
i-ʃ       i-sh.wav
ʃ-aː      sh-aa.wav
aː-#      aa-sil.wav

This safe filename system is essential because the TTS engine, validator, and segment generator must all agree on the same naming rules.
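
A sketch of such a naming rule, covering only the substitutions shown in the table above:

```python
# Only the IPA→ASCII substitutions from the table above; a full system
# would define one entry per phoneme in the inventory.
SAFE_NAMES = {"#": "sil", "ʃ": "sh", "aː": "aa"}

def safe_filename(diphone):
    """Turn a diphone label such as 'ʃ-aː' into a stable WAV filename."""
    left, right = diphone.split("-")
    return f"{SAFE_NAMES.get(left, left)}-{SAFE_NAMES.get(right, right)}.wav"

print(safe_filename("#-d"), safe_filename("ʃ-aː"))
# → sil-d.wav sh-aa.wav
```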

10. Audio Extraction

Once start and end times are estimated, the corresponding diphone audio can be sliced from the normalized word recording.

This extraction may be performed using standard command-line audio tools such as SoX or FFmpeg, or by a small script that slices the samples between the estimated start and end times.

A diphone file may be extracted as:
start = 0.182 s
end   = 0.315 s
file  = i-sh.wav
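
A minimal slicing sketch using only Python's standard-library wave module (the source filename is hypothetical):

```python
import wave

def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the samples between start_s and end_s into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))
        frames = src.readframes(int((end_s - start_s) * rate))
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count is corrected when the file closes
        dst.writeframes(frames)

# Hypothetical usage with the timings above:
# extract_segment("disha_normalized.wav", "i-sh.wav", 0.182, 0.315)
```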

11. Why Some Segments Fail

Automatic segmentation is useful, but not perfect. Several things may cause bad diphone extraction: inaccurate duration estimates, background noise, leading or trailing silence in the recording, heavy coarticulation that blurs phoneme boundaries, and mismatches between the IPA transcription and what was actually spoken.

Segmentation quality improves significantly when recordings are first normalized for silence, loudness, sample rate, and channel consistency.

12. Practical Engineering Workflow

A practical Bishnupriya Manipuri diphone segmentation workflow may follow these steps:

1. Record dictionary word audio
2. Normalize all recordings
3. Convert words to IPA
4. Extract phoneme sequence
5. Build diphone sequence
6. Estimate phoneme timing
7. Slice audio into diphone files
8. Save using stable safe filenames
9. Validate against expected diphone inventory

This process allows large numbers of diphones to be generated from a manageable set of word recordings.

13. Validator Integration

After segmentation, a validator should compare the diphone filenames expected from each word's IPA against the files actually present in the diphone audio folder.

This reveals missing diphones, unexpected extra files, and naming mismatches.

Validator result:
Word: দিশা
Expected:
sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

Missing:
i-sh.wav
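
Such a validator can be sketched with a simple set comparison (the function name is my own):

```python
import os

def validate_diphones(expected_files, audio_dir):
    """Report which expected diphone files are missing and which are extra."""
    present = set(os.listdir(audio_dir))
    missing = [f for f in expected_files if f not in present]
    extra = sorted(present - set(expected_files))
    return missing, extra

# Hypothetical usage for দিশা:
# expected = ["sil-d.wav", "d-i.wav", "i-sh.wav", "sh-a.wav", "a-sil.wav"]
# missing, extra = validate_diphones(expected, "diphones/")
```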

14. Manual Review and Correction

Even with automatic segmentation, a small amount of manual review is often necessary. This is especially true for very short plain stops, aspirated consonants, and word-initial or word-final diphones, where estimated boundaries are least reliable.

Manual review may involve listening to each extracted segment, adjusting boundary times by hand, and re-recording words whose audio cannot be segmented cleanly.

15. Advantages of Automatic Segmentation

Despite its imperfections, automatic diphone segmentation offers major advantages: it produces large numbers of diphones from a manageable set of word recordings, keeps filenames consistent across the whole pipeline, and reuses dictionary audio that already exists.

It is particularly effective in under-resourced language projects where manual phonetic annotation is limited.

16. Conclusion

Automatic diphone segmentation from dictionary audio provides a practical path toward building a reusable speech database for Bishnupriya Manipuri.

The pipeline combines:

word audio
+ IPA conversion
+ phoneme extraction
+ timing estimation
+ safe filename generation
= diphone library

Once the segmented diphones are validated and stored in a clean audio folder, they can be used directly by a diphone-based TTS engine.

Next Article

Article 8
Implementing a Bishnupriya Manipuri TTS Engine in PHP and JavaScript

The next article explains how diphone files are loaded, sequenced, and played in a web-based TTS system.