Automatic Diphone Segmentation from Dictionary Audio
1. Introduction
A diphone-based TTS system requires many short audio segments representing transitions between adjacent phonemes. Recording every diphone individually is slow and unnatural. A more efficient method is to record complete words and automatically segment the audio into diphones.
Dictionary word
↓
Recorded audio
↓
IPA conversion
↓
Phoneme sequence
↓
Diphone list
↓
Automatic segmentation
↓
Reusable diphone WAV files
This method is especially useful for Bishnupriya Manipuri because dictionary audio already provides many of the required speech units.
2. Why Segment from Whole Words?
Whole-word recordings preserve natural articulation and coarticulation. If diphones are recorded directly in isolation, the transitions may sound artificial.
By recording words first, the system captures:
- natural consonant-vowel transitions
- natural vowel-to-consonant transitions
- realistic timing and articulation
- better continuity for TTS playback
Recorded word: দিশা
IPA: diʃa
Diphones:
#-d d-i i-ʃ ʃ-a a-#
3. Input Requirements
Automatic diphone segmentation requires at least three inputs:
- The recorded audio file
- The correct IPA transcription
- The phoneme sequence derived from the IPA
For example:
| Input type | Value |
|---|---|
| Word | উপকার |
| Audio file | upokar.wav |
| IPA | upokar |
| Phonemes | u p o k a r |
4. From IPA to Diphone List
Once IPA is available, the phoneme sequence is extracted. The diphone list is then created by pairing adjacent phonemes, including word boundaries.
IPA:
u p o k a r
Diphone sequence:
#-u u-p p-o o-k k-a a-r r-#
These diphones are the target units that must be extracted from the recorded waveform.
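The pairing step above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code; the function name `diphones_from_phonemes` is an assumption.

```python
def diphones_from_phonemes(phonemes):
    """Pair adjacent phonemes, padding with '#' for the word boundaries."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(diphones_from_phonemes(["u", "p", "o", "k", "a", "r"]))
# → ['#-u', 'u-p', 'p-o', 'o-k', 'k-a', 'a-r', 'r-#']
```

Note that a word with n phonemes always yields n + 1 diphones, because both word boundaries contribute a transition.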
5. Segmentation Principle
In automatic segmentation, the waveform is divided into approximate phoneme intervals. Diphone boundaries are then placed between phoneme midpoints.
  phoneme 1     phoneme 2     phoneme 3
|-------------|-------------|-------------|
      mid1          mid2          mid3

diphone 1 = start → mid1
diphone 2 = mid1 → mid2
diphone 3 = mid2 → mid3
diphone 4 = mid3 → end
This midpoint method is a practical approximation when exact forced alignment is not available.
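The midpoint rule can be stated compactly in code. The sketch below (hypothetical helper, not from the project) takes a list of phoneme boundary times and returns the diphone cut intervals:

```python
def diphone_intervals(boundaries):
    """Given phoneme boundary times [t0, t1, ..., tn] (t0 = word start,
    tn = word end), return diphone (start, end) pairs cut at midpoints."""
    mids = [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]
    cuts = [boundaries[0]] + mids + [boundaries[-1]]
    return list(zip(cuts, cuts[1:]))

# three equal-length phonemes over 0.6 s → four diphone intervals
result = diphone_intervals([0.0, 0.2, 0.4, 0.6])
print([(round(s, 3), round(e, 3)) for s, e in result])
# → [(0.0, 0.1), (0.1, 0.3), (0.3, 0.5), (0.5, 0.6)]
```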
6. Midpoint-Based Diphone Segmentation
A simple automatic algorithm can work as follows:
- measure the total duration of the normalized word recording
- assign weighted durations to phonemes
- estimate phoneme boundaries
- calculate phoneme midpoints
- extract each diphone from one midpoint to the next
IPA: kɔtʰa
Phonemes:
k ɔ tʰ a
Estimated diphones:
#-k k-ɔ ɔ-tʰ tʰ-a a-#
7. Weighted Phoneme Duration
Not all phonemes have equal duration in real speech. Vowels are usually longer than stops, and fricatives often last longer than plosives.
A practical weighting system can assign:
| Phoneme class | Suggested relative weight |
|---|---|
| vowels | 1.8 |
| fricatives | 1.35 |
| nasals | 1.25 |
| aspirated stops | 1.15 |
| plain stops | 0.85 |
| liquids and glides | 1.05 |
This produces more realistic approximate boundaries than equal segmentation.
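One way to apply the table above is a two-level lookup: phoneme → class → weight. The per-phoneme class map below is a hypothetical example covering only the symbols used in this article; a real system would need the full inventory.

```python
# Relative duration weights by phoneme class (values from the table above).
CLASS_WEIGHTS = {
    "vowel": 1.8, "fricative": 1.35, "nasal": 1.25,
    "aspirated_stop": 1.15, "plain_stop": 0.85, "liquid_glide": 1.05,
}

# Hypothetical phoneme → class map for this article's examples only.
PHONEME_CLASS = {
    "a": "vowel", "i": "vowel", "o": "vowel", "u": "vowel", "ɔ": "vowel",
    "ʃ": "fricative", "s": "fricative",
    "m": "nasal", "n": "nasal",
    "tʰ": "aspirated_stop",
    "k": "plain_stop", "d": "plain_stop", "p": "plain_stop", "t": "plain_stop",
    "r": "liquid_glide", "l": "liquid_glide",
}

def weight(phoneme):
    """Look up a phoneme's relative duration weight; default to 1.0."""
    return CLASS_WEIGHTS.get(PHONEME_CLASS.get(phoneme, ""), 1.0)

print([weight(p) for p in ["k", "ɔ", "tʰ", "a"]])  # → [0.85, 1.8, 1.15, 1.8]
```

Defaulting unknown phonemes to 1.0 keeps segmentation from failing on symbols missing from the map.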
8. Example Segmentation Workflow
Audio duration: 0.62 s
IPA: diʃa
Phonemes:
d i ʃ a
Step 1: assign weights
d = 0.85 i = 1.8 ʃ = 1.35 a = 1.8
Step 2: estimate relative intervals
Step 3: compute phoneme midpoints
Step 4: extract diphones
#-d d-i i-ʃ ʃ-a a-#
Each extracted segment is then saved as an independent WAV file.
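The four steps above can be carried out for দিশা in one short function. This is a sketch under the Section 7 weights; `segment_word` and its return shape are assumptions, not the project's actual API.

```python
def segment_word(duration, phonemes, weights):
    """Estimate weighted phoneme boundaries, then cut diphones at
    phoneme midpoints. Returns {diphone_name: (start_s, end_s)}."""
    total = sum(weights[p] for p in phonemes)
    # Step 2: cumulative boundaries proportional to the weights
    bounds = [0.0]
    for p in phonemes:
        bounds.append(bounds[-1] + duration * weights[p] / total)
    # Step 3: phoneme midpoints become diphone cut points
    mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
    cuts = [0.0] + mids + [duration]
    # Step 4: label each interval with its diphone name
    padded = ["#"] + list(phonemes) + ["#"]
    names = [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
    return {n: (round(s, 3), round(e, 3))
            for n, (s, e) in zip(names, zip(cuts, cuts[1:]))}

w = {"d": 0.85, "i": 1.8, "ʃ": 1.35, "a": 1.8}  # Step 1: assign weights
print(segment_word(0.62, ["d", "i", "ʃ", "a"], w))
# → {'#-d': (0.0, 0.045), 'd-i': (0.045, 0.187),
#    'i-ʃ': (0.187, 0.355), 'ʃ-a': (0.355, 0.524), 'a-#': (0.524, 0.62)}
```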
9. Safe Filename Generation
Each diphone must be converted into a stable filename for storage and lookup.
| Diphone | Safe filename |
|---|---|
| #-d | sil-d.wav |
| d-i | d-i.wav |
| i-ʃ | i-sh.wav |
| ʃ-aː | sh-aa.wav |
| aː-# | aa-sil.wav |
This safe filename system is essential because the TTS engine, validator, and segment generator must all agree on the same naming rules.
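A shared mapping table is the simplest way to guarantee that agreement. The mapping below is a hypothetical sample covering the symbols shown in the table; the real project would maintain one complete table used by all three components.

```python
# Hypothetical IPA → ASCII-safe symbol map; the real project must share
# one agreed table across engine, validator, and segment generator.
SAFE = {"#": "sil", "ʃ": "sh", "aː": "aa", "tʰ": "th"}

def safe_filename(diphone):
    """Map each half of a 'left-right' diphone to its safe symbol."""
    left, right = diphone.split("-")
    return f"{SAFE.get(left, left)}-{SAFE.get(right, right)}.wav"

print(safe_filename("#-d"))   # → sil-d.wav
print(safe_filename("ʃ-aː"))  # → sh-aa.wav
print(safe_filename("d-i"))   # → d-i.wav
```

Symbols not in the table pass through unchanged, which is only safe for phonemes that are already plain ASCII.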
10. Audio Extraction
Once start and end times are estimated, the corresponding diphone audio can be sliced from the normalized word recording.
This extraction may be performed using:
- FFmpeg
- Praat scripts
- custom Python tools
- PHP-based server pipelines
start = 0.182 s end = 0.315 s file = i-sh.wav
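For mono PCM recordings, the slicing can even be done with Python's standard-library `wave` module and no external tools. A minimal sketch, assuming 16-bit mono input:

```python
import wave

def extract_segment(src_path, dst_path, start, end):
    """Copy the [start, end] interval (in seconds) of a PCM WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start * rate))              # seek to start frame
        frames = src.readframes(int((end - start) * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)                      # same rate/width/channels
        dst.writeframes(frames)

# e.g. extract_segment("upokar.wav", "i-sh.wav", 0.182, 0.315)
```

The same cut with FFmpeg would use `-ss` and `-to` on the normalized source file.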
11. Why Some Segments Fail
Automatic segmentation is useful but not perfect. Several factors can cause poor diphone extraction:
- incorrect IPA transcription
- wrong phoneme tokenization
- inconsistent speaking speed
- excess silence
- poor audio normalization
- midpoint estimates that do not match the real articulation
12. Practical Engineering Workflow
A practical Bishnupriya Manipuri diphone segmentation workflow may follow these steps:
1. Record dictionary word audio
2. Normalize all recordings
3. Convert words to IPA
4. Extract phoneme sequence
5. Build diphone sequence
6. Estimate phoneme timing
7. Slice audio into diphone files
8. Save using stable safe filenames
9. Validate against expected diphone inventory
This process allows large numbers of diphones to be generated from a manageable set of word recordings.
13. Validator Integration
After segmentation, a validator should compare:
- expected diphone list
- safe filename list
- actual WAV files in the diphone folder
This reveals:
- missing files
- mismatched filenames
- old files from older rule versions
- coverage gaps in the inventory
Word: দিশা
Expected: sil-d.wav d-i.wav i-sh.wav sh-a.wav a-sil.wav
Missing: i-sh.wav
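A validator of this kind reduces to two set differences. A minimal sketch (the function name and report shape are assumptions):

```python
import os

def validate_diphones(expected, folder):
    """Compare expected safe filenames against the WAV files on disk."""
    actual = {f for f in os.listdir(folder) if f.endswith(".wav")}
    return {
        "missing": sorted(set(expected) - actual),      # not yet segmented
        "unexpected": sorted(actual - set(expected)),   # stale/renamed files
    }

# e.g. for দিশা:
# validate_diphones(
#     ["sil-d.wav", "d-i.wav", "i-sh.wav", "sh-a.wav", "a-sil.wav"],
#     "diphones/")
```

The "unexpected" list is what catches files left over from older naming-rule versions.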
14. Manual Review and Correction
Even with automatic segmentation, a small amount of manual review is often necessary. This is especially true for:
- rare clusters
- learned or borrowed words
- very short words
- noisy recordings
Manual review may involve:
- checking waveform boundaries
- replacing a bad segment with a better one
- re-recording a problematic source word
15. Advantages of Automatic Segmentation
Despite its imperfections, automatic diphone segmentation offers major advantages:
- fast expansion of the diphone inventory
- natural coarticulation from whole-word recordings
- reduced recording workload
- scalable workflow for dictionary-based TTS
It is particularly effective in under-resourced language projects where manual phonetic annotation is limited.
16. Conclusion
Automatic diphone segmentation from dictionary audio provides a practical path toward building a reusable speech database for Bishnupriya Manipuri.
The pipeline combines:
word audio + IPA conversion + phoneme extraction + timing estimation + safe filename generation = diphone library
Once the segmented diphones are validated and stored in a clean audio folder, they can be used directly by a diphone-based TTS engine.
Next Article
Article 8: Implementing a Bishnupriya Manipuri TTS Engine in PHP and JavaScript
The next article explains how diphone files are loaded, sequenced, and played in a web-based TTS system.