Recording and Normalizing Diphone Audio for Bishnupriya Manipuri TTS
1. Introduction
A diphone TTS system relies on a database of short speech segments extracted from recorded words. If the recordings are inconsistent in volume, tone, or timing, the resulting speech synthesis will sound unnatural.
Therefore, careful recording and normalization are essential.
Recording
↓
Audio cleanup
↓
Normalization
↓
Segmentation
↓
Diphone database
2. Recording Environment
To produce high-quality recordings, the recording environment must be controlled.
Recommended setup
- quiet room with minimal background noise
- soft surfaces to reduce echo
- stable microphone position
- consistent speaking distance
- constant recording level
Avoid moving closer to or farther from the microphone during the recording session.
3. Microphone and Recording Software
A professional studio microphone is ideal, but many USB microphones are sufficient for linguistic recording.
Common recording software includes:
- Audacity
- Adobe Audition
- Reaper
- Praat
These programs allow precise editing and waveform inspection.
4. Recording Format
All recordings must use the same technical format to avoid problems during concatenation.
| Parameter | Recommended value |
|---|---|
| Sample rate | 44100 Hz |
| Channels | Mono |
| Bit depth | 16-bit PCM |
| File format | WAV |
Example file name: 001_disha.wav
5. Recording Strategy
Rather than recording isolated phonemes or diphones, it is better to record complete words.
Word-level recordings provide natural coarticulation, which improves diphone quality.
Example words:
দিশা মানু কথা অক্ষর অগ্নি অনুরোধ উজ্জ্বল ঔষধ
These words contain a variety of phoneme transitions that can later be segmented into diphones.
6. Removing Silence
After recording, each audio file should be trimmed so that it contains minimal silence at the beginning and end.
Excess silence causes timing problems in synthesized speech.
Before trimming:
[ silence ] word audio [ silence ]
After trimming:
[ word audio ]
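The trimming step can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: it assumes 16-bit mono PCM input, and the function name `trim_silence`, the amplitude threshold of 500, and the 10 ms padding are all hypothetical choices that would need tuning against real recordings.

```python
import array
import sys
import wave


def trim_silence(src_path, dst_path, threshold=500, padding_ms=10):
    """Copy a 16-bit mono WAV file, dropping quiet samples at both ends.

    threshold  : absolute amplitude below which a sample counts as silence
                 (assumed value, tune per recording setup)
    padding_ms : audio kept before the first and after the last loud sample
    Returns the (start, end) sample indices that were kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2 or params.nchannels != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Indices of every sample loud enough to count as speech
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        raise ValueError("no audio above threshold")

    pad = int(params.framerate * padding_ms / 1000)
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(samples))

    trimmed = samples[start:end]
    if sys.byteorder == "big":
        trimmed.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(params.framerate)
        dst.writeframes(trimmed.tobytes())
    return start, end
```

A simple amplitude threshold can cut off quiet word-initial or word-final consonants, so the trimmed files are worth checking by ear or in a waveform editor.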
7. Volume Normalization
Recordings often vary in loudness. Normalization ensures that all audio files have consistent volume levels.
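Normalization can be performed using audio software or command-line tools. For example, the following FFmpeg command multiplies the amplitude of a file by a fixed factor of 1.5:
ffmpeg -i input.wav -af "volume=1.5" output.wav
A fixed gain does not by itself equalize files that start at different levels, and it can push recordings that are already loud into clipping. Alternatively, RMS normalization can be applied: each file is scaled to the same average signal level, which tracks perceived loudness more closely than the peak level does.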
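RMS normalization can be illustrated with a short Python sketch operating on a list of 16-bit sample values. The function names and the target RMS of 3000 are assumptions made for this example, not values from the project; the gain is capped so that no sample leaves the 16-bit range.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of PCM sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=3000.0, limit=32767):
    """Scale 16-bit samples so that their RMS level approaches target_rms.

    target_rms is an assumed value; choose one that suits the recordings.
    The gain is reduced when necessary so that the peak stays within limit.
    """
    current = rms(samples)
    if current == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    peak = max(abs(s) for s in samples)
    gain = min(gain, limit / peak)  # cap the gain to avoid clipping
    return [int(round(s * gain)) for s in samples]
```

When the cap applies, the file ends up quieter than the target, so a lower target shared by all files gives more uniform results than a high one.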
8. Batch Normalization Using FFmpeg
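Large audio collections can be normalized automatically using batch scripts. The following loop is written for a Windows batch file (use a single % instead of %% when typing it directly at the command prompt) and converts every WAV file in the current folder:
for %%f in (*.wav) do ( ffmpeg -i "%%f" -ar 44100 -ac 1 -c:a pcm_s16le "fixed_%%f" )
This ensures that all files have a consistent sample rate, channel count, and bit depth. It does not change loudness, so volume normalization remains a separate step.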
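For platforms other than Windows, the same conversion could be driven from Python. The sketch below only reuses the FFmpeg options shown above; the function names are hypothetical, and writing the output to a separate folder under the original file name is an assumption of this example, chosen so that converted files are never picked up again as input.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, out_dir):
    """Return the FFmpeg argument list that converts src to 44100 Hz, mono, 16-bit PCM."""
    src = Path(src)
    dst = Path(out_dir) / src.name  # keep the original file name
    return ["ffmpeg", "-i", str(src), "-ar", "44100", "-ac", "1",
            "-c:a", "pcm_s16le", str(dst)]


def convert_all(in_dir, out_dir):
    """Convert every WAV file in in_dir (requires ffmpeg on the PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.wav")):
        # check=True raises CalledProcessError if FFmpeg reports a failure
        subprocess.run(build_ffmpeg_command(src, out_dir), check=True)
```

Calling `convert_all("recordings", "normalized")` would process the folder layout used in this article.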
9. Checking Audio Consistency
Before segmentation begins, all recordings should be verified.
Important checks include:
- sample rate
- channel count
- bit depth
- volume consistency
- absence of clipping
A script can generate a report listing the properties of each file.
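A report of this kind might look like the following Python sketch. It is an illustration under stated assumptions rather than the project's own validator: the function names are hypothetical, only uncompressed PCM WAV files are supported (Python's `wave` module raises an error for other encodings), and a peak at full scale is used as a rough indicator of clipping.

```python
import array
import sys
import wave
from pathlib import Path


def inspect_wav(path):
    """Return format properties and the peak level of one PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        info = {
            "file": Path(path).name,
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "bit_depth": w.getsampwidth() * 8,
            "peak": None,
            "clipped": None,
        }
        if w.getsampwidth() == 2:  # the peak is only measured for 16-bit audio
            samples = array.array("h", w.readframes(w.getnframes()))
            if sys.byteorder == "big":
                samples.byteswap()  # WAV data is little-endian
            peak = max((abs(s) for s in samples), default=0)
            info["peak"] = peak
            info["clipped"] = peak >= 32767  # heuristic: full scale reached
    return info


def matches_target(info, sample_rate=44100, channels=1, bit_depth=16):
    """True if the file uses the recommended recording format."""
    return (info["sample_rate"] == sample_rate
            and info["channels"] == channels
            and info["bit_depth"] == bit_depth)


def report(folder):
    """Print one line per WAV file in folder."""
    for path in sorted(Path(folder).glob("*.wav")):
        info = inspect_wav(path)
        ok = matches_target(info) and not info["clipped"]
        print(f'{info["file"]}: {info["sample_rate"]} Hz, '
              f'{info["channels"]} ch, {info["bit_depth"]}-bit, '
              f'peak {info["peak"]}, {"OK" if ok else "CHECK"}')
```

Calling `report("recordings")` lists every file and marks those that deviate from the recommended format or reach full scale. Comparing the peak values across files also gives a first impression of volume consistency.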
10. Preparing for Diphone Segmentation
After normalization, the recordings are ready for segmentation.
Normalized word recordings
↓
IPA transcription
↓
Phoneme sequence
↓
Diphone boundaries
↓
Audio slicing
Each diphone is extracted from the recorded word audio and saved as a separate file.
Example for the word দিশা (disha), where # marks the silence at a word boundary:
#-d d-i i-ʃ ʃ-a a-#
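Deriving the diphone list from a phoneme sequence is mechanical, as the following Python sketch shows. The function names are hypothetical, and the ASCII mapping (`#` to `sil`, `ʃ` to `sh`) is an assumption inferred from file names such as sil-d.wav and i-sh.wav used elsewhere in this article; a full system would need a mapping for every IPA symbol in its inventory.

```python
# Assumed ASCII replacements for symbols that are awkward in file names
ASCII_NAMES = {"#": "sil", "ʃ": "sh"}


def diphones(phonemes, boundary="#"):
    """List the diphones of a word, padding both ends with the boundary symbol."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def diphone_filenames(phonemes):
    """Map each diphone of a word to an ASCII-safe WAV file name."""
    padded = ["#"] + list(phonemes) + ["#"]
    names = [ASCII_NAMES.get(p, p) for p in padded]
    return [f"{a}-{b}.wav" for a, b in zip(names, names[1:])]


print(diphones(["d", "i", "ʃ", "a"]))
# ['#-d', 'd-i', 'i-ʃ', 'ʃ-a', 'a-#']
print(diphone_filenames(["d", "i", "ʃ", "a"]))
# ['sil-d.wav', 'd-i.wav', 'i-sh.wav', 'sh-a.wav', 'a-sil.wav']
```

A word with n phonemes always yields n + 1 diphones, because each boundary contributes one transition.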
11. File Organization
A clean directory structure simplifies diphone management.
project/
├── recordings/
│ ├── 001_disha.wav
│ ├── 002_manu.wav
│
├── normalized/
│ ├── disha.wav
│
├── diphone/
│ ├── sil-d.wav
│ ├── d-i.wav
│ ├── i-sh.wav
│
└── tools/
    ├── validator.php
    └── analyzer.php
12. Common Recording Problems
Several issues commonly appear during recording sessions.
- varying microphone distance
- background noise
- inconsistent speaking speed
- clipped audio peaks
- inconsistent vowel length
13. Conclusion
Recording and normalization form the foundation of a high-quality diphone database for Bishnupriya Manipuri speech synthesis.
Carefully recorded and normalized source words allow reliable segmentation, which in turn produces stable diphone audio units.
These diphones can then be combined by the TTS engine to produce intelligible and natural speech.
Next Article
Article 7: Automatic Diphone Segmentation from Dictionary Audio
The next article explains how recorded word audio can be automatically segmented into diphones using IPA alignment and boundary detection.