Recording and Normalizing Diphone Audio for Bishnupriya Manipuri TTS

Abstract. The quality of a diphone-based text-to-speech system depends heavily on the quality and consistency of the recorded speech database. This article explains how to record source words, maintain consistent recording conditions, normalize audio files, and prepare them for diphone segmentation. These procedures ensure that the final speech output sounds natural and stable.

1. Introduction

A diphone TTS system relies on a database of short speech segments extracted from recorded words. If the recordings are inconsistent in volume, tone, or timing, the resulting speech synthesis will sound unnatural.

Therefore, careful recording and normalization are essential.

Recording
   ↓
Audio cleanup
   ↓
Normalization
   ↓
Segmentation
   ↓
Diphone database

2. Recording Environment

To produce high-quality recordings, the recording environment must be controlled.

Recommended setup:

A quiet room with minimal echo (soft furnishings help absorb reflections)
A pop filter placed between the speaker and the microphone
The microphone mounted on a stand in a fixed position
A microphone distance of 15–20 cm from the speaker

Avoid moving closer or farther from the microphone during the recording session.

3. Microphone and Recording Software

A professional studio microphone is ideal, but many USB microphones are sufficient for linguistic recording.

Common recording software includes:

Audacity (free, cross-platform)
Ocenaudio (free)
Adobe Audition (commercial)

These programs allow precise editing and waveform inspection.

4. Recording Format

All recordings must use the same technical format to avoid problems during concatenation.

Parameter      Recommended value
Sample rate    44100 Hz
Channels       Mono
Bit depth      16-bit PCM
File format    WAV
Example file name:
001_disha.wav
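
A small script can verify that a file matches this format before it enters the database. The following is a minimal sketch using Python's standard wave module; the target values mirror the table above:

```python
import wave

# Target format from the table above (sampwidth is in bytes: 2 bytes = 16-bit)
TARGET = {"framerate": 44100, "nchannels": 1, "sampwidth": 2}

def format_problems(path):
    """Return a list of mismatches between a WAV file and the target format."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != TARGET["framerate"]:
            problems.append(f"sample rate {w.getframerate()} Hz (expected 44100 Hz)")
        if w.getnchannels() != TARGET["nchannels"]:
            problems.append(f"{w.getnchannels()} channels (expected mono)")
        if w.getsampwidth() != TARGET["sampwidth"]:
            problems.append(f"{8 * w.getsampwidth()}-bit samples (expected 16-bit)")
    return problems
```

A file that passes returns an empty list; anything in the list means the file must be re-exported before segmentation.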

5. Recording Strategy

Rather than recording isolated phonemes or diphones, it is better to record complete words.

Word-level recordings provide natural coarticulation, which improves diphone quality.

Example recording list:
দিশা
মানু
কথা
অক্ষর
অগ্নি
অনুরোধ
উজ্জ্বল
ঔষধ

These words contain a variety of phoneme transitions that can later be segmented into diphones.

6. Removing Silence

After recording, each audio file should be trimmed so that it contains minimal silence at the beginning and end.

Excess silence causes timing problems in synthesized speech.

Before trimming:
[ silence ] [ word audio ] [ silence ]

After trimming:
[ word audio ]
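
The trimming step can be sketched in Python. This assumes 16-bit mono samples already decoded into a list of integers; the threshold of 500 (roughly 1.5% of full scale) is an arbitrary assumption and should be tuned to the noise floor of the actual recordings:

```python
def trim_silence(samples, threshold=500):
    """Cut leading and trailing near-silence from a list of 16-bit sample values.

    Any sample whose absolute amplitude exceeds `threshold` counts as speech.
    """
    first = last = None
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if first is None:
                first = i
            last = i
    if first is None:          # the whole file is below the threshold
        return []
    return samples[first:last + 1]
```

In practice a few milliseconds of padding are often kept on each side so plosives at word edges are not clipped.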

7. Volume Normalization

Recordings often vary in loudness. Normalization ensures that all audio files have consistent volume levels.

Normalization can be performed using audio software or command-line tools.

Example FFmpeg command applying a fixed gain:
ffmpeg -i input.wav -af "volume=1.5" output.wav

Note that the volume filter only applies a constant gain. For actual loudness normalization, FFmpeg's loudnorm filter (EBU R128) adjusts each file toward a common target:
ffmpeg -i input.wav -af loudnorm output.wav

Alternatively, RMS normalization can be applied to equalize perceived loudness.
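
RMS normalization can be sketched as follows. The target RMS of 3000 on the 16-bit scale is an arbitrary example value, not a standard:

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_rms(samples, target=3000.0):
    """Scale 16-bit samples so their RMS matches `target`, clamping to the valid range."""
    gain = target / rms(samples)
    return [max(-32768, min(32767, round(s * gain))) for s in samples]
```

The clamp guards against clipping when a quiet file needs a large gain; if many samples hit the clamp, the target is too high for that recording.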

8. Batch Normalization Using FFmpeg

Large audio collections can be normalized automatically using batch scripts.

Example Windows batch script (inside a .bat file; the doubled %%f becomes a single %f if the loop is typed directly at the command prompt):
for %%f in (*.wav) do (
  ffmpeg -i "%%f" -ar 44100 -ac 1 -c:a pcm_s16le "fixed_%%f"
)

This ensures that all files have consistent sample rate, channels, and bit depth.

9. Checking Audio Consistency

Before segmentation begins, all recordings should be verified.

Important checks include:

Identical sample rate, channel count, and bit depth across all files
No clipping (samples hitting the maximum amplitude)
Roughly consistent loudness from file to file
No long stretches of leading or trailing silence

A script can generate a report listing the properties of each file.
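
As a sketch of such a report, Python's standard wave and pathlib modules are sufficient:

```python
import wave
from pathlib import Path

def audio_report(directory):
    """Collect (name, sample rate, channels, bit depth, seconds) for every WAV file."""
    rows = []
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            seconds = w.getnframes() / w.getframerate()
            rows.append((path.name, w.getframerate(), w.getnchannels(),
                         8 * w.getsampwidth(), round(seconds, 2)))
    return rows
```

Any row that differs from the others in rate, channels, or bit depth points to a file that must be re-converted before segmentation.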

10. Preparing for Diphone Segmentation

After normalization, the recordings are ready for segmentation.

Normalized word recordings
      ↓
IPA transcription
      ↓
Phoneme sequence
      ↓
Diphone boundaries
      ↓
Audio slicing

Each diphone is extracted from the recorded word audio and saved as a separate file.

Example diphone extraction:

Word: দিশা
IPA: diʃa
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#
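
Generating these labels from a phoneme sequence is a simple pairwise walk. This sketch uses "#" as the word-boundary marker, matching the example above:

```python
def diphone_labels(phonemes, boundary="#"):
    """List the diphone labels for a phoneme sequence, with boundary markers."""
    seq = [boundary] + list(phonemes) + [boundary]
    # Each adjacent pair of symbols becomes one diphone label
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

# diphone_labels(["d", "i", "ʃ", "a"]) → ["#-d", "d-i", "i-ʃ", "ʃ-a", "a-#"]
```

A word with n phonemes therefore yields n + 1 diphones, since both word boundaries are included.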

11. File Organization

A clean directory structure simplifies diphone management.

project/
 ├── recordings/
 │    ├── 001_disha.wav
 │    ├── 002_manu.wav
 │
 ├── normalized/
 │    ├── disha.wav
 │
 ├── diphone/
 │    ├── sil-d.wav
 │    ├── d-i.wav
 │    ├── i-sh.wav
 │
 └── tools/
      ├── validator.php
      ├── analyzer.php

12. Common Recording Problems

Several issues commonly appear during recording sessions:

Background noise from fans, traffic, or electrical hum
Clipping caused by speaking too loudly or too close to the microphone
Loudness drifting between sessions
Changing microphone distance, which alters timbre
Audible breaths and mouth clicks at word boundaries

A consistent recording style is more important than perfect studio quality. Consistency ensures smooth diphone concatenation.

13. Conclusion

Recording and normalization form the foundation of a high-quality diphone database for Bishnupriya Manipuri speech synthesis.

Carefully recorded and normalized source words allow reliable segmentation, which in turn produces stable diphone audio units.

These diphones can then be combined by the TTS engine to produce intelligible and natural speech.

Next Article

Article 7
Automatic Diphone Segmentation from Dictionary Audio

The next article explains how recorded word audio can be automatically segmented into diphones using IPA alignment and boundary detection.