Audio recording, normalization, preparation, and practical workflow for diphone extraction
Protocol
This page documents the practical recording workflow used for Bishnupriya Manipuri speech resource development, including setup, source word selection, normalization, segmentation preparation, and quality control for diphone extraction.
The quality of a diphone-based TTS system depends heavily on the quality and consistency of the source recordings. If recordings vary in loudness, speaking distance, background noise, timing, or technical format, the extracted diphones will not match one another, and the mismatches become audible when units from different recordings are joined.
Whole-word recordings preserve natural coarticulation between sounds. This makes extracted diphones sound more natural than isolated transitions recorded out of context.
Word recording → IPA conversion → Phoneme sequence → Diphone sequence → Segmentation → Reusable diphone files
Word-level recording is therefore the preferred source method for low-resource TTS development.
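The step from phoneme sequence to diphone sequence can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name is hypothetical, the `sil` boundary label and hyphenated naming are inferred from the `sil-d.wav` example on this page, and the phoneme sequence shown is assumed for demonstration only.

```python
def phonemes_to_diphones(phonemes, boundary="sil"):
    """Return diphone labels for one word, padded with a boundary unit."""
    padded = [boundary, *phonemes, boundary]
    # Each diphone spans one transition between neighbouring units.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme sequence, for illustration only.
print(phonemes_to_diphones(["d", "i", "ʃ", "a"]))
# ['sil-d', 'd-i', 'i-ʃ', 'ʃ-a', 'a-sil']
```

A word with n phonemes yields n + 1 diphones under this scheme, which is a convenient sanity check when pairing recordings with their transcriptions.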
Record in the quietest room possible with minimal outside noise, fan noise, or reverberation.
Keep the microphone in one consistent position and avoid changing angle or distance.
Speak clearly, steadily, and with consistent loudness across the full session.
| Parameter | Recommended Standard | Reason |
|---|---|---|
| Sample rate | 44100 Hz | Stable archival and processing standard |
| Channels | Mono | Preferred for speech-processing workflows |
| Bit depth | 16-bit PCM | Reliable, widely supported WAV format |
| File format | WAV | Best for lossless processing and segmentation |
Exact hardware can vary, but the recording conditions should be as stable as possible.
A consistent setup is more important than an expensive setup.
Recording should be guided by coverage needs, not random word choice. Source words should be selected so that together they cover the most useful diphones.
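One common way to build a coverage-driven word list is a greedy selection: repeatedly pick the candidate word that adds the most diphones not yet covered. The sketch below assumes each candidate word has already been converted to a set of diphone labels; the function name and the placeholder words are hypothetical, and the page does not state which selection method the project actually uses.

```python
def select_words(candidates, target_diphones):
    """Greedy selection: pick the word adding the most uncovered diphones.

    candidates: dict mapping word -> iterable of diphone labels it contains.
    Returns (chosen words in order, set of target diphones left uncovered).
    """
    remaining = set(target_diphones)
    pool = {word: set(labels) for word, labels in candidates.items()}
    chosen = []
    while remaining and pool:
        best = max(pool, key=lambda w: len(pool[w] & remaining))
        gain = pool.pop(best) & remaining
        if not gain:
            break  # no remaining word adds any new coverage
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Placeholder words and labels, for illustration only.
words = {
    "word_a": ["sil-a", "a-b", "b-sil"],
    "word_b": ["sil-a", "a-sil"],
    "word_c": ["sil-c", "c-sil"],
}
chosen, missing = select_words(words, {"sil-a", "a-b", "b-sil", "sil-c", "c-sil"})
```

Any diphones returned in the second value are not covered by the candidate list, which signals that more source words are needed before recording starts.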
Source recordings should use clear and consistent names. Keep the filenames of source recordings distinct from the safe filenames used for diphone outputs.
| Type | Example | Use |
|---|---|---|
| Raw source recording | 001_disha.wav | Original recorded word file |
| Normalized recording | norm_disha.wav | Prepared source for segmentation |
| Diphone output | sil-d.wav | Reusable TTS unit |
Each recording should be trimmed so that it contains only a small amount of silence at the beginning and end. Excess silence causes segmentation and playback problems.
Before cleanup: [long silence] + word + [long silence]
After cleanup: [short silence] + word + [short silence]
Trimming should be careful: remove unnecessary silence, but do not cut off real speech onset or release.
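A simple amplitude-threshold trim can be sketched with the Python standard library. This is only an illustration under stated assumptions: the input is 16-bit mono WAV, and the `threshold` and padding values are arbitrary starting points, not values taken from this project. A fixed threshold can clip quiet onsets such as weak fricatives, so trimmed files should still be checked by ear.

```python
import array
import sys
import wave


def trim_silence(samples, threshold=500, pad=441):
    """Drop leading/trailing samples below `threshold`, keeping `pad` samples per side."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # nothing above threshold: return an empty sequence
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(samples))
    return samples[start:end]


def trim_wav(src_path, dst_path, threshold=500, pad_ms=10):
    """Read a 16-bit mono WAV, trim it, and write the result to a new file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("expected 16-bit mono WAV")
        samples = array.array("h")
        samples.frombytes(src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    pad = params.framerate * pad_ms // 1000
    out = array.array("h", trim_silence(samples, threshold, pad))
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count in the header is corrected on write
        dst.writeframes(out.tobytes())
```

Writing to a separate destination path keeps the untrimmed original available in case the threshold turns out to be too aggressive.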
After recording, files should be normalized into one standard technical format. This makes later segmentation and playback much more reliable.
When recordings are not already in the correct format, FFmpeg can be used for batch conversion.
if not exist normalized mkdir normalized
for %%f in (*.wav) do ( ffmpeg -i "%%f" -ar 44100 -ac 1 -c:a pcm_s16le "normalized\%%f" )
Write converted files to a separate output set instead of re-running the conversion on its own output; otherwise each pass stacks another prefix onto the filenames and the source folder fills with near-duplicates.
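The same conversion can also be driven from Python, which makes it easy to keep raw and converted files in separate folders. The sketch below is one possible arrangement, not the project's script: the function names and folder layout are hypothetical, and it assumes `ffmpeg` is available on the system path. The FFmpeg options match the target format used on this page, plus `-n`, which tells FFmpeg never to overwrite an existing output.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst):
    """Build the FFmpeg command for 44100 Hz, mono, 16-bit PCM output."""
    return [
        "ffmpeg", "-n",        # -n: never overwrite an existing output file
        "-i", str(src),
        "-ar", "44100",        # sample rate
        "-ac", "1",            # mono
        "-c:a", "pcm_s16le",   # 16-bit PCM
        str(dst),
    ]


def convert_folder(raw_dir, out_dir):
    """Convert every WAV in raw_dir into out_dir, leaving the originals untouched."""
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(raw_dir.glob("*.wav")):
        dst = out_dir / src.name
        if dst.exists():
            continue  # already converted in an earlier run
        subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Because existing outputs are skipped, the conversion can be re-run after adding new recordings without touching files that were already processed.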
Before extracting diphones, the recordings should be checked for technical and perceptual consistency.
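The technical half of this check can be automated. The following sketch uses Python's standard `wave` module to compare each file against the target format (44100 Hz, mono, 16-bit) and to flag suspicious peak levels. The function name and the `clip_level` and `min_peak` thresholds are illustrative assumptions; perceptual consistency still has to be judged by listening.

```python
import array
import sys
import wave

# Target format from the recording standard: 44100 Hz, mono, 16-bit.
EXPECTED = {"framerate": 44100, "nchannels": 1, "sampwidth": 2}


def check_wav(path, clip_level=32767, min_peak=1000):
    """Return a list of problems found in one WAV file (empty list = passed)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        params = wav.getparams()
        for name, want in EXPECTED.items():
            got = getattr(params, name)
            if got != want:
                problems.append(f"{name}: expected {want}, got {got}")
        if params.sampwidth != 2:
            return problems  # the peak check below assumes 16-bit samples
        samples = array.array("h")
        samples.frombytes(wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak >= clip_level:
        problems.append("possible clipping")
    elif peak < min_peak:
        problems.append("very low level")
    return problems
```

Running this over the whole normalized folder and listing only the files with a non-empty result gives a short review list before segmentation begins.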
Once the recordings are normalized and checked, they are ready for segmentation into diphones. At that point, each file needs:
Word audio + IPA + phoneme sequence + diphone sequence = segmentation-ready item
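One way to keep these four pieces together is a small record per source word. The sketch below is a hypothetical structure, not the project's data format: the class and field names are assumptions, and the count check assumes diphones are padded with one boundary unit on each side, as the `sil-d.wav` example suggests.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SegmentationItem:
    audio_path: Path   # normalized word recording
    ipa: str           # IPA transcription of the word
    phonemes: list     # phoneme sequence
    diphones: list     # diphone sequence derived from the phonemes

    def problems(self):
        """Return consistency problems for this item (empty list = ready)."""
        found = []
        if not Path(self.audio_path).is_file():
            found.append("missing audio file")
        if not self.phonemes:
            found.append("empty phoneme sequence")
        # With a boundary unit on each side, n phonemes give n + 1 diphones.
        if len(self.diphones) != len(self.phonemes) + 1:
            found.append("diphone count does not match phoneme count")
        return found
```

An item whose `problems()` list is empty has its audio in place and a diphone sequence of the expected length, so it can be passed on to segmentation.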
Prepare word list → Record source words → Trim silence → Normalize format → Check quality → Pair with IPA and phonemes → Segment into diphones → Validate filenames and coverage
| Problem | Effect | Prevention |
|---|---|---|
| Changing microphone distance | Uneven loudness and tone | Keep one fixed recording position |
| Mixed technical formats | Inconsistent processing results | Normalize everything before segmentation |
| Too much silence | Poor segmentation and awkward playback | Trim carefully before segmentation |
| Overwriting files repeatedly | Filename confusion and clutter | Use clean output folders |
| Recording random words | Poor diphone coverage | Use a coverage-driven word list |