Recording Protocol

Audio recording, normalization, preparation, and practical workflow for diphone extraction

This page documents the practical recording workflow used for Bishnupriya Manipuri speech resource development, including setup, source word selection, normalization, segmentation preparation, and quality control for diphone extraction.

About this protocol. The article series explains why audio consistency matters in diphone-based TTS. This page provides the practical companion workflow: how to record, prepare, normalize, and organize source audio so it can be reused reliably across pronunciation tools and speech synthesis.

1. Purpose of the Recording Protocol

The quality of a diphone-based TTS system depends heavily on the quality and consistency of the source recordings. If recordings vary in loudness, speaking distance, background noise, timing, or technical format, diphones cut from different recordings will not join smoothly, and the synthesized output sounds uneven.

Main goals:

capture clean word-level source audio
maintain one consistent technical standard
prepare audio for segmentation
reduce mismatch between recordings
support stable diphone extraction and playback

2. Why Record Whole Words Instead of Raw Diphones?

Whole-word recordings preserve natural coarticulation between sounds. This makes extracted diphones sound more natural than isolated transitions recorded out of context.

Word recording
   ↓
IPA conversion
   ↓
Phoneme sequence
   ↓
Diphone sequence
   ↓
Segmentation
   ↓
Reusable diphone files

Word-level recording is therefore the preferred source method for low-resource TTS development.
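The phoneme-to-diphone step in the chain above can be sketched in a few lines of Python. This is a minimal illustration, not the archive's own tool: the function name, the `sil` boundary label (borrowed from the `sil-d.wav` example filename), and the phoneme sequence used for "disha" are all assumptions.

```python
def phonemes_to_diphones(phonemes, boundary="sil"):
    """Return the diphone names for one word, with a boundary unit on each side."""
    padded = [boundary, *phonemes, boundary]
    # Each diphone names the transition between two adjacent units.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme sequence for "disha"; the real IPA analysis may differ.
print(phonemes_to_diphones(["d", "i", "ʃ", "a"]))
# ['sil-d', 'd-i', 'i-ʃ', 'ʃ-a', 'a-sil']
```

A word with n phonemes yields n + 1 diphones under this scheme, because the word-initial and word-final boundaries each contribute one transition.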

3. Recording Environment

Quiet Room

Record in the quietest room possible with minimal outside noise, fan noise, or reverberation.

Stable Microphone Position

Keep the microphone in one consistent position and avoid changing angle or distance.

Consistent Voice Delivery

Speak clearly, steadily, and with consistent loudness across the full session.

4. Recommended Technical Format

Parameter	Recommended Standard	Reason
Sample rate	44100 Hz	Stable archival and processing standard
Channels	Mono	Preferred for speech-processing workflows
Bit depth	16-bit PCM	Reliable, widely supported WAV format
File format	WAV	Best for lossless processing and segmentation
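A file can be checked against the table above with Python's standard `wave` module, which reads uncompressed PCM WAV files only. The sketch below is one possible check; the function names and the `EXPECTED` dictionary are illustrative, not part of any existing tool.

```python
import wave

# Recommended standard from the table: 44100 Hz, mono, 16-bit (2 bytes per sample).
EXPECTED = {"sample_rate": 44100, "channels": 1, "sample_width_bytes": 2}


def read_format(path):
    """Read the basic PCM parameters from a WAV file header."""
    with wave.open(str(path), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def format_problems(path):
    """List every parameter that differs from the recommended standard."""
    actual = read_format(path)
    return [
        f"{key}: expected {want}, found {actual[key]}"
        for key, want in EXPECTED.items()
        if actual[key] != want
    ]
```

An empty list means the file already matches the standard; anything else lists the parameters that still need conversion.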

5. Recommended Recording Setup

Exact hardware can vary, but the recording conditions should be as stable as possible.

Recommended practice:

use one microphone for a full session
keep mouth-to-mic distance stable
record at the same gain setting throughout the session
avoid clipping
avoid moving while speaking

A consistent setup is more important than an expensive setup.

6. Source Word Selection Strategy

Recording should be guided by coverage needs, not random word choice. Source words should be selected so that together they cover the most useful diphones.

Good Source Words Should Include

common consonant-vowel transitions
common vowel-consonant transitions
word-initial boundary coverage
word-final boundary coverage
cluster-bearing words

Useful Example Types

simple open syllable words
words with nasal environments
words with affricates and fricatives
learned forms with clusters
high-frequency dictionary words
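Coverage-driven selection can be approximated with a greedy pass: repeatedly pick the candidate word that adds the most diphones still missing from the target set. The sketch below assumes each candidate already has a phoneme sequence; the function names, the `sil` boundary label, and the toy candidates are placeholders, not entries from the actual word list.

```python
def diphones_of(phonemes, boundary="sil"):
    """Set of diphone names in one word, including boundary transitions."""
    padded = [boundary, *phonemes, boundary]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def select_words(candidates, target):
    """Greedy selection.

    candidates: dict mapping word -> phoneme list
    target:     set of diphone names that should be covered
    Returns (chosen words in pick order, diphones left uncovered).
    """
    coverage = {word: diphones_of(ph) for word, ph in candidates.items()}
    remaining = set(target)
    chosen = []
    while remaining and coverage:
        # Pick the word contributing the most still-missing diphones.
        best = max(coverage, key=lambda w: len(coverage[w] & remaining))
        gain = coverage[best] & remaining
        if not gain:
            break  # no candidate covers what is left
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Toy candidates with hypothetical phoneme sequences.
candidates = {"disha": ["d", "i", "ʃ", "a"], "di": ["d", "i"], "ma": ["m", "a"]}
target = set().union(*(diphones_of(p) for p in candidates.values()))
print(select_words(candidates, target))
# (['disha', 'ma', 'di'], set())
```

Greedy selection does not guarantee the smallest possible word list, but it gives a short list quickly and reports which diphones no candidate can supply.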

7. File Naming During Recording

Source recordings should use clear and consistent names. Keep source recording filenames distinct from the safe filenames used for diphone output.

Type	Example	Use
Raw source recording	001_disha.wav	Original recorded word file
Normalized recording	norm_disha.wav	Prepared source for segmentation
Diphone output	sil-d.wav	Reusable TTS unit

8. Silence Trimming and Cleanup

Each recording should be trimmed so that it contains only a small amount of silence at the beginning and end. Excess silence causes segmentation and playback problems.

Before cleanup:
[long silence] + word + [long silence]

After cleanup:
[short silence] + word + [short silence]

Trimming should be careful: remove unnecessary silence, but do not cut off real speech onset or release.
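One simple way to automate trimming is an amplitude threshold with a safety margin on each side. The sketch below assumes 16-bit mono PCM input; the `threshold` and `pad_ms` defaults are illustrative values, not settings documented for this project. A fixed threshold can miss quiet onsets such as weak fricatives, so results should still be checked by ear.

```python
import array
import sys
import wave


def trim_silence(src, dst, threshold=500, pad_ms=20):
    """Trim quiet edges from a 16-bit mono WAV, keeping pad_ms of margin per side.

    Returns the (start, end) sample indices that were kept.
    """
    with wave.open(str(src), "rb") as wav:
        rate = wav.getframerate()
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Indices of samples loud enough to count as speech.
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        raise ValueError("no sample exceeds the threshold")

    pad = int(rate * pad_ms / 1000)
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(samples))

    kept = samples[start:end]
    if sys.byteorder == "big":
        kept.byteswap()
    with wave.open(str(dst), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(kept.tobytes())
    return start, end
```

Writing to a separate `dst` path keeps the raw recording untouched, so a trim that cuts too close can be redone with a lower threshold or a longer margin.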

9. Audio Normalization

After recording, files should be normalized into one standard technical format. This makes later segmentation and playback much more reliable.

Normalization checklist:

sample rate converted to 44100 Hz
stereo converted to mono
bit depth standardized to 16-bit PCM
peak levels kept consistent across files, with headroom below full scale
clipping checked and avoided
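The level and clipping items in the checklist can be checked numerically. The sketch below reports the peak level in dB relative to full scale (dBFS) and counts samples that sit at the 16-bit limits. It assumes 16-bit PCM input; the function names are illustrative, and no particular target level is implied.

```python
import array
import math
import sys
import wave

FULL_SCALE = 32768  # magnitude of the most negative 16-bit sample


def _read_samples(path):
    """Load all samples of a 16-bit PCM WAV file as signed integers."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples


def peak_dbfs(path):
    """Peak level in dBFS; 0.0 is full scale, silence returns -inf."""
    peak = max((abs(s) for s in _read_samples(path)), default=0)
    return -math.inf if peak == 0 else 20 * math.log10(peak / FULL_SCALE)


def full_scale_count(path):
    """Number of samples at the 16-bit limits, a common sign of clipping."""
    return sum(1 for s in _read_samples(path) if s >= 32767 or s <= -32768)
```

Comparing `peak_dbfs` across a session shows whether recording levels drifted, and a nonzero `full_scale_count` marks a file worth re-recording rather than repairing.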

10. Batch Conversion Example

When recordings are not already in the correct format, FFmpeg can be used for batch conversion.

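REM Windows batch file. At an interactive prompt, use %f instead of %%f.
if not exist normalized mkdir normalized
for %%f in (*.wav) do (
  ffmpeg -i "%%f" -ar 44100 -ac 1 -c:a pcm_s16le "normalized\%%~nxf"
)

Write converted files to a separate output folder rather than adding a prefix in place. A prefix-based loop rerun in the same folder picks up its own output and produces stacked names such as fixed_fixed_001_disha.wav.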
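The same conversion can also be driven from Python, which makes it easier to enforce a separate output folder. The sketch below reuses the FFmpeg options from the batch example and assumes `ffmpeg` is available on the system path; the function names and folder layout are illustrative.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src, out_dir):
    """Build the FFmpeg call that writes a 44100 Hz mono 16-bit PCM copy into out_dir."""
    dst = Path(out_dir) / Path(src).name
    return ["ffmpeg", "-i", str(src),
            "-ar", "44100", "-ac", "1", "-c:a", "pcm_s16le", str(dst)]


def convert_folder(in_dir, out_dir):
    """Convert every .wav in in_dir, never writing back into the source folder."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    if in_dir.resolve() == out_dir.resolve():
        raise ValueError("output folder must differ from the input folder")
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(in_dir.glob("*.wav")):
        if (out_dir / src.name).exists():
            continue  # already converted in an earlier run
        subprocess.run(ffmpeg_command(src, out_dir), check=True)
        converted.append(src.name)
    return converted
```

Because existing outputs are skipped, the script can be rerun after new recordings are added without touching files that were already converted.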

11. Quality Control Before Segmentation

Before extracting diphones, the recordings should be checked for technical and perceptual consistency.

Technical Checks

sample rate
channels
bit depth
file integrity

Speech Checks

clear pronunciation
consistent speaking level
no clipped onset or ending
low background noise

Workflow Checks

correct file naming
matching IPA target
matching word list entry
ready for segmentation

12. Segmentation Preparation

Once the recordings are normalized and checked, they are ready for segmentation into diphones. At that point, each file needs:

the source word
its IPA representation
its phoneme sequence
its expected diphone sequence

Word audio
   + IPA
   + phoneme sequence
   + diphone sequence
   = segmentation-ready item
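A segmentation-ready item can be represented as a small record with a completeness check. The sketch below is one possible shape, not the archive's actual data format: the class name, field names, and sample values are hypothetical, and the count check assumes one boundary unit on each side of the word.

```python
from dataclasses import dataclass


@dataclass
class SegmentationItem:
    audio_file: str
    word: str
    ipa: str
    phonemes: list
    diphones: list

    def problems(self):
        """Return a list of reasons this item is not ready for segmentation."""
        issues = []
        for name in ("audio_file", "word", "ipa", "phonemes", "diphones"):
            if not getattr(self, name):
                issues.append(f"missing {name}")
        # With a boundary unit on each side, n phonemes give n + 1 diphones.
        if self.phonemes and self.diphones and len(self.diphones) != len(self.phonemes) + 1:
            issues.append("diphone count does not match phoneme count")
        return issues


# Hypothetical values for illustration only.
item = SegmentationItem(
    audio_file="norm_disha.wav",
    word="disha",
    ipa="diʃa",
    phonemes=["d", "i", "ʃ", "a"],
    diphones=["sil-d", "d-i", "i-ʃ", "ʃ-a", "a-sil"],
)
print(item.problems())
# []
```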

13. Recommended Workflow Summary

Prepare word list
   ↓
Record source words
   ↓
Trim silence
   ↓
Normalize format
   ↓
Check quality
   ↓
Pair with IPA and phonemes
   ↓
Segment into diphones
   ↓
Validate filenames and coverage
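The final validation step can be sketched as a comparison between the diphones the word list should produce and the files actually present. The example assumes each diphone file is named after its diphone plus `.wav`, as in `sil-d.wav`; if the archive's safe filename mapping renames IPA symbols, that mapping would need to be applied to the expected names first. The function name is illustrative.

```python
from pathlib import Path


def check_coverage(expected, diphone_dir):
    """Compare expected diphone names with the .wav files in diphone_dir.

    Returns (missing, unexpected) as two sorted lists of diphone names.
    """
    present = {path.stem for path in Path(diphone_dir).glob("*.wav")}
    wanted = set(expected)
    return sorted(wanted - present), sorted(present - wanted)
```

Missing names point to words that still need recording or segmenting; unexpected names usually indicate a filename typo or a leftover file from an earlier run.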

14. Common Recording Problems

Problem	Effect	Prevention
Changing microphone distance	Uneven loudness and tone	Keep one fixed recording position
Mixed technical formats	Inconsistent processing results	Normalize everything before segmentation
Too much silence	Poor segmentation and awkward playback	Trim carefully before segmentation
Overwriting files repeatedly	Filename confusion and clutter	Use clean output folders
Recording random words	Poor diphone coverage	Use a coverage-driven word list

15. Related Archive Pages

Article 6

Read the research chapter on recording and normalizing diphone audio.

Diphone Inventory

Review inventory design, safe filename mapping, and recording priorities.

Resources

Return to the broader datasets/resources overview.

Protocol note. This page should gradually grow into a fuller operational guide, including sample recording sheets, batch normalization scripts, validation checklists, and downloadable protocol documents.

Bishnupriya Manipuri Research Archive

Language, linguistics, dictionary, IPA, phonemes, diphones, and speech technology
