Recording Protocol

Audio recording, normalization, preparation, and practical workflow for diphone extraction

This page documents the practical recording workflow used for Bishnupriya Manipuri speech resource development, including setup, source word selection, normalization, segmentation preparation, and quality control for diphone extraction.

About this protocol. The article series explains why audio consistency matters in diphone-based TTS. This page provides the practical companion workflow: how to record, prepare, normalize, and organize source audio so it can be reused reliably across pronunciation tools and speech synthesis.

1. Purpose of the Recording Protocol

The quality of a diphone-based TTS system depends heavily on the quality and consistency of the source recordings. If recordings vary in loudness, speaking distance, background noise, timing, or technical format, the diphones cut from them will not match one another, and the joins between units become audible in synthesized speech.

Main goals:
  • capture clean word-level source audio
  • maintain one consistent technical standard
  • prepare audio for segmentation
  • reduce mismatch between recordings
  • support stable diphone extraction and playback

2. Why Record Whole Words Instead of Raw Diphones?

Whole-word recordings preserve natural coarticulation between sounds. This makes extracted diphones sound more natural than isolated transitions recorded out of context.

Word recording
   ↓
IPA conversion
   ↓
Phoneme sequence
   ↓
Diphone sequence
   ↓
Segmentation
   ↓
Reusable diphone files
  

Word-level recording is therefore the preferred source method for low-resource TTS development.
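The phoneme-to-diphone step in the chain above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name `to_diphones`, the `sil` boundary label (chosen to match the `sil-d.wav` example used later on this page), and the romanized phoneme labels for "disha" are all assumptions.

```python
def to_diphones(phonemes, boundary="sil"):
    """Turn a phoneme list into adjacent-pair diphone labels.

    The sequence is padded with a boundary symbol on both sides so that
    word-initial and word-final transitions are included.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical phoneme labels for the word "disha"; a real lexicon
# or IPA converter would supply the actual sequence.
print(to_diphones(["d", "i", "sh", "a"]))
# ['sil-d', 'd-i', 'i-sh', 'sh-a', 'a-sil']
```

A word with n phonemes yields n + 1 diphones under this padding scheme, which is a useful sanity check later in the workflow.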

3. Recording Environment

Quiet Room

Record in the quietest room possible with minimal outside noise, fan noise, or reverberation.

Stable Microphone Position

Keep the microphone in one consistent position and avoid changing angle or distance.

Consistent Voice Delivery

Speak clearly, steadily, and with consistent loudness across the full session.

4. Recommended Technical Format

Parameter   | Recommended Standard | Reason
Sample rate | 44100 Hz             | Stable archival and processing standard
Channels    | Mono                 | Preferred for speech-processing workflows
Bit depth   | 16-bit PCM           | Reliable, widely supported WAV format
File format | WAV                  | Best for lossless processing and segmentation

5. Recommended Recording Setup

Exact hardware can vary, but the recording conditions should be as stable as possible.

Recommended practice:
  • use one microphone for a full session
  • keep mouth-to-mic distance stable
  • record at the same gain setting throughout the session
  • avoid clipping
  • avoid moving while speaking

A consistent setup is more important than an expensive setup.

6. Source Word Selection Strategy

Recording should be guided by coverage needs, not random word choice. Source words should be selected so that together they cover the most useful diphones.

Good Source Words Should Include

  • common consonant-vowel transitions
  • common vowel-consonant transitions
  • word-initial boundary coverage
  • word-final boundary coverage
  • cluster-bearing words

Useful Example Types

  • simple open syllable words
  • words with nasal environments
  • words with affricates and fricatives
  • learned forms with clusters
  • high-frequency dictionary words
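One way to make word selection coverage-driven is a greedy pass that repeatedly picks the word contributing the most diphones not yet covered. The sketch below assumes each candidate word already has a known diphone list; the function name `select_words` and the toy word entries are hypothetical, and a greedy pass gives a good but not necessarily minimal word list.

```python
def select_words(word_diphones, target=None):
    """Greedy coverage: pick words until the target diphones are covered.

    word_diphones maps each candidate word to its diphone labels.
    Returns (chosen_words, diphones_still_uncovered).
    """
    if target is not None:
        remaining = set(target)
    else:
        remaining = set().union(*word_diphones.values())
    chosen = []
    while remaining and word_diphones:
        # Sorting the keys makes tie-breaking deterministic (alphabetical).
        best = max(sorted(word_diphones),
                   key=lambda w: len(remaining & set(word_diphones[w])))
        gained = remaining & set(word_diphones[best])
        if not gained:
            break  # no candidate covers any of the remaining diphones
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Toy candidates with hypothetical diphone labels.
candidates = {
    "disha": ["sil-d", "d-i", "i-sh", "sh-a", "a-sil"],
    "dia": ["sil-d", "d-i", "i-a", "a-sil"],
    "asha": ["sil-a", "a-sh", "sh-a", "a-sil"],
}
words, uncovered = select_words(candidates)
print(words, uncovered)
```

Passing an explicit `target` set makes the function report which required diphones no candidate word can supply, which shows where the word list needs new entries.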

7. File Naming During Recording

Source recordings should use clear and consistent names. Keep the filenames of source recordings distinct from the filename-safe names used for diphone output files.

Type                 | Example        | Use
Raw source recording | 001_disha.wav  | Original recorded word file
Normalized recording | norm_disha.wav | Prepared source for segmentation
Diphone output       | sil-d.wav      | Reusable TTS unit

8. Silence Trimming and Cleanup

Each recording should be trimmed so that it contains only a small amount of silence at the beginning and end. Excess silence causes segmentation and playback problems.

Before cleanup:
[long silence] + word + [long silence]

After cleanup:
[short silence] + word + [short silence]

Trimming should be careful: remove unnecessary silence, but do not cut off real speech onset or release.
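A simple amplitude-threshold trim illustrates the idea. This sketch works on a list of 16-bit sample values (which could be read with Python's `wave` and `array` modules) and keeps a short margin around the detected speech so that quiet onsets and releases are not cut. The function name, the threshold of 500, and the 441-sample margin (10 ms at 44100 Hz) are assumptions to be tuned by ear, not values from this protocol.

```python
def trim_bounds(samples, threshold=500, pad=441):
    """Return (start, end) indices that keep the speech plus a small margin.

    samples   : sequence of signed 16-bit sample values (mono)
    threshold : absolute amplitude treated as "not silence" (assumed value)
    pad       : samples of silence kept on each side (assumed value)
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return 0, 0  # nothing above the threshold: treat the clip as silent
    start = max(loud[0] - pad, 0)
    end = min(loud[-1] + 1 + pad, len(samples))
    return start, end


# Synthetic clip: 1000 silent samples, 2000 loud samples, 1000 silent samples.
clip = [0] * 1000 + [4000] * 2000 + [0] * 1000
start, end = trim_bounds(clip)
trimmed = clip[start:end]
print(start, end, len(trimmed))
```

A fixed threshold can miss weak fricative onsets or trailing releases, so trimmed files are still worth checking by listening before segmentation.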

9. Audio Normalization

After recording, files should be normalized into one standard technical format. This makes later segmentation and playback much more reliable.

Normalization checklist:
  • sample rate converted to 44100 Hz
  • stereo converted to mono
  • bit depth standardized to 16-bit PCM
  • peak levels kept consistent across files, with headroom below full scale
  • clipping checked and avoided
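The level and clipping items in this checklist can be checked numerically. The following sketch assumes mono 16-bit samples; the helper names and the run length of three pinned samples are illustrative choices, and the clipping test is a heuristic, not a guarantee.

```python
import math

FULL_SCALE = 32768  # magnitude of the most negative 16-bit PCM sample


def peak_dbfs(samples):
    """Peak level in dB relative to full scale; -inf for an all-zero clip."""
    peak = max((abs(s) for s in samples), default=0)
    return 20 * math.log10(peak / FULL_SCALE) if peak else float("-inf")


def looks_clipped(samples, run=3):
    """Heuristic: True if `run` or more consecutive samples sit at the 16-bit limits."""
    count = 0
    for s in samples:
        count = count + 1 if s >= 32767 or s <= -32768 else 0
        if count >= run:
            return True
    return False


print(round(peak_dbfs([0, 16384, -8000]), 2))      # about -6.02 (half of full scale)
print(looks_clipped([0, 32767, 32767, 32767, 0]))  # True
```

Comparing `peak_dbfs` across a session is a quick way to spot files recorded noticeably louder or quieter than the rest.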

10. Batch Conversion Example

When recordings are not already in the correct format, FFmpeg can be used for batch conversion.

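rem Windows batch file (.bat). At an interactive prompt, use %f instead of %%f.
rem Output goes to a separate folder so source files are never overwritten.
if not exist normalized mkdir normalized
for %%f in (*.wav) do (
  ffmpeg -i "%%f" -ar 44100 -ac 1 -c:a pcm_s16le "normalized\%%f"
)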

A professional workflow should run conversion into a new output set rather than repeatedly overwriting files with stacked prefixes.
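The same conversion can be sketched in Python for setups where a batch file is inconvenient. The FFmpeg options are the ones used in the example above; the function names and the folder names are assumptions, and `ffmpeg` is assumed to be available on the system path.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, out_dir):
    """Build the FFmpeg argument list for one file: 44100 Hz, mono, 16-bit PCM."""
    dst = Path(out_dir) / Path(src).name
    return ["ffmpeg", "-i", str(src),
            "-ar", "44100", "-ac", "1", "-c:a", "pcm_s16le", str(dst)]


def convert_all(in_dir, out_dir):
    """Convert every .wav in in_dir into out_dir, leaving the originals untouched."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.wav")):
        if (Path(out_dir) / src.name).exists():
            continue  # already converted; re-running does not redo or rename files
        subprocess.run(build_ffmpeg_command(src, out_dir), check=True)


# Example (hypothetical folder names):
# convert_all("raw", "normalized")
```

Writing into a separate folder with unchanged filenames means the script can be re-run safely and never produces stacked prefixes.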

11. Quality Control Before Segmentation

Before extracting diphones, the recordings should be checked for technical and perceptual consistency.

Technical Checks

  • sample rate
  • channels
  • bit depth
  • file integrity
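These technical checks can be automated with Python's standard `wave` module, as in the sketch below. The expected values come from the recommended format on this page; the function names and the report wording are assumptions. The `wave` module reads uncompressed PCM WAV, so a file in another encoding is reported here as unreadable.

```python
import wave

EXPECTED = {"sample_rate": 44100, "channels": 1, "sample_width_bytes": 2}


def wav_format(path):
    """Read the header fields of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def format_problems(path):
    """List every mismatch with the expected format; an empty list means the file passes."""
    try:
        actual = wav_format(path)
    except (wave.Error, EOFError, OSError) as exc:
        # Covers missing, truncated, or non-PCM files (the integrity check).
        return [f"unreadable file: {exc}"]
    return [f"{key}: expected {want}, found {actual[key]}"
            for key, want in EXPECTED.items() if actual[key] != want]
```

Running `format_problems` over every file in the normalized folder and printing only non-empty results gives a short list of files that need attention.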

Speech Checks

  • clear pronunciation
  • consistent speaking level
  • no clipped onset or ending
  • low background noise

Workflow Checks

  • correct file naming
  • matching IPA target
  • matching word list entry
  • ready for segmentation

12. Segmentation Preparation

Once the recordings are normalized and checked, they are ready for segmentation into diphones. At that point, each file needs:

Word audio
   + IPA
   + phoneme sequence
   + diphone sequence
   = segmentation-ready item
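A segmentation-ready item can be represented as one small record that bundles these four parts. The sketch below is illustrative only: the class and field names are assumptions, the IPA string for "disha" is a placeholder and not a verified transcription, and the count check assumes the diphone sequence is padded with a boundary unit at both ends.

```python
from dataclasses import dataclass


@dataclass
class SegmentationItem:
    audio_path: str   # normalized word recording
    ipa: str          # IPA transcription of the word
    phonemes: list    # phoneme labels in order
    diphones: list    # diphone labels in order

    def problems(self):
        """Return reasons the item is not ready; an empty list means ready."""
        issues = []
        if not self.audio_path.lower().endswith(".wav"):
            issues.append("audio is not a WAV file")
        if not self.ipa:
            issues.append("missing IPA")
        # With boundary padding, n phonemes produce n + 1 diphones (assumption).
        if len(self.diphones) != len(self.phonemes) + 1:
            issues.append("diphone count does not match phoneme count")
        return issues


item = SegmentationItem(
    audio_path="norm_disha.wav",
    ipa="diʃa",  # placeholder transcription for illustration
    phonemes=["d", "i", "sh", "a"],
    diphones=["sil-d", "d-i", "i-sh", "sh-a", "a-sil"],
)
print(item.problems())  # []
```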
  

13. Recommended Workflow Summary

Prepare word list
   ↓
Record source words
   ↓
Trim silence
   ↓
Normalize format
   ↓
Check quality
   ↓
Pair with IPA and phonemes
   ↓
Segment into diphones
   ↓
Validate filenames and coverage
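The final validation step can be sketched as a comparison between the diphones the word list requires and the diphone files actually produced. The function names are hypothetical, and the sketch assumes diphone files are named `left-right.wav`, as in the `sil-d.wav` example on this page.

```python
from pathlib import Path


def diphone_labels(filenames):
    """Map diphone filenames such as 'sil-d.wav' to labels such as 'sil-d'."""
    return {
        Path(name).stem
        for name in filenames
        if name.lower().endswith(".wav") and "-" in Path(name).stem
    }


def coverage_report(required, filenames):
    """Compare required diphone labels with the files that exist."""
    have = diphone_labels(filenames)
    need = set(required)
    return {"missing": sorted(need - have), "unexpected": sorted(have - need)}


report = coverage_report(
    required=["sil-d", "d-i", "i-sh"],
    filenames=["sil-d.wav", "d-i.wav", "x-y.wav", "notes.txt"],
)
print(report)
# {'missing': ['i-sh'], 'unexpected': ['x-y']}
```

Missing entries point back to the word list (a new source word is needed), while unexpected entries usually indicate a naming or segmentation error.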
  

14. Common Recording Problems

Problem                      | Effect                                | Prevention
Changing microphone distance | Uneven loudness and tone              | Keep one fixed recording position
Mixed technical formats      | Inconsistent processing results       | Normalize everything before segmentation
Too much silence             | Poor segmentation and awkward playback | Trim carefully before segmentation
Overwriting files repeatedly | Filename confusion and clutter        | Use clean output folders
Recording random words       | Poor diphone coverage                 | Use a coverage-driven word list

15. Related Archive Pages

Article 6

Read the research chapter on recording and normalizing diphone audio.


Protocol note. This page should gradually grow into a fuller operational guide, including sample recording sheets, batch normalization scripts, validation checklists, and downloadable protocol documents.