Chapter 6 — Recording the Language: Building the Audio Corpus

Bishnupriya Manipuri Dictionary and Language Science Project

A dictionary preserves the written vocabulary of a language, but speech technology requires another essential resource: a reliable audio corpus.

To build a speech synthesis system for the Bishnupriya Manipuri language, the project required a large collection of recorded words and phonetic units.

Creating such a corpus presents several challenges, especially for languages with limited technological infrastructure.

This chapter describes the process of recording, normalizing, and preparing audio data for use in the Bishnupriya Manipuri speech system.

1. Why an Audio Corpus is Necessary

Text alone cannot represent the full structure of a spoken language.

Speech synthesis requires actual recordings of linguistic sounds so that the system can reconstruct pronunciation through audio units.

For the Bishnupriya Manipuri project, the goal was to create recordings that could be used to generate diphones, the basic building blocks of the speech system.

These recordings were derived primarily from dictionary entries so that the audio corpus remains directly connected to the lexical database.

2. Recording Dictionary Words

The first step in building the audio corpus was recording individual dictionary words.

Each word was pronounced clearly and recorded as an independent audio file.

Recording individual words offers several advantages: each file corresponds to a single dictionary entry, word boundaries are unambiguous, and the same recordings serve both educational purposes (pronunciation learning) and technological purposes (speech synthesis).

3. Audio Quality and Recording Conditions

A major challenge in building a speech corpus is maintaining consistent audio quality.

Even small differences in recording conditions can affect the naturalness of synthesized speech.

Several factors influence audio quality:

Microphone quality and placement
Background noise in the recording environment
Speaker consistency in volume, pace, and distance from the microphone
Recording-software settings such as sample rate and bit depth

For speech synthesis systems, it is especially important that all recordings share the same technical parameters.

4. Audio Normalization

During the recording process it quickly became clear that audio files differed in volume, sampling rate, and other technical properties.

These differences can produce unnatural transitions when audio segments are combined during speech synthesis.

To address this problem, all audio files were normalized to a consistent format.

Typical normalization parameters include:


Sample Rate: 44100 Hz
Channels: Mono
Bit Depth: 16-bit PCM
Volume Level: normalized to a consistent loudness

Normalization ensures that every audio file shares the same acoustic characteristics.
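The normalization step can be sketched as a thin wrapper around a command-line converter such as ffmpeg. This is an assumption for illustration; the chapter does not name the tool the project actually used. The function below only builds the argument list for the target format listed above, which keeps the parameters easy to test without running the converter:

```python
# Sketch of an audio-normalization step using the ffmpeg CLI.
# Assumption: ffmpeg is the conversion tool; the chapter does not
# specify the software actually used by the project.
import subprocess

TARGET_RATE = 44100         # Hz, per the parameter list above
TARGET_CHANNELS = 1         # mono
TARGET_CODEC = "pcm_s16le"  # 16-bit PCM

def normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that resamples to 44.1 kHz mono
    16-bit PCM and applies loudness normalization."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ar", str(TARGET_RATE),
        "-ac", str(TARGET_CHANNELS),
        "-c:a", TARGET_CODEC,
        "-af", "loudnorm",  # ffmpeg's EBU R128 loudness filter
        dst,
    ]

def normalize(src: str, dst: str) -> None:
    """Run the conversion, raising if ffmpeg reports an error."""
    subprocess.run(normalize_cmd(src, dst), check=True)
```

Separating command construction from execution also makes it simple to batch-convert an entire directory of recordings with the same parameters.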

5. Segmenting Recordings into Diphones

Once recordings were normalized, the next step was to extract diphone segments.

A diphone represents the transition between two adjacent phonemes.

For example:


Word: দিশা

Phonemes:
d – i – ʃ – aː

Diphones:
#-d
d-i
i-ʃ
ʃ-aː
aː-#

Each diphone must correspond to a specific portion of the recorded waveform.

Segmenting audio accurately is a delicate process, because even small timing differences can affect the naturalness of synthesized speech.
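The phoneme-to-diphone expansion shown above is mechanical and easy to express in code. A minimal sketch, using the same `#` boundary marker as the example:

```python
def to_diphones(phonemes: list[str], boundary: str = "#") -> list[str]:
    """Expand a phoneme sequence into diphone labels, including
    the silence-to-speech transitions at both word edges."""
    padded = [boundary] + phonemes + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# The word দিশা from the example above:
print(to_diphones(["d", "i", "ʃ", "aː"]))
# → ['#-d', 'd-i', 'i-ʃ', 'ʃ-aː', 'aː-#']
```

Note that a word with n phonemes always yields n + 1 diphones, because the word-edge transitions are counted as well.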

6. Challenges of Automatic Segmentation

Automatic segmentation tools can, in principle, divide audio recordings into phonetic segments with far less manual effort than hand labeling.

However, such tools are often trained on major languages and may not perform reliably on Bishnupriya Manipuri data.

Several challenges arise during segmentation:

Phoneme boundaries are gradual rather than sharp
Coarticulation blurs transitions between adjacent sounds
Pretrained acoustic models for Bishnupriya Manipuri are scarce
Noise and volume differences complicate boundary detection

Because of these difficulties, manual inspection and correction are often necessary.

7. Building the Diphone Inventory

After segmentation, the extracted diphones are organized into a diphone inventory.

This inventory represents the set of phoneme transitions required to produce the sounds of the language.

For each diphone, the system stores:

The phoneme-pair label (for example, d-i)
A reference to the source recording
The start and end times of the segment within that recording

The completeness of the diphone inventory directly affects the quality and coverage of the speech synthesis system.
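One way to represent such an inventory is a mapping from each diphone label to the location of its segment in a source recording. The record layout below is hypothetical, sketched for illustration; the chapter does not describe the project's actual storage format:

```python
from dataclasses import dataclass

@dataclass
class DiphoneSegment:
    # Hypothetical record layout for one inventory entry.
    label: str         # phoneme-pair label, e.g. "d-i"
    source_file: str   # normalized recording the cut comes from
    start: float       # segment start time, in seconds
    end: float         # segment end time, in seconds

# The inventory maps each diphone label to its segment.
inventory: dict[str, DiphoneSegment] = {}

def add_segment(seg: DiphoneSegment) -> None:
    inventory[seg.label] = seg

def coverage(required: set[str]) -> float:
    """Fraction of the required diphones present in the inventory."""
    if not required:
        return 1.0
    return len(required & inventory.keys()) / len(required)
```

A coverage function of this kind gives a direct numeric measure of how complete the inventory is against the set of diphones the language requires.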

8. Audio Validation

To ensure reliability, each diphone recording must be validated.

Validation checks include:

Confirming that a recording exists for every expected diphone
Verifying that files share the required format (sample rate, channels, bit depth)
Detecting silent, clipped, or mislabeled segments

Automated validator tools can detect missing or inconsistent diphone files within the system.
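A validator of this kind can be sketched as two small checks: one comparing the expected diphone labels against the files actually present, and one verifying that a WAV file matches the corpus format. The function names and default parameters are illustrative assumptions:

```python
import wave

def find_missing(required: set[str], present: set[str]) -> set[str]:
    """Diphones that should exist but have no audio file."""
    return required - present

def check_format(path: str, rate: int = 44100, channels: int = 1) -> bool:
    """Verify a WAV file is 44.1 kHz mono 16-bit PCM,
    the corpus format given in the normalization section."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```

Running these checks after every corpus rebuild catches gaps and format drift before they reach the synthesis stage.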

9. Rebuilding the Speech Corpus

Speech systems are rarely completed in a single step.

As new recordings are added and segmentation improves, the diphone inventory may need to be rebuilt.

A typical rebuild workflow includes:


1. Record new audio
2. Normalize audio files
3. Segment recordings
4. Generate diphone files
5. Validate diphone coverage
6. Rebuild playback system

Through repeated refinement, the speech corpus gradually becomes more complete and more natural.
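The six-step workflow above can be sketched as a simple ordered pipeline of stage functions. The stage bodies here are placeholders; in practice each would call into the tools described earlier in the chapter:

```python
from typing import Callable

def rebuild(stages: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run each rebuild stage in order; return a log of completed steps."""
    log = []
    for name, run in stages:
        run()          # a failing stage raises and halts the rebuild
        log.append(name)
    return log

# Hypothetical stage names mirroring the workflow above.
STAGES = [
    ("record", lambda: None),
    ("normalize", lambda: None),
    ("segment", lambda: None),
    ("generate_diphones", lambda: None),
    ("validate", lambda: None),
    ("rebuild_playback", lambda: None),
]
```

Keeping the stages as an explicit ordered list makes it easy to rerun the pipeline from any point when only later steps need to be repeated.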

10. Toward a Sustainable Speech Corpus

The long-term goal of the project is to create a sustainable audio corpus that can support multiple linguistic applications.

These include:

Speech synthesis and pronunciation playback
Language-learning and teaching materials
Phonetic and dialect research
Future speech-recognition work

By combining dictionary data with carefully recorded audio resources, the project establishes a foundation for future language technology development.

Building a speech corpus for a language with limited digital resources requires persistence and careful work. The recordings produced for this project represent not only technical data but also an important cultural record of the living sound of the Bishnupriya Manipuri language.