Bishnupriya Manipuri Dictionary and Language Science Project

বিষ্ণুপ্রিয়া মণিপুরী ওয়াহিকলা বারো ঠারবিজ্ঞান

Academic Volume

Lexicography, Digitization, Phonology, Diphone Synthesis, and Digital Language Preservation

by
Uttam Singha

Bishnupriya Manipuri Research Archive

Abstract. This volume documents the Bishnupriya Manipuri Dictionary and Language Science Project, bringing together dictionary history, orthographic variation, data collection, digitization, lexical corpus building, pronunciation modeling, diphone system design, validator workflows, digital dictionary architecture, and future directions in language technology and preservation.

Preface

The Bishnupriya Manipuri language possesses a rich literary and cultural tradition, yet its digital linguistic resources remain limited. The purpose of this project is to build a structured digital dictionary while also developing a broader computational framework for pronunciation modeling and speech technology.

The project draws primarily on the dictionary of Dr. K. P. Sinha and the dictionary of L. K. Sinha and Santosh Sinha, which represent different spelling traditions. Rather than forcing all entries into a single orthographic standard, the project preserves both traditions as part of the linguistic record.

Over more than a year of continuous work, the project has involved data collection, digitization, correction of lexical entries, IPA rule development, phoneme analysis, diphone system design, audio corpus building, validation workflows, and web-based dictionary development.

The result is both a digital dictionary and a language science platform intended to support preservation, research, and future technological development for the Bishnupriya Manipuri language.

Contents

Preface
Chapter 1 — Introduction to the Dictionary Project
Chapter 2 — History of Bishnupriya Manipuri Dictionaries
Chapter 3 — Orthographic Variation and Spelling Schools
Chapter 4 — Data Collection and Lexical Corpus Building
Chapter 5 — From Dictionary to Language Technology
Chapter 6 — Recording the Language: Building the Audio Corpus
Chapter 7 — Designing the Bishnupriya Manipuri Diphone System
Chapter 8 — Validator and Rebuild Workflow
Chapter 9 — The Digital Bishnupriya Manipuri Dictionary Platform
Chapter 10 — The Future of Bishnupriya Manipuri Language Technology
Appendix A — BPM to IPA Conversion Rules
Appendix B — Phoneme Inventory
Appendix C — Master Diphone Inventory
Appendix D — Recording Protocol
Appendix E — Safe Filename Mapping
Appendix F — System Architecture
Glossary of Linguistic Terms
Bibliography / References
Index of Technical Terms
List of Figures and Tables

Chapter 1 — Introduction to the Dictionary Project

The Bishnupriya Manipuri Dictionary and Language Science Project (বিষ্ণুপ্রিয়া মণিপুরী ওয়াহিকলা বারো ঠারবিজ্ঞান) is an ongoing effort to collect, digitize, analyze, and expand the lexical resources of the Bishnupriya Manipuri language.

The project has two closely related goals. First, it aims to build a comprehensive digital dictionary that preserves the vocabulary of the language and makes it accessible for modern users. Second, it aims to study the structure of the language through computational methods such as phonological analysis, pronunciation modeling, and speech synthesis.

Although Bishnupriya Manipuri has a long literary history, its lexical resources are scattered across printed dictionaries, personal collections, and community knowledge. Many of these sources exist only in printed form and have not yet been systematically digitized.

This project therefore attempts to bring together several important dictionaries into a unified digital framework while documenting their linguistic differences and orthographic traditions.

Source Dictionaries

The present project draws primarily on two printed dictionaries, supplemented by a third body of lexical material.

Dr. K. P. Sinha’s Bishnupriya Manipuri–English Dictionary. This dictionary provides a valuable bridge between Bishnupriya Manipuri and English vocabulary and is widely used by students and researchers.
L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri–Bishnupriya Manipuri Dictionary. This work focuses on lexical explanation within the language itself and reflects another important orthographic tradition.
Additional lexical material gathered from community usage, literary texts, and personal word lists.

The two printed dictionaries represent different spelling schools within the language. Rather than erasing one tradition in favor of the other, this project intentionally preserves both.

The goal is not to impose a single orthographic authority, but to document linguistic reality and allow future generations to evaluate and refine the writing system.

Orthographic Diversity

Bishnupriya Manipuri has developed multiple orthographic traditions over time, influenced by Sanskrit scholarship, Bengali orthography, regional usage, and modern linguistic analysis.

The two dictionaries used in this project reflect these different traditions. Consequently, many words appear with slightly different spellings, phonological interpretations, or morphological analyses.

Instead of forcing all entries into a single standardized spelling, the digital dictionary records these variants whenever possible.

This approach allows the dictionary to serve both linguistic research and community usage without prematurely resolving debates that are still evolving within the language community.

Data Collection and Ongoing Work

Building a digital dictionary is not simply a matter of copying words from printed sources. Each entry must be verified, normalized, and sometimes corrected.

For more than a year, the present project has involved continuous work collecting, reviewing, and refining lexical data. This includes:

manual digitization of printed dictionaries
OCR correction
removal of typographic errors
normalization of spelling variants
identification of duplicate entries
verification of meanings

Even after extensive processing, the dictionary remains a living resource that continues to grow and improve as new entries are discovered and existing entries are refined.

Language Science Goals

Beyond lexicography, the dictionary project also supports broader language science goals.

The structured lexical database enables computational research in areas such as:

phonological analysis
automatic IPA generation
phoneme and diphone inventory development
speech synthesis
digital language preservation

These technological tools are not intended to replace traditional language study. Rather, they provide new ways to document, analyze, and teach the language in the digital age.

A Living Dictionary

The Bishnupriya Manipuri Dictionary and Language Science Project should be understood as an evolving archive rather than a finished book.

As new words, variants, and linguistic insights emerge, the dictionary will continue to expand and improve.

The hope is that this work will provide a foundation for future scholars, speakers, and developers who wish to preserve and advance the Bishnupriya Manipuri language.

Chapter 2 — History of Bishnupriya Manipuri Dictionaries

The documentation of vocabulary is one of the most important foundations for the preservation and study of any language. For the Bishnupriya Manipuri language, dictionaries have played a crucial role in recording the lexical richness of the language and transmitting knowledge from one generation to the next.

However, the development of Bishnupriya Manipuri dictionaries has not followed a single unified tradition. Instead, it reflects different scholarly approaches, orthographic preferences, and historical circumstances within the language community.

The modern digital dictionary project therefore stands on the work of earlier scholars who attempted to collect and organize the vocabulary of the language through printed dictionaries.

Early Lexical Documentation

For a long time, Bishnupriya Manipuri vocabulary was transmitted primarily through oral tradition and literary usage rather than through formal lexicographic works.

The language possesses a rich body of poetry, devotional literature, and cultural expression, but systematic dictionary compilation began only in the modern period when scholars and teachers recognized the importance of documenting the language in written reference form.

Early lexical efforts were often limited in scope and circulated within smaller educational or literary circles. Nevertheless, these early attempts established an important foundation for later dictionary projects.

Dr. K. P. Sinha’s Bishnupriya Manipuri–English Dictionary

One of the most influential contributions to Bishnupriya Manipuri lexicography is the Bishnupriya Manipuri–English dictionary compiled by Dr. K. P. Sinha.

This dictionary represents a significant effort to connect the vocabulary of Bishnupriya Manipuri with English explanations. Such bilingual dictionaries are especially valuable because they allow the language to reach a wider academic audience and support students who study the language in multilingual environments.

Dr. Sinha’s work provides a large number of lexical entries along with English meanings, making it an important reference for both linguistic research and educational use.

For the present digital project, this dictionary serves as one of the primary lexical sources. Many entries in the digital dictionary originate from this work, though they often require careful verification, normalization, and correction during digitization.

L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri Dictionary

Another important lexicographic contribution is the dictionary compiled by L. K. Sinha and Santosh Sinha.

Unlike the bilingual dictionary of Dr. K. P. Sinha, this work focuses on explanations within the Bishnupriya Manipuri language itself.

Such monolingual dictionaries are extremely valuable because they reflect how speakers of the language define and interpret words internally, rather than through translation into another language.

The entries in this dictionary often provide insights into usage, semantic nuance, and traditional interpretations that might not appear in bilingual dictionaries.

For this reason, the digital dictionary project incorporates lexical material from this source alongside entries from the Bishnupriya Manipuri–English dictionary.

Two Spelling Schools

An important feature of Bishnupriya Manipuri lexicography is the existence of multiple orthographic traditions.

The dictionaries used in this project reflect two different spelling schools that developed within the language community. These differences may involve:

alternative representations of vowels
different treatments of consonant clusters
variant spellings inherited from Sanskrit traditions
regional or scholarly preferences

Such variation is not unusual in languages with rich literary traditions and evolving writing practices.

Rather than attempting to force all entries into a single standardized spelling, the digital dictionary project takes a more descriptive approach.

Both spelling traditions are recorded and preserved wherever possible.

Future generations of speakers, scholars, and educators can then evaluate these traditions and decide which forms best represent the language in modern usage.

The Challenge of Digitizing Printed Dictionaries

Transforming printed dictionaries into a digital lexical database is a complex and time-consuming task.

Printed dictionaries often contain typographical inconsistencies, irregular formatting, and scanning artifacts when processed through optical character recognition (OCR).

As a result, the digitization process involves much more than simply copying text into a database.

Each entry must be carefully reviewed to ensure that:

the spelling is correct
the meaning is properly captured
the entry structure is consistent
duplicate entries are identified
OCR errors are corrected

In many cases, manual correction is unavoidable.

An Ongoing Process

The present dictionary project has involved more than a year of continuous work collecting and refining lexical data. Even after significant progress, the work remains ongoing.

New entries are discovered, existing entries require correction, and spelling variants must be carefully evaluated.

The goal of the project is therefore not to produce a final unchanging dictionary, but to create a living lexical archive that can grow and improve over time.

Importance for Language Preservation

Dictionaries play a crucial role in preserving the vocabulary and cultural knowledge embedded in a language.

For communities whose languages are not widely represented in major global databases, digital dictionaries become especially important tools for documentation and revitalization.

The Bishnupriya Manipuri Dictionary and Language Science Project aims not only to preserve lexical knowledge but also to connect that knowledge with modern language technologies such as pronunciation modeling and speech synthesis.

In this way, the dictionary becomes both a reference resource and a foundation for future linguistic research.

Chapter 3 — Orthographic Variation and Spelling Schools

One of the distinctive features of the Bishnupriya Manipuri language is the presence of multiple orthographic traditions. Unlike languages that have undergone strict spelling standardization, Bishnupriya Manipuri has evolved through several writing practices influenced by literary traditions, regional usage, and scholarly interpretation.

These variations appear not only in casual writing but also in formal lexicographic works such as dictionaries. As a result, different dictionaries sometimes represent the same word with slightly different spellings.

Understanding this orthographic diversity is essential for anyone working with the language, particularly in the creation of digital lexical resources.

Origins of Orthographic Diversity

The spelling variation in Bishnupriya Manipuri arises from several historical and linguistic factors.

Influence of Sanskrit vocabulary and grammatical traditions
Interaction with Bengali orthographic conventions
Regional pronunciation differences
Scholarly attempts to represent phonology more accurately
Different editorial preferences among dictionary compilers

Because the language historically developed across different regions and scholarly traditions, it is natural that multiple spelling approaches emerged.

In many cases these differences are not contradictions but alternative ways of representing the same linguistic structure.

The Two Major Spelling Schools

The dictionaries used in the present digital project reflect two main spelling traditions within the Bishnupriya Manipuri language.

One tradition is represented in the Bishnupriya Manipuri–English dictionary compiled by Dr. K. P. Sinha. This approach often reflects conventions influenced by earlier scholarly practices and attempts to align spelling with classical linguistic traditions.

Another spelling tradition appears in the dictionary compiled by L. K. Sinha and Santosh Sinha. This dictionary sometimes reflects a slightly different orthographic approach, including alternative spellings or phonological interpretations.

Both of these works are important scholarly contributions, and each reflects legitimate linguistic perspectives.

Examples of Spelling Variation

Orthographic variation may appear in several forms.

Alternative representations of vowel length
Different treatment of consonant clusters
Variation in Sanskrit-derived spellings
Differences in the use of conjunct consonants
Alternative representations of phonological processes such as schwa deletion

For example, a particular word may appear with one spelling in one dictionary and a slightly different spelling in another. Both spellings may reflect the same underlying pronunciation.

In a printed dictionary, editors sometimes choose one form and omit the other. However, a digital dictionary allows both forms to be recorded and linked together.

Why the Digital Dictionary Preserves Both Traditions

The present dictionary project intentionally avoids forcing all entries into a single orthographic standard.

Instead, the database records words from both spelling schools whenever they appear in reliable sources.

There are several reasons for this decision.

The language community has not yet reached a universal spelling standard.
Both traditions represent genuine historical scholarship.
Forcing a single spelling may erase important linguistic information.
Future scholars may develop improved orthographic standards.

By preserving both traditions, the digital dictionary functions not only as a reference tool but also as a linguistic archive.

Role of Computational Analysis

Modern language technology provides new tools for analyzing orthographic variation.

Through computational methods such as phonological conversion and pronunciation modeling, it becomes possible to compare different spellings at a deeper structural level.

For example, the digital dictionary project includes a rule-based system that converts Bishnupriya Manipuri words into the International Phonetic Alphabet (IPA).

When different spellings produce the same phonological output, they can be understood as orthographic variants rather than completely separate lexical items.

Such tools help researchers understand how different spelling traditions relate to the actual pronunciation of the language.
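The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's implementation: the function name `group_variants` is hypothetical, the spellings are placeholder strings rather than real entries, and the IPA values are assumed to have been produced already by the rule-based converter.

```python
from collections import defaultdict

def group_variants(spelling_to_ipa):
    """Group spellings that share one IPA form (candidate orthographic variants)."""
    groups = defaultdict(list)
    for spelling, ipa in spelling_to_ipa.items():
        groups[ipa].append(spelling)
    # Keep only IPA forms that are reached by more than one spelling.
    return {ipa: sorted(forms) for ipa, forms in groups.items() if len(forms) > 1}

# Placeholder data: "spelling_A1" and "spelling_A2" stand for two attested
# spellings of one word; they are not real dictionary entries.
sample = {
    "spelling_A1": "diʃaː",
    "spelling_A2": "diʃaː",
    "spelling_B": "other",
}
variants = group_variants(sample)
```

A shared IPA form only marks two spellings as candidates for linking. Homophones with different meanings would also share an IPA form, so an editor still has to confirm each group.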

Implications for Dictionary Design

Orthographic diversity creates several challenges when designing a digital dictionary.

duplicate entries must be identified
variant spellings must be linked together
search systems must recognize multiple forms of a word
pronunciation models must operate independently of spelling differences

To address these issues, the dictionary database includes fields for storing the original spelling, normalized forms, and phonological representations.

This layered structure allows the dictionary to support both traditional spelling forms and modern computational analysis.
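One way to picture this layered structure is a record that keeps every printed spelling next to a normalized form and an IPA string, plus a lookup table that resolves any spelling to the same entry. The sketch below is a hypothetical illustration: the class name, field names, and placeholder values are assumptions and do not describe the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class LayeredEntry:
    entry_id: int
    original_spellings: list   # spellings exactly as printed in each source
    normalized_form: str       # form used internally for grouping
    ipa: str                   # phonological representation

def build_search_index(entries):
    """Map every known spelling (original or normalized) to the ids of its entries."""
    index = {}
    for entry in entries:
        for form in [entry.normalized_form, *entry.original_spellings]:
            index.setdefault(form, set()).add(entry.entry_id)
    return index

# Placeholder strings stand in for real spellings from the two traditions.
entries = [
    LayeredEntry(1, ["variant_1", "variant_2"], "variant_1", "diʃaː"),
    LayeredEntry(2, ["other_word"], "other_word", "other"),
]
index = build_search_index(entries)
```

With such an index, a search for either spelling of an entry returns the same record, while the original spellings remain stored unchanged.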

Future Development of Orthographic Standards

The existence of multiple spelling traditions should not be viewed as a weakness of the language. Rather, it reflects the natural historical development of a living linguistic community.

Over time, communities often move toward greater standardization through education, publishing, and digital communication.

However, such standardization should emerge through careful linguistic study and community consensus rather than through premature simplification.

By preserving multiple spelling traditions today, the digital dictionary project provides a foundation upon which future scholars and speakers can build more refined orthographic standards.

A Descriptive Approach to Language Documentation

The guiding philosophy of the Bishnupriya Manipuri Dictionary and Language Science Project is descriptive rather than prescriptive.

The goal is not to dictate how the language should be written, but to document how it has been written and used by different scholars and communities.

In doing so, the dictionary becomes a historical record of linguistic development as well as a practical tool for contemporary users.

Chapter 4 — Data Collection and Lexical Corpus Building

The foundation of any dictionary project is the collection of lexical data. For the Bishnupriya Manipuri Dictionary and Language Science Project, data collection has been one of the most challenging and time-consuming stages.

Unlike languages that already possess large digital corpora, most Bishnupriya Manipuri lexical material exists only in printed books, handwritten notes, or scattered personal collections. As a result, the first step of the project involved locating and gathering reliable lexical sources.

The goal was not merely to digitize one dictionary, but to create a unified lexical corpus that combines multiple sources while preserving their linguistic differences.

Primary Dictionary Sources

The digital dictionary project draws on three lexical sources: two printed dictionaries and a body of supplementary material.

Dr. K. P. Sinha’s Bishnupriya Manipuri–English dictionary, which provides bilingual lexical explanations.
L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri dictionary, which provides definitions within the language itself.
Additional lexical material collected from community usage, literary texts, and personal word lists.

Each of these sources contributes valuable information, but they also present different orthographic traditions and organizational structures.

Therefore, the process of building a unified digital corpus requires careful comparison and normalization of entries.

Digitizing Printed Dictionaries

The transition from printed dictionaries to a digital database is not a straightforward process.

Printed dictionaries often contain complex page layouts, including columns, abbreviations, cross-references, and specialized formatting. These features can make automated digitization difficult.

To convert the printed dictionaries into machine-readable form, several steps were required:

high-resolution scanning of printed pages
optical character recognition (OCR)
manual verification of OCR output
reconstruction of entry structures

Although OCR technology can accelerate digitization, it often introduces errors when processing Indic scripts or complex typographic layouts.

OCR Challenges

One of the most significant obstacles in digitizing Bishnupriya Manipuri dictionaries is the limitation of OCR technology.

OCR systems are typically optimized for widely used languages and scripts, and they may struggle with specialized orthographic features found in Bishnupriya Manipuri texts.

Common OCR problems include:

incorrect recognition of consonant conjuncts
confusion between visually similar characters
missing diacritic marks
broken words across line boundaries
misinterpretation of punctuation

As a result, every OCR-generated entry must be reviewed manually before it can be added to the lexical database.
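Simple automatic checks can help an editor decide which entries to look at first, although they do not replace the manual review. The sketch below assumes that headwords are written in the Bengali script block of Unicode (U+0980 to U+09FF), as the example words in this volume are. The function name `needs_review` and the specific checks are illustrative assumptions.

```python
def needs_review(headword):
    """Return reasons why an OCR-produced headword looks suspicious."""
    reasons = []
    if not headword.strip():
        reasons.append("empty")
    for ch in headword:
        in_bengali_block = "\u0980" <= ch <= "\u09ff"
        # Space, hyphen, and the zero-width (non-)joiners are tolerated.
        if not (in_bengali_block or ch in " -\u200c\u200d"):
            reasons.append(f"unexpected character U+{ord(ch):04X}")
    # A dependent vowel sign at the start has no consonant to attach to.
    if headword and "\u09be" <= headword[0] <= "\u09cc":
        reasons.append("starts with a dependent vowel sign")
    return reasons

clean = needs_review("\u09a6\u09bf\u09b6\u09be")       # the word দিশা
stray = needs_review("\u09a6\u09bfl\u09b6\u09be")      # a Latin "l" slipped in
```

An empty result means only that no obvious problem was found. Confusion between visually similar characters, for instance, cannot be detected this way and still requires comparison with the printed page.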

Manual Correction and Verification

Because of the limitations of OCR, manual correction forms a central part of the digitization process.

Each dictionary entry must be carefully examined to ensure that:

the word is spelled correctly
the meaning is accurately captured
cross-references are preserved
typographical errors are removed

This process requires not only technical attention but also linguistic judgment. In many cases, the editor must consult multiple sources to confirm the correct spelling or meaning of a word.

Identifying Duplicate Entries

When multiple dictionaries are combined into a single database, duplicate entries naturally appear.

However, identifying duplicates is not always straightforward. Two entries may appear different at first glance because of spelling variation, yet represent the same lexical item.

For this reason, the project employs several strategies to detect potential duplicates:

comparison of normalized spellings
phonological comparison through IPA conversion
manual review of similar entries

This process helps maintain the integrity of the lexical corpus while preserving meaningful spelling variants.
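The first strategy, comparison of normalized spellings, can be illustrated with Unicode normalization, which makes differently encoded but visually identical strings compare as equal. This is a minimal sketch under assumptions: the function names and source labels are hypothetical, and the sample strings are character sequences chosen to show the encoding issue, not dictionary entries.

```python
import unicodedata
from collections import defaultdict

def normalize_spelling(word):
    """Canonical Unicode form (NFC) with surrounding whitespace removed."""
    return unicodedata.normalize("NFC", word.strip())

def duplicate_candidates(rows):
    """rows: iterable of (source, headword). Return headwords seen in 2+ rows."""
    seen = defaultdict(list)
    for source, headword in rows:
        seen[normalize_spelling(headword)].append(source)
    return {word: sources for word, sources in seen.items() if len(sources) > 1}

rows = [
    ("source_1", "\u0995\u09cb"),          # vowel sign O as one code point
    ("source_2", "\u0995\u09c7\u09be "),   # same sign typed as two parts, plus a space
    ("source_1", "\u09a6\u09bf\u09b6\u09be"),
]
candidates = duplicate_candidates(rows)
```

This step catches only encoding-level duplicates. Variants that differ between spelling schools have genuinely different characters, which is why the IPA comparison and manual review remain necessary.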

Building the Lexical Database

Once entries have been digitized and verified, they are stored in a structured database.

Each entry typically contains several fields, including:

the original Bishnupriya Manipuri word
meaning or definition
source dictionary
part of speech
phonological representation
cross-references

This structured format allows the dictionary to support both traditional lexical lookup and advanced computational analysis.
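As an illustration, the fields listed above could be held in a small record type and serialized for storage or exchange. The class and field names below are assumptions made for this sketch, and the meaning and source values are placeholders rather than data from either dictionary.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DictionaryEntry:
    headword: str                 # original Bishnupriya Manipuri word
    meaning: str
    source: str                   # which dictionary the entry came from
    part_of_speech: str = ""
    ipa: str = ""                 # phonological representation
    cross_references: list = field(default_factory=list)

entry = DictionaryEntry(
    headword="\u09a6\u09bf\u09b6\u09be",          # দিশা
    meaning="(meaning as given in the source)",
    source="(source dictionary label)",
    ipa="diʃaː",
)
# ensure_ascii=False keeps the script and IPA symbols readable in the output.
record = json.dumps(asdict(entry), ensure_ascii=False)
```

Keeping the source as its own field means that entries from different dictionaries can share one table without losing track of where each spelling came from.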

A Year of Continuous Work

The creation of the digital dictionary has involved more than a year of continuous data collection and correction.

During this period, thousands of entries have been reviewed, corrected, and organized. Even after extensive work, the process remains ongoing.

Language documentation is rarely finished. New words appear, existing entries require refinement, and additional sources may become available.

The project therefore treats the dictionary not as a fixed product, but as a living and evolving lexical archive.

Toward a Digital Language Corpus

Beyond the immediate goal of building a dictionary, the lexical database also serves as the foundation for a broader digital language corpus.

Such a corpus can support research in areas such as:

phonological analysis
lexical frequency studies
automatic pronunciation generation
speech synthesis
language preservation

By transforming traditional dictionaries into a structured digital corpus, the project opens new possibilities for linguistic research and technological development.

Chapter 5 — From Dictionary to Language Technology

A traditional dictionary records words and meanings. However, when lexical data is organized in a structured digital database, it becomes possible to connect dictionary entries to computational language technology.

The Bishnupriya Manipuri Dictionary and Language Science Project extends beyond simple lexicography. The digital dictionary forms the foundation for pronunciation modeling, phonological analysis, and speech synthesis.

In this way, the dictionary becomes not only a reference work but also the core infrastructure for modern language technology.

1. Dictionary as a Linguistic Database

Printed dictionaries typically present words in alphabetical order with definitions and occasional grammatical notes. While this format is useful for human readers, it is not optimized for computational analysis.

The digital dictionary reorganizes lexical data into a structured database format. Each entry may contain several fields such as:

headword (Bishnupriya Manipuri spelling)
meaning or translation
part of speech
source dictionary
phonological representation
cross-references

This structure allows computers to analyze and process lexical information systematically.

Once the dictionary becomes a database, it can support linguistic analysis and automated tools.

2. Generating Pronunciation (BPM → IPA)

One of the most important steps in connecting a dictionary to language technology is the conversion of written words into phonological representation.

In this project, Bishnupriya Manipuri words are converted into the International Phonetic Alphabet (IPA).

This conversion is performed using a rule-based system that analyzes the orthography of the word and applies phonological rules such as:

vowel mapping
consonant cluster interpretation
schwa deletion
special phonological exceptions

The resulting IPA representation reflects how the word is expected to be pronounced.

Because the rules are computational, the system can generate pronunciation automatically for thousands of dictionary entries.
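The general shape of such a rule-based converter can be shown with a deliberately tiny example. The table below covers only the four characters of the word দিশা, using the correspondences implied by the IPA form diʃaː given in this volume. It is not the project's rule set: cluster handling, schwa deletion, and exceptions are left out, and the names `CONSONANTS`, `VOWEL_SIGNS`, and `to_ipa` are hypothetical.

```python
# Toy rule table for the characters of one example word only.
CONSONANTS = {"\u09a6": "d", "\u09b6": "ʃ"}       # দ, শ
VOWEL_SIGNS = {"\u09bf": "i", "\u09be": "aː"}     # ি, া

def to_ipa(word):
    """Convert a word with the toy table; raise on anything it does not cover."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            raise ValueError(f"no rule for U+{ord(ch):04X}")
    return "".join(out)

EXAMPLE = "\u09a6\u09bf\u09b6\u09be"   # দিশা
ipa = to_ipa(EXAMPLE)
```

Raising an error for uncovered characters, instead of guessing, makes gaps in the rule table visible during batch conversion of many entries.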

3. From IPA to Phonemes

Once a word has been converted into IPA, the next step is to analyze the phonological structure of the pronunciation.

This process involves identifying the individual phonemes that compose the word.

For example:

Word: দিশা

IPA: diʃaː

Phoneme sequence:
d – i – ʃ – aː

Phoneme analysis allows the system to examine how sounds combine to form syllables and words.

This stage also forms the basis for constructing the diphone inventory used by the speech synthesis system.
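A minimal sketch of this step is shown below. It assumes that each phoneme is a single IPA symbol, optionally followed by the length mark "ː"; the function name `split_phonemes` is hypothetical. Phonemes written with more than one symbol would need a tokenizer driven by the phoneme inventory instead.

```python
LENGTH_MARK = "\u02d0"  # IPA length mark "ː"

def split_phonemes(ipa):
    """Split an IPA string into phonemes, keeping "ː" with the preceding symbol."""
    phonemes = []
    for ch in ipa:
        if ch == LENGTH_MARK and phonemes:
            phonemes[-1] += ch      # attach length to the vowel before it
        else:
            phonemes.append(ch)
    return phonemes

phonemes = split_phonemes("di\u0283a\u02d0")   # the IPA form diʃaː
```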

4. Building Diphones

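A diphone is a unit of recorded speech that covers the transition between two adjacent sounds, conventionally cut from the middle of one sound to the middle of the next.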

Instead of recording every possible word, a diphone speech system records transitions between phonemes.

Example:

Phonemes:
d – i – ʃ – aː

Diphones:
#-d
d-i
i-ʃ
ʃ-aː
aː-#

By recording these transitions, the system can reconstruct the pronunciation of many different words.

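This approach greatly reduces the number of audio recordings needed to build a speech system: a language with N phonemes has at most about N × N possible transitions (plus those involving the word boundary), a fixed inventory that does not grow with the size of the vocabulary.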
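The derivation of diphones from a phoneme sequence is mechanical, as the following sketch shows. It reproduces the notation of the example above, with "#" marking the word boundary; the function name `to_diphones` is an assumption.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphone labels for a phoneme sequence, including word boundaries."""
    padded = [boundary, *phonemes, boundary]
    # Pair each symbol with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

diphones = to_diphones(["d", "i", "\u0283", "a\u02d0"])   # d, i, ʃ, aː
```

A word of n phonemes always yields n + 1 diphones, because the two boundary transitions are included.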

5. Dictionary-Driven Speech Synthesis

In the Bishnupriya Manipuri speech system, dictionary entries provide the starting point for generating spoken output.

The general pipeline is:

Dictionary Word
      ↓
Orthographic Analysis
      ↓
BPM → IPA Conversion
      ↓
Phoneme Extraction
      ↓
Diphone Sequence Generation
      ↓
Audio Playback

When a user clicks a word in the digital dictionary, the system automatically generates the required diphone sequence and plays the corresponding audio segments.

This architecture connects lexicography directly with speech technology.
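The last stage of the pipeline, turning a diphone sequence into one audio stream, can be sketched with Python's standard `wave` module. This is an assumption-laden illustration: the function name and file layout are hypothetical, and plain end-to-end joining is the simplest possible method, whereas a production system would usually smooth the joins.

```python
import wave

def concatenate_wavs(paths, out_path):
    """Join WAV files end to end; all inputs must share one audio format."""
    frames = []
    fmt = None
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            this_fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"format mismatch in {path}")
            frames.append(wav.readframes(wav.getnframes()))
    if fmt is None:
        raise ValueError("no input files")
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return out_path
```

The format check is the reason consistent recording parameters matter: segments with different sample rates or channel counts cannot be joined directly.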

6. Why Dictionaries Matter for Speech Technology

Modern speech systems often rely on large datasets, machine learning models, and massive audio corpora.

For languages with limited digital resources, however, dictionaries can serve as the primary foundation for building language technology.

A well-structured dictionary provides:

a comprehensive word list
semantic information
grammatical categories
a basis for pronunciation modeling

By linking lexical entries to phonological representation and audio units, the dictionary becomes the central hub of the speech system.

7. Toward Integrated Language Infrastructure

The long-term vision of the Bishnupriya Manipuri Dictionary and Language Science Project is to create an integrated language infrastructure.

In such a system, the dictionary supports multiple functions:

lexical reference
pronunciation learning
speech synthesis
linguistic research
digital language preservation

Through the integration of lexicography and technology, the project demonstrates how traditional dictionary work can evolve into a broader platform for language science.

This chapter illustrates how a digital dictionary can evolve from a traditional reference work into the core infrastructure of language technology. By connecting lexical data with phonological modeling and speech synthesis, the dictionary becomes both a scholarly resource and a technological platform for the preservation and development of the Bishnupriya Manipuri language.

Chapter 6 — Recording the Language: Building the Audio Corpus

A dictionary preserves the written vocabulary of a language, but speech technology requires another essential resource: a reliable audio corpus.

To build a speech synthesis system for the Bishnupriya Manipuri language, the project required a large collection of recorded words and phonetic units.

Creating such a corpus presents several challenges, especially for languages with limited technological infrastructure.

This chapter describes the process of recording, normalizing, and preparing audio data for use in the Bishnupriya Manipuri speech system.

1. Why an Audio Corpus is Necessary

Text alone cannot represent the full structure of a spoken language.

Speech synthesis requires actual recordings of linguistic sounds so that the system can reconstruct pronunciation through audio units.

For the Bishnupriya Manipuri project, the goal was to create recordings that could be used to generate diphones, the basic building blocks of the speech system.

These recordings were derived primarily from dictionary entries so that the audio corpus remains directly connected to the lexical database.

2. Recording Dictionary Words

The first step in building the audio corpus was recording individual dictionary words.

Each word was pronounced clearly and recorded as an independent audio file.

Recording individual words provides several advantages:

clear pronunciation of lexical entries
consistent recording conditions
flexibility for segmentation
reusability across multiple linguistic analyses

These recordings serve both educational purposes (for pronunciation learning) and technological purposes (for speech synthesis).

3. Audio Quality and Recording Conditions

A major challenge in building a speech corpus is maintaining consistent audio quality.

Even small differences in recording conditions can affect the naturalness of synthesized speech.

Several factors influence audio quality:

microphone quality
recording environment
background noise
speaker distance from microphone
recording software settings

For speech synthesis systems, it is especially important that recordings share the same technical parameters.

4. Audio Normalization

During the recording process, it quickly became clear that audio files differed in volume, sampling rate, and other technical properties.

These differences can produce unnatural transitions when audio segments are combined during speech synthesis.

To address this problem, all audio files were normalized to a consistent format.

Typical normalization parameters include:


Sample Rate: 44100 Hz
Channels: Mono
Bit Depth: 16-bit PCM
Volume Level: normalized

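Normalization ensures that every audio file shares the same technical format and a comparable loudness level, so that segments taken from different recordings can be joined without audible jumps.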
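As a rough illustration, the target format listed above can be expressed as an FFmpeg invocation. The sketch below only builds the command line. The helper names `build_normalize_command` and `normalize_folder`, and the use of FFmpeg's `loudnorm` filter for level normalization, are assumptions made for illustration; the text does not state which tool or loudness method the project used.

```python
import subprocess
from pathlib import Path

TARGET_SAMPLE_RATE = 44100   # Hz, as listed in the parameters above
TARGET_CHANNELS = 1          # mono


def build_normalize_command(src, dst):
    """Return an FFmpeg argument list converting src to 44.1 kHz mono 16-bit PCM."""
    return [
        "ffmpeg", "-y",                  # overwrite dst if it already exists
        "-i", str(src),
        "-af", "loudnorm",               # ASSUMPTION: loudness method is not given in the text
        "-ar", str(TARGET_SAMPLE_RATE),  # output sample rate
        "-ac", str(TARGET_CHANNELS),     # output channel count
        "-c:a", "pcm_s16le",             # 16-bit little-endian PCM
        str(dst),
    ]


def normalize_folder(src_dir, dst_dir):
    """Normalize every .wav file in src_dir into dst_dir (requires FFmpeg on PATH)."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(build_normalize_command(src, dst_dir / src.name), check=True)
```

Writing to a separate output folder keeps the original recordings untouched, so the corpus can be re-normalized later with different settings.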

5. Segmenting Recordings into Diphones

Once recordings were normalized, the next step was to extract diphone segments.

A diphone represents the transition between two adjacent phonemes.

For example:


Word: দিশা

Phonemes:
d – i – ʃ – aː

Diphones:
#-d
d-i
i-ʃ
ʃ-aː
aː-#

Each diphone must correspond to a specific portion of the recorded waveform.

Segmenting audio accurately is a delicate process, because even small timing differences can affect the naturalness of synthesized speech.

6. Challenges of Automatic Segmentation

Automatic segmentation tools, such as forced aligners, can divide audio recordings into phonetic segments.

However, such tools are often trained on major languages and may not perform reliably on Bishnupriya Manipuri data.

Several challenges arise during segmentation:

variation in pronunciation speed
coarticulation between sounds
ambiguous phoneme boundaries
differences in recording amplitude

Because of these difficulties, manual inspection and correction are often necessary.

7. Building the Diphone Inventory

After segmentation, the extracted diphones are organized into a diphone inventory.

This inventory represents the set of phoneme transitions required to produce the sounds of the language.

For each diphone, the system stores:

diphone identifier
IPA representation
safe filename format
corresponding audio file

The completeness of the diphone inventory directly affects the quality and coverage of the speech synthesis system.

8. Audio Validation

To ensure reliability, each diphone recording must be validated.

Validation checks include:

correct diphone labeling
proper audio format
absence of clipping or distortion
correct alignment with phoneme boundaries

Automated validator tools can detect missing or inconsistent diphone files within the system.
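One of the checks listed above, the absence of clipping, can be approximated automatically. The following is a minimal sketch for 16-bit PCM files using only the Python standard library. The function names and the 0.1% threshold are illustrative assumptions, not the project's actual validator.

```python
import array
import sys
import wave

INT16_MAX = 32767
INT16_MIN = -32768


def clipped_fraction(samples):
    """Fraction of 16-bit samples that sit at the extreme representable values."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if s >= INT16_MAX or s <= INT16_MIN)
    return clipped / len(samples)


def read_int16_samples(path):
    """Read a 16-bit PCM WAV file into a list of integer samples."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        data = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        data.byteswap()  # WAV sample data is little-endian
    return list(data)


def looks_clipped(path, max_fraction=0.001):
    """Heuristic: flag a file when more than max_fraction of samples are at full scale."""
    return clipped_fraction(read_int16_samples(path)) > max_fraction
```

A check of this kind only flags candidates. Correct labeling and boundary alignment still need a listener or a waveform editor.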

9. Rebuilding the Speech Corpus

Speech systems are rarely completed in a single step.

As new recordings are added and segmentation improves, the diphone inventory may need to be rebuilt.

A typical rebuild workflow includes:


1. Record new audio
2. Normalize audio files
3. Segment recordings
4. Generate diphone files
5. Validate diphone coverage
6. Rebuild playback system

Through repeated refinement, the speech corpus gradually becomes more complete and more natural.

10. Toward a Sustainable Speech Corpus

The long-term goal of the project is to create a sustainable audio corpus that can support multiple linguistic applications.

These include:

speech synthesis
pronunciation teaching
phonological research
digital language preservation

By combining dictionary data with carefully recorded audio resources, the project establishes a foundation for future language technology development.

Building a speech corpus for a language with limited digital resources requires persistence and careful work. The recordings produced for this project represent not only technical data but also an important cultural record of the living sound of the Bishnupriya Manipuri language.

Chapter 7 — Designing the Bishnupriya Manipuri Diphone System

The creation of a diphone-based speech synthesis system requires a careful understanding of the phonological structure of a language. In the Bishnupriya Manipuri Dictionary and Language Science Project, the diphone system forms the core mechanism that enables automatic pronunciation playback.

Rather than recording every possible word in the language, the diphone approach records transitions between phonemes. These transitions can then be combined to synthesize the pronunciation of many different words.

1. Phoneme Inventory of the Language

The first step in designing a diphone system is identifying the phoneme inventory of the language.

Phonemes are the minimal sound units that distinguish meaning between words.

Based on phonological analysis, Bishnupriya Manipuri includes several categories of phonemes:

vowels
long vowels
nasal vowels
stops
fricatives
nasals
approximants

These phonemes form the basic building blocks from which diphones are constructed.

2. What Is a Diphone?

A diphone represents the transition between two adjacent phonemes.

Speech sounds are not isolated units. When a speaker moves from one phoneme to another, the acoustic signal changes gradually.

Diphones capture these transitions.

For example:


Phoneme sequence:
d – i – ʃ – aː

Diphone sequence:
#-d
d-i
i-ʃ
ʃ-aː
aː-#

The symbol "#" represents the beginning or end of a word.

By recording these transitions, a speech synthesis system can reconstruct the pronunciation of many words.

3. Determining the Required Diphones

A key question in building a diphone system is determining how many diphones are required to cover the phonological structure of the language.

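In theory, if a language contains N phonemes, the number of possible phoneme-to-phoneme diphones is:


N × N

Counting the word-boundary symbol "#" adds up to 2N boundary diphones (one word-initial and one word-final diphone per phoneme). However, many of these combinations do not occur in actual words.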

Therefore, the diphone inventory is usually constructed by analyzing real lexical data from a dictionary or corpus.

In the present project, dictionary entries serve as the primary source for generating diphone combinations.

4. Extracting Diphones from Dictionary Words

After dictionary words are converted into IPA, the phoneme sequence of each word can be analyzed.

From this sequence, the system automatically generates diphone pairs.

For example:


Word: দিশা

IPA: diʃaː

Phonemes:
d i ʃ aː

Generated diphones:
#-d
d-i
i-ʃ
ʃ-aː
aː-#

Repeating this process for thousands of dictionary words produces a large inventory of diphone transitions.
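The pairing step can be sketched in a few lines. This is an illustrative version under the conventions shown above ("#" as the boundary symbol, "-" as the separator); the function names `to_diphones` and `collect_inventory` are hypothetical and not taken from the project code.

```python
BOUNDARY = "#"


def to_diphones(phonemes):
    """Return the diphone labels for one word, including both boundary diphones."""
    if not phonemes:
        return []
    padded = [BOUNDARY, *phonemes, BOUNDARY]
    # Pair every element with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def collect_inventory(words):
    """Union of diphones over an iterable of phoneme sequences."""
    inventory = set()
    for phonemes in words:
        inventory.update(to_diphones(phonemes))
    return inventory
```

Calling `to_diphones(["d", "i", "ʃ", "aː"])` yields the five diphones listed in the example, and `collect_inventory` applied to every word in the dictionary gives the set of transitions that actually occur.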

5. Safe Filename System

IPA symbols contain special characters that are not always suitable for file naming.

To ensure compatibility with web servers and operating systems, the project introduces a safe filename mapping.

In this system, each IPA symbol is converted into a standardized ASCII representation.

For example:


IPA diphone: ʃ-aː

Safe filename: sh_aa.wav

This mapping ensures that diphone audio files can be stored and accessed reliably within the speech synthesis system.

6. Diphone Coverage Analysis

A major challenge in diphone-based systems is ensuring that all required diphones have corresponding audio recordings.

To address this issue, the project includes a diphone coverage analyzer.

This tool compares the diphones generated from dictionary words with the diphone audio files available in the system.

The analyzer identifies:

missing diphones
unused diphones
coverage percentage

This information helps guide the recording of additional audio segments.
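The comparison itself reduces to set arithmetic. The sketch below assumes that the required diphones and the available recordings are both given as collections of diphone labels; `CoverageReport` and `analyze_coverage` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class CoverageReport:
    missing: set             # required by the dictionary but not recorded
    unused: set              # recorded but not required by any dictionary word
    coverage_percent: float  # share of required diphones that are recorded


def analyze_coverage(required, available):
    """Compare required diphones with recorded ones."""
    required, available = set(required), set(available)
    covered = required & available
    # An empty requirement set is treated as fully covered.
    percent = 100.0 * len(covered) / len(required) if required else 100.0
    return CoverageReport(
        missing=required - available,
        unused=available - required,
        coverage_percent=percent,
    )
```

Note that coverage is computed against the required set only, so unused recordings do not inflate the percentage.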

7. Validator Workflow

Once diphone audio files are recorded, they must be validated before they can be used by the speech system.

The validator checks several properties:

correct diphone labeling
consistent filename format
audio sample rate
bit depth
channel configuration

Ensuring technical consistency prevents playback errors and improves the quality of synthesized speech.
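The three format properties can be read directly from a WAV header. The following is a minimal sketch using Python's standard `wave` module; the expected values come from the audio format standard used in this volume, while the function name and the shape of the report are assumptions.

```python
import wave

# 44100 Hz, 16-bit (2 bytes per sample), mono
EXPECTED = {"sample_rate": 44100, "sample_width_bytes": 2, "channels": 1}


def check_wav_format(path, expected=EXPECTED):
    """Return a list of readable problems; an empty list means the file passes."""
    try:
        with wave.open(str(path), "rb") as wav:
            actual = {
                "sample_rate": wav.getframerate(),
                "sample_width_bytes": wav.getsampwidth(),
                "channels": wav.getnchannels(),
            }
    except (wave.Error, EOFError, OSError) as exc:
        return [f"cannot read WAV file: {exc}"]
    problems = []
    for key, want in expected.items():
        if actual[key] != want:
            problems.append(f"{key}: expected {want}, found {actual[key]}")
    return problems
```

Returning a list of problems rather than a single true or false value lets a batch validator report every deviation for a file in one pass.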

8. Integration with the Dictionary

The diphone system is closely integrated with the digital dictionary.

When a user clicks the audio button for a dictionary entry, the system performs the following steps:


1. Convert word to IPA
2. Extract phoneme sequence
3. Generate diphone sequence
4. Locate corresponding audio files
5. Play diphone audio in sequence

Through this process, the dictionary becomes an interactive pronunciation system.

9. Advantages of the Diphone Approach

The diphone method offers several advantages for languages with limited digital resources:

requires fewer recordings than word-based systems
supports large vocabulary coverage
works with rule-based pronunciation systems
can be implemented with relatively simple software

For the Bishnupriya Manipuri language, this approach provides a practical path toward speech technology development.

The design of the Bishnupriya Manipuri diphone system illustrates how traditional linguistic analysis can be combined with computational methods to create new tools for language preservation.

By linking dictionary data, phonological analysis, and audio recordings, the project establishes a foundation for speech technology in the Bishnupriya Manipuri language.

Chapter 8 — Validator and Rebuild Workflow

Developing a speech synthesis system involves many interconnected components. In the Bishnupriya Manipuri Dictionary and Language Science Project, several tools work together to convert dictionary entries into spoken audio.

Because these components depend on one another, even a small inconsistency can cause the system to fail.

For this reason, the project includes a validator and rebuild workflow designed to detect errors, repair inconsistencies, and maintain synchronization between the dictionary, phonological conversion rules, and diphone audio files.

1. The Need for Validation

A diphone-based speech system depends on the availability of correctly labeled audio files.

If a required diphone is missing, the speech system cannot produce the correct pronunciation.

Common problems include:

missing diphone audio files
incorrect filename mappings
inconsistent IPA conversion rules
mismatched diphone generation algorithms

Without systematic validation, these issues can accumulate and make the speech system unreliable.

2. Diphone Validator Tool

To detect such problems, the project includes a diphone validator tool.

The validator analyzes the diphone sequence generated from a dictionary word and compares it with the diphone audio files available in the system.

The validator can report:

missing diphone files
extra or unused diphones
incorrect filename formats
coverage statistics

This information helps identify exactly which audio segments must be recorded or corrected.

3. Coverage Analysis

One of the most useful outputs of the validator is diphone coverage analysis.

Coverage measures the percentage of diphone transitions required by the dictionary that are already available as audio recordings.

For example:


Total diphones required: 520

Diphones recorded: 468

Coverage: 90%

Missing diphones: 52

Coverage analysis helps prioritize which diphones must be recorded next.

4. Synchronization Problems

During development, a major challenge was ensuring that all components of the system used the same conversion rules.

Several pages within the system performed similar tasks, including:

IPA conversion
phoneme extraction
diphone generation
safe filename mapping

If these components used slightly different rules, the diphone sequences generated on one page could differ from those generated on another page.

Such inconsistencies often produced missing diphone errors even when the audio files existed.

5. Unifying Conversion Rules

To resolve synchronization problems, the project introduced a unified conversion module.

This module performs several tasks:

BPM orthography to IPA conversion
IPA to phoneme tokenization
diphone generation
safe filename mapping

All pages in the system now rely on this shared module.

This ensures that every component generates identical diphone sequences for the same word.
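The tokenization step is a typical place where two pages can drift apart, because symbols such as aː or kʰ span more than one character. A shared tokenizer can use greedy longest-match against the phoneme inventory, as sketched below. The `PHONEMES` list is an illustrative subset and `tokenize_ipa` is a hypothetical name; neither is taken from the project's module.

```python
import unicodedata

# Illustrative subset only; a real table would list the full inventory.
PHONEMES = [
    "aː", "iː", "uː", "eː", "oː", "i", "u", "e", "o", "ɔ",
    "kʰ", "gʱ", "dʒ", "t̪ʰ", "t̪", "d̪", "k", "g", "d", "ʃ", "s", "h",
    "m", "n", "ŋ", "r", "l",
]


def tokenize_ipa(ipa, phonemes=PHONEMES):
    """Split an IPA string into phonemes, always preferring the longest match."""
    ipa = unicodedata.normalize("NFC", ipa)
    by_length = sorted(
        (unicodedata.normalize("NFC", p) for p in phonemes), key=len, reverse=True
    )
    tokens, pos = [], 0
    while pos < len(ipa):
        for symbol in by_length:
            if ipa.startswith(symbol, pos):
                tokens.append(symbol)
                pos += len(symbol)
                break
        else:
            # No inventory symbol matches here: fail loudly instead of guessing.
            raise ValueError(f"unknown symbol at position {pos}: {ipa[pos]!r}")
    return tokens
```

Raising an error on an unknown symbol, rather than skipping it, makes rule mismatches visible during validation instead of at playback time.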

6. Rebuild Workflow

When diphone recordings are updated or conversion rules change, the diphone system must be rebuilt.

The rebuild workflow typically follows these steps:


1. Update dictionary entries
2. Generate IPA pronunciation
3. Extract phoneme sequences
4. Generate diphone sequences
5. Compare diphones with audio files
6. Identify missing diphones
7. Record or generate missing segments
8. Re-run validation
9. Deploy updated diphone inventory

This structured process ensures that the speech system remains consistent and reliable.

7. Automating the Workflow

To simplify maintenance, several automation tools were developed for the project.

These tools can:

analyze dictionary entries in batch mode
generate diphone inventories automatically
detect missing audio files
produce reports for recording sessions

Automation greatly reduces the manual effort required to maintain the speech system.
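A report for a recording session might list each missing diphone together with a few dictionary words that contain it, ordered by how many words depend on it. The sketch below assumes the lexicon is available as a mapping from headword to phoneme list; the function names and the report layout are illustrative assumptions.

```python
from collections import defaultdict


def diphones_of(phonemes):
    """Diphone labels for one word, with '#' marking the word boundaries."""
    padded = ["#", *phonemes, "#"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def recording_report(lexicon, available, max_examples=3):
    """Return (diphone, word_count, example_words) rows, most needed first."""
    needed_by = defaultdict(list)
    for word, phonemes in lexicon.items():
        for diphone in set(diphones_of(phonemes)):  # count each word once per diphone
            if diphone not in available:
                needed_by[diphone].append(word)
    rows = [
        (diphone, len(words), sorted(words)[:max_examples])
        for diphone, words in needed_by.items()
    ]
    # Highest word count first; ties broken alphabetically by diphone label.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```

Sorting by word count means that the first recordings made in a session unlock playback for the largest number of dictionary entries.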

8. Importance for Future Development

The validator and rebuild workflow is essential for maintaining a sustainable speech system.

Without such tools, the system could easily become inconsistent as new words and recordings are added.

By integrating validation and rebuild procedures into the development process, the project ensures that the Bishnupriya Manipuri speech system remains scalable and maintainable.

The validator workflow demonstrates an important principle of language technology: successful systems depend not only on linguistic analysis but also on robust engineering practices.

Through systematic validation and rebuild procedures, the project transforms experimental tools into a reliable linguistic infrastructure for the Bishnupriya Manipuri language.

Chapter 9 — The Digital Bishnupriya Manipuri Dictionary Platform

After the lexical corpus, phonological rules, and diphone system were established, the next step of the project was to build a digital platform that could make the dictionary accessible to users.

The Bishnupriya Manipuri digital dictionary is implemented as a web-based system using PHP and a MySQL database. This platform integrates lexical data, pronunciation modeling, and speech synthesis into a single interactive interface.

Through this system, users can search for words, view lexical information, and listen to automatically generated pronunciations.

1. Database Architecture

At the core of the digital dictionary is a structured database.

Each dictionary entry is stored as a row in a database table. The table typically contains fields such as:

word (Bishnupriya Manipuri spelling)
meaning
part of speech
source dictionary
IPA pronunciation
additional notes

This database structure allows the dictionary to support fast searching and flexible data analysis.

Because the database stores entries in a structured format, it also allows integration with computational tools such as pronunciation converters and diphone generators.
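To make the field list concrete, the sketch below creates a comparable table. It uses Python's built-in `sqlite3` module as a stand-in so that the example is self-contained, whereas the platform itself runs on MySQL. The table name, column names, and helper functions are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id             INTEGER PRIMARY KEY,
    word           TEXT NOT NULL,   -- Bishnupriya Manipuri spelling
    meaning        TEXT,
    part_of_speech TEXT,
    source         TEXT,            -- source dictionary
    ipa            TEXT,            -- IPA pronunciation
    notes          TEXT
);
CREATE INDEX IF NOT EXISTS idx_entries_word ON entries (word);
"""


def open_dictionary(path=":memory:"):
    """Open (or create) a dictionary database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn, word, meaning, part_of_speech=None, source=None, ipa=None, notes=None):
    """Insert one entry and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO entries (word, meaning, part_of_speech, source, ipa, notes) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (word, meaning, part_of_speech, source, ipa, notes),
        )
    return cur.lastrowid


def find_exact(conn, word):
    """Return (word, meaning, ipa) rows whose spelling matches exactly."""
    return conn.execute(
        "SELECT word, meaning, ipa FROM entries WHERE word = ?", (word,)
    ).fetchall()
```

The `word` column is deliberately not declared unique in this sketch, since the same spelling may be recorded from more than one source dictionary.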

2. Search System

The digital dictionary includes a search system that allows users to locate words quickly.

When a user enters a query, the system searches the database for matching entries.

Search results may include:

exact matches
partial matches
related lexical entries

Efficient search functionality is essential for making the dictionary usable as a practical reference tool.

3. Word Detail Pages

When a user selects a word from the search results, the system displays a detailed word page.

The word page typically includes:

the headword
meaning or definition
part of speech
IPA pronunciation
links to related entries
audio playback controls

These pages serve as the main interface between the dictionary database and the user.

4. Pronunciation Playback

One of the most distinctive features of the digital dictionary is its ability to generate pronunciation audio automatically.

When the user clicks the audio button on a word page, the system performs several steps:


1. Retrieve the word from the database
2. Convert the word to IPA
3. Extract phoneme sequence
4. Generate diphone sequence
5. Load diphone audio files
6. Play the diphone sequence

This process allows the dictionary to function not only as a textual reference but also as a pronunciation learning tool.
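The last two steps can be illustrated by joining the diphone files end to end into a single WAV file. This is a sketch under two assumptions: that joining happens as a batch step rather than in the browser, and that plain concatenation without smoothing at the joins is acceptable. The text does not say how the platform actually plays the sequence, and `concatenate_wavs` is a hypothetical helper.

```python
import wave


def concatenate_wavs(paths, out_path):
    """Join WAV files end to end; all inputs must share one audio format."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"format mismatch in {path}: {fmt} != {params}")
            chunks.append(wav.readframes(wav.getnframes()))
    if params is None:
        raise ValueError("no input files")
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        out.writeframes(b"".join(chunks))
```

The format check shows why the normalization standard matters: files with different sample rates or channel counts cannot be joined sample by sample.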

5. Application Programming Interface (API)

To support modular development, the dictionary platform includes an application programming interface (API).

The API allows different components of the system to communicate with each other.

For example:

the web interface can request pronunciation data
analysis tools can retrieve phonological information
external applications can access dictionary entries

This architecture makes the system flexible and easier to expand in the future.

6. Integration with the TTS Engine

The digital dictionary platform is closely integrated with the diphone-based text-to-speech system.

The TTS engine receives input from the dictionary and converts it into a sequence of diphone audio files.

This integration allows dictionary entries to be pronounced dynamically, even if the word has never been recorded as a complete audio file.

In this way, the dictionary becomes the central hub that connects lexical data, phonological analysis, and speech synthesis.

7. Web-Based Interface

The dictionary platform is implemented as a web application so that it can be accessed easily from different devices.

A web-based interface provides several advantages:

no installation required
cross-platform compatibility
easy updates and maintenance
accessibility for a global audience

Users can therefore access the dictionary from desktop computers, tablets, or mobile devices.

8. Continuous Improvement

The digital dictionary is not a static resource.

As new entries are added, and as pronunciation models improve, the system continues to evolve.

Because the dictionary is built on a flexible database and modular architecture, it can support future enhancements such as:

expanded lexical entries
improved pronunciation models
additional linguistic annotations
integration with language learning tools

These improvements will help ensure that the dictionary remains a valuable resource for both researchers and speakers of the language.

The digital dictionary platform represents the practical realization of the Bishnupriya Manipuri Dictionary and Language Science Project. By combining lexicography, phonological analysis, and speech synthesis, the platform demonstrates how traditional language scholarship can be integrated with modern digital technology.

Chapter 10 — The Future of Bishnupriya Manipuri Language Technology

The Bishnupriya Manipuri Dictionary and Language Science Project represents an important step toward the digital preservation and technological development of the language.

Through the creation of a structured dictionary, phonological conversion systems, and diphone-based speech synthesis, the project demonstrates that modern language technology can be developed even for languages with limited digital resources.

However, this work should be understood as the beginning of a much larger effort rather than its final stage.

1. Expanding the Dictionary

Although the digital dictionary already contains a large number of entries, the vocabulary of any living language continues to grow and evolve.

Future work may involve:

adding new lexical entries
including regional vocabulary
documenting idiomatic expressions
recording example sentences
improving semantic explanations

Over time, the dictionary may become a comprehensive lexical archive representing the full richness of the Bishnupriya Manipuri language.

2. Community Collaboration

A language cannot be preserved by technology alone. The long-term success of the dictionary project depends on collaboration within the language community.

Speakers, teachers, writers, and researchers can contribute to the project by providing:

new vocabulary
example usages
pronunciation recordings
linguistic observations

Through such collaboration, the dictionary can become a shared cultural resource maintained by the community itself.

3. Improving Speech Technology

The diphone-based speech system created for this project provides a functional foundation for pronunciation playback.

Future research may explore additional technologies, including:

more natural diphone recordings
unit-selection speech synthesis
neural speech synthesis methods
improved prosody modeling

Such developments could make the pronunciation system more natural and expressive.

4. Language Learning Applications

The digital dictionary and speech system also open possibilities for language education.

Future applications could include:

interactive vocabulary learning tools
pronunciation training systems
mobile dictionary applications
digital language courses

These tools would make it easier for younger generations to learn and maintain the language.

5. Linguistic Research Opportunities

The structured lexical database created for this project can support many forms of linguistic research.

Possible research areas include:

phonological analysis
lexical frequency studies
historical linguistics
comparative Indo-Aryan linguistics

By making lexical data accessible in digital form, the project provides valuable resources for future scholars.

6. Open Digital Resources

One important goal of the project is to make linguistic resources available to researchers and developers.

Open digital datasets may include:

dictionary word lists
phonological data
diphone inventories
audio recordings

Such resources can encourage further development of language technology tools for Bishnupriya Manipuri.

7. Preserving Minority Languages in the Digital Age

Many languages around the world face challenges in the digital era because technological tools are often designed primarily for widely spoken languages.

Projects like the Bishnupriya Manipuri Dictionary and Language Science Project demonstrate that smaller language communities can also develop digital infrastructure to preserve and promote their languages.

Digital dictionaries, speech synthesis systems, and linguistic databases can help ensure that these languages remain accessible to future generations.

8. A Living Language Archive

Ultimately, the dictionary project should be viewed as a living archive rather than a completed product.

As technology evolves and as new linguistic knowledge emerges, the dictionary and speech system can continue to grow and improve.

By combining linguistic scholarship, community participation, and digital technology, the Bishnupriya Manipuri language can continue to thrive in the modern world.

The Bishnupriya Manipuri Dictionary and Language Science Project illustrates how dedicated linguistic work can bridge the gap between traditional scholarship and modern technology.

Through ongoing collaboration and innovation, the project hopes to contribute to the preservation, understanding, and future development of the Bishnupriya Manipuri language.

Appendix A — BPM to IPA Conversion Rules

The pronunciation system of the digital Bishnupriya Manipuri dictionary uses a rule-based conversion model that transforms orthographic text into International Phonetic Alphabet (IPA) representations.

These rules approximate the phonological structure of the language based on common pronunciation patterns.

1. Vowel Mapping

BPM Letter	IPA	Example
অ	ɔ	অল
আ	aː	আজি
ই	i	ইমান
ঈ	iː	ঈশ্বর
উ	u	উঠ
ঊ	uː	ঊন
এ	e	এজন
ঐ	oi	ঐতিহাসিক
ও	o	ওজন
ঔ	ou	ঔষধ

2. Consonant Mapping

BPM	IPA
ক	k
খ	kʰ
গ	g
ঘ	gʱ
চ	c
ছ	cʰ
জ	dʒ
ঝ	dʒʱ
ট	ʈ
ঠ	ʈʰ
ড	ɖ
ঢ	ɖʱ
ত	t̪
থ	t̪ʰ
দ	d̪
ধ	d̪ʱ
প	p
ফ	pʰ
ব	b
ভ	bʱ
ম	m
ন	n
র	r
ল	l
স	s
হ	h

3. Special Rules

Schwa deletion: the inherent vowel is dropped after many word-final consonants
Nasal assimilation before velar consonants
Long vowels preserved in lexical roots
Consonant clusters simplified in IPA representation
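A heavily simplified sketch of how such tables might drive a converter is given below. Several parts are assumptions and not the project's rules: the vowel-sign table (each dependent sign is assumed to carry the value of its independent vowel), the treatment of hasanta as suppressing the inherent vowel, and the blanket deletion of a word-final inherent vowel. Nasal assimilation and cluster simplification are not modeled.

```python
import unicodedata

INDEPENDENT_VOWELS = {"অ": "ɔ", "আ": "aː", "ই": "i", "ঈ": "iː", "উ": "u",
                      "ঊ": "uː", "এ": "e", "ঐ": "oi", "ও": "o", "ঔ": "ou"}
# ASSUMPTION: dependent vowel signs mirror the independent vowels above.
VOWEL_SIGNS = {"া": "aː", "ি": "i", "ী": "iː", "ু": "u", "ূ": "uː",
               "ে": "e", "ৈ": "oi", "ো": "o", "ৌ": "ou"}
CONSONANTS = {"ক": "k", "খ": "kʰ", "গ": "g", "ঘ": "gʱ", "চ": "c", "ছ": "cʰ",
              "জ": "dʒ", "ঝ": "dʒʱ", "ট": "ʈ", "ঠ": "ʈʰ", "ড": "ɖ", "ঢ": "ɖʱ",
              "ত": "t̪", "থ": "t̪ʰ", "দ": "d̪", "ধ": "d̪ʱ", "প": "p", "ফ": "pʰ",
              "ব": "b", "ভ": "bʱ", "ম": "m", "ন": "n", "র": "r", "ল": "l",
              "স": "s", "হ": "h"}
HASANTA = "্"      # marks a consonant with no following vowel
INHERENT = "ɔ"     # value of the inherent vowel, taken from the vowel table


def to_ipa_tokens(word):
    """Convert a word to a list of IPA symbols using the simplified rules above."""
    word = unicodedata.normalize("NFC", word)
    tokens, i = [], 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else None
        if ch in INDEPENDENT_VOWELS:
            tokens.append(INDEPENDENT_VOWELS[ch])
        elif ch in CONSONANTS:
            tokens.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                tokens.append(VOWEL_SIGNS[nxt])
                i += 1                      # consume the vowel sign
            elif nxt == HASANTA:
                i += 1                      # consume hasanta, no vowel
            elif nxt is not None:
                tokens.append(INHERENT)     # word-internal inherent vowel
            # Word-final consonant: inherent vowel dropped (simplification).
        else:
            raise ValueError(f"no rule for character {ch!r}")
        i += 1
    return tokens
```

With these tables, the example word আজি from the vowel table comes out as aː, dʒ, i. Letters outside the tables raise an error, which helps reveal gaps in the rule set.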

Appendix B — Phoneme Inventory

The phoneme inventory of Bishnupriya Manipuri consists of vowel and consonant phonemes that combine to form the phonological system of the language.

1. Vowel Phonemes

Short Vowels	Long Vowels
i	iː
u	uː
e	eː
o	oː
ɔ	aː

2. Consonant Phonemes

Type	Phonemes
Stops	p b t̪ d̪ ʈ ɖ k g
Aspirated Stops	pʰ bʱ t̪ʰ d̪ʱ ʈʰ ɖʱ kʰ gʱ
Affricates	c cʰ dʒ dʒʱ
Fricatives	s ʃ h
Nasals	m n ŋ
Liquids	r l
Glides	w j

Appendix C — Master Diphone Inventory

The diphone inventory represents all phoneme transitions required for the speech synthesis system.

Each diphone captures the acoustic transition between two phonemes.

Example Diphones

Diphone	Description
#-d	Word beginning before /d/
d-i	Transition from /d/ to /i/
i-ʃ	Transition from /i/ to /ʃ/
ʃ-aː	Transition from /ʃ/ to /aː/
aː-#	Word ending after /aː/

Structure of a Diphone


phoneme1 + phoneme2

Examples:

k-a
t̪-o
b-i
m-aː

Boundary Diphones

Special diphones represent word boundaries.

Diphone	Meaning
#-C	Word start
V-#	Word end

Appendix D — Recording Protocol for the Bishnupriya Manipuri Speech Corpus

Consistent and high-quality recordings are essential for building a diphone-based speech synthesis system. This appendix describes the recording protocol used in the Bishnupriya Manipuri Dictionary and Language Science Project.

1. Recording Environment

Quiet room with minimal background noise
Soft surfaces (curtains, carpets) to reduce echo
Microphone placed approximately 15–20 cm from the speaker
Consistent recording position across sessions

2. Recommended Recording Equipment

Component	Recommendation
Microphone	Condenser microphone preferred
Audio Interface	USB audio interface or quality sound card
Recording Software	Audacity or equivalent
Pop Filter	Recommended to reduce plosive noise

3. Audio Format Standard

Sample Rate: 44100 Hz
Channels: Mono
Bit Depth: 16-bit PCM
File Format: WAV

All recordings must follow the same technical specifications to ensure compatibility within the diphone synthesis system.

4. Recording Procedure

1. Open the recording software.
2. Set the sample rate and recording format.
3. Speak the word clearly and at a moderate pace.
4. Leave a short silence before and after the word.
5. Save the file using the dictionary word as the filename.

5. Post-Processing

After recording, the audio files should be normalized and cleaned to maintain consistency across the corpus.

Typical processing steps:

trim leading and trailing silence
normalize volume level
verify sample rate and bit depth
remove background noise if necessary
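Silence trimming, the first step in the list, can be sketched on a plain list of 16-bit sample values. The amplitude threshold and the optional padding are illustrative assumptions; suitable values depend on the noise floor of the recordings.

```python
def trim_silence(samples, threshold=500, padding=0):
    """Drop leading and trailing samples whose absolute value is below threshold.

    padding keeps that many extra samples on each side, so that a short
    stretch of silence can be preserved around the word.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # nothing above the threshold: the whole clip is silence
    start = max(loud[0] - padding, 0)
    end = min(loud[-1] + 1 + padding, len(samples))
    return samples[start:end]
```

Only the edges are trimmed. Quiet samples between the first and last loud sample are kept, so pauses inside a word are not removed.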

Appendix E — Safe Filename Mapping System

The International Phonetic Alphabet (IPA) contains many characters that are not always suitable for filenames on web servers or in operating systems.

To ensure reliable storage and retrieval of diphone audio files, the project uses a standardized ASCII-based filename mapping.

1. Purpose of Safe Filenames

avoid unsupported Unicode characters
ensure compatibility with web servers
simplify file management
enable predictable diphone file naming

2. Example Mapping

IPA Symbol	Safe ASCII Form
ʃ	sh
ʈ	t.
ɖ	d.
ŋ	ng
aː	aa
iː	ii
uː	uu

3. Diphone Filename Examples

IPA Diphone	Safe Filename
#-d	start_d.wav
d-i	d_i.wav
i-ʃ	i_sh.wav
ʃ-aː	sh_aa.wav
aː-#	aa_end.wav

This mapping system allows diphone audio files to be referenced consistently by both the server and the text-to-speech engine.
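The two tables above can be combined into a small lookup function. The sketch below covers only the symbols listed in the example mapping and passes plain ASCII phonemes through unchanged; the pass-through behavior and the function names are assumptions made for illustration.

```python
# Symbols taken from the example mapping table above (not a complete list).
SAFE_SYMBOLS = {"ʃ": "sh", "ʈ": "t.", "ɖ": "d.", "ŋ": "ng",
                "aː": "aa", "iː": "ii", "uː": "uu"}


def safe_symbol(phoneme):
    """Return the ASCII form of one phoneme."""
    if phoneme in SAFE_SYMBOLS:
        return SAFE_SYMBOLS[phoneme]
    if phoneme.isascii():
        return phoneme  # ASSUMPTION: ASCII phonemes such as d or i are kept as-is
    raise KeyError(f"no safe mapping defined for {phoneme!r}")


def diphone_filename(left, right, ext=".wav"):
    """Build the safe filename for the diphone left-right ('#' marks a boundary)."""
    first = "start" if left == "#" else safe_symbol(left)
    second = "end" if right == "#" else safe_symbol(right)
    return f"{first}_{second}{ext}"
```

For instance, `diphone_filename("ʃ", "aː")` returns `sh_aa.wav`, matching the table. A phoneme with no mapping raises an error instead of silently producing a filename that no recording could match.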

Appendix F — System Architecture of the Bishnupriya Manipuri Dictionary and TTS Platform

The Bishnupriya Manipuri language platform integrates lexicography, phonological processing, and speech synthesis within a web-based architecture.

1. Core Components

The system consists of several interconnected modules:

Dictionary database
BPM → IPA conversion engine
Phoneme and diphone generator
Diphone audio inventory
Validator and analysis tools
Web interface

2. Data Flow


User Search
      ↓
Dictionary Database
      ↓
Word Detail Page
      ↓
BPM → IPA Converter
      ↓
Phoneme Extraction
      ↓
Diphone Generation
      ↓
Load Diphone Audio Files
      ↓
Speech Playback

3. System Modules

Module	Function
Dictionary Database	Stores lexical entries
IPA Converter	Generates pronunciation
Diphone Engine	Builds diphone sequences
Diphone Audio Library	Contains recorded diphone sounds
Validator	Detects missing or inconsistent diphones
Web Interface	Provides search and playback functionality

4. Advantages of the Architecture

modular design
scalable dictionary expansion
automatic pronunciation generation
web-based accessibility

This architecture allows the dictionary platform to function simultaneously as a lexical reference, a phonological analysis tool, and a speech synthesis system.

Glossary of Linguistic Terms

This glossary defines selected linguistic and technical terms used throughout the volume Bishnupriya Manipuri Dictionary and Language Science Project.

A

Affricate — A consonant that begins as a stop and releases as a fricative.

Allophone — A phonetic variant of the same phoneme that does not change meaning.

B

Boundary Marker — A symbol representing the beginning or end of a word in phonological or diphone notation.

C

Cluster — A sequence of two or more consonants occurring together.

Coarticulation — The influence of neighboring sounds on one another during speech production.

Corpus — A structured collection of linguistic data used for analysis.

D

Digitization — The process of converting printed or analog materials into digital form.

Diphone — A speech unit representing the transition between two adjacent phonemes.

G

Grapheme — A written symbol representing part of the orthographic system of a language.

I

IPA (International Phonetic Alphabet) — A standardized notation system for representing speech sounds.

L

Lexicography — The practice and study of dictionary compilation.

Lexical Entry — A dictionary record representing a word or expression.

N

Nasal — A sound produced with airflow through the nose.

Normalization — The process of making data or audio consistent and standardized.

O

Orthography — The writing system and spelling conventions of a language.

P

Phoneme — The smallest contrastive sound unit in a language.

Phonetics — The study of speech sounds as physical and perceptual phenomena.

Phonology — The study of the sound system of a language.

Prosody — Features of speech such as rhythm, stress, and intonation.

R

Romanization — Representation of a non-Latin script through the Latin alphabet.

S

Safe Filename Mapping — A system for converting IPA or other linguistic symbols into filesystem-safe names.

Schwa — A reduced vowel commonly represented as /ə/ in IPA.

Schwa Deletion — The omission of an expected inherent or reduced vowel in pronunciation.

Segmentation — The division of text or audio into smaller units such as phonemes or diphones.

T

Text-to-Speech (TTS) — Technology that converts written text into spoken output.

Tokenization — The division of a string into linguistically meaningful units.

U

Unicode Normalization — Standardization of digital character representation for reliable processing.

V

Validator — A tool that checks whether expected data, rules, or files are present and consistent.

Bibliography / References

The following references provide linguistic, lexicographic, and technological context for the present volume.

International Phonetic Association. (1999). Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing (2nd ed.). Prentice Hall.
Ladefoged, P., & Johnson, K. (2015). A Course in Phonetics (7th ed.). Cengage Learning.
Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.
The Unicode Consortium. (2024). The Unicode Standard (Version 16.0). https://www.unicode.org
Boersma, P., & Weenink, D. (2024). Praat: Doing Phonetics by Computer [Computer software]. https://www.praat.org
FFmpeg Developers. (2023). FFmpeg [Computer software]. https://ffmpeg.org
সিংহ, ডি. এল., সিংহ, এস., ও সিংহ, এ. (২০২৩). বিষ্ণুপ্রিয়া মণিপুরী জাতীয় অভিধান [Bishnupriya Manipuri National Dictionary]. নিখিল বিষ্ণুপ্রিয়া মণিপুরী সাহিত্য পরিষদ.
সিংহ, এম. (২০২২). বিষ্ণুপ্রিয়া মণিপুরী ব্যাকরণ [Bishnupriya Manipuri Grammar]. নিখিল বিষ্ণুপ্রিয়া মণিপুরী সাহিত্য পরিষদ.
Sinha, K. P. (1986). An Etymological Dictionary of Bishnupriya Manipuri. Punthi Pustak.
Sinha, K. P. (2021). Bishnupriya Manipuri-English Dictionary. Bishnupriya Manipuri Sahitya Sabha.
Singha, U. K. (2026). Bishnupriya Manipuri Dictionary and Language Science Project [Internal research materials]. Lexical database, pronunciation rules, and system documentation.

Index of Technical Terms

This index lists selected technical terms and the chapters or appendices in which they are discussed substantially.

Affricates — Chapters 5, 7; Appendix B
Audio corpus — Chapters 6, 8; Appendix D
Boundary diphones — Chapters 5, 7, 8; Appendix C
Codebase guide — Toolkit pages; Appendix F
Data collection — Chapters 4, 6
Dictionary database — Chapters 4, 5, 9
Digitization — Chapters 1, 4
Diphone inventory — Chapters 5, 7, 8; Appendix C
Diphone synthesis — Chapters 5, 7, 9
Glossary — Back matter
IPA conversion — Chapters 5, 7; Appendix A
Language preservation — Chapters 1, 10
Lexicography — Chapters 1, 2, 4, 9
Normalization — Chapters 6, 8; Appendix D
Orthography — Chapters 2, 3, 5; Appendix A
Phoneme inventory — Chapters 5, 7; Appendix B
Phonological analysis — Chapters 5, 7, 9
Recording protocol — Chapter 6; Appendix D
Safe filename mapping — Chapters 7, 8; Appendix E
Schwa deletion — Chapters 3, 5; Appendix A
Search system — Chapter 9
Speech synthesis — Chapters 5–10; Appendix F
Spelling schools — Chapters 2, 3
System architecture — Chapters 5, 9; Appendix F
Text-to-speech (TTS) — Chapters 5, 7, 8, 9, 10; Appendix F
Unicode normalization — Chapters 4, 5; Appendix A
Validation — Chapters 7, 8; Appendix E
Word detail page — Chapter 9

List of Figures and Tables

The following list summarizes the principal figures, flow diagrams, and tables used throughout the volume.

Figures

Dictionary to language technology pipeline — Chapter 5
Word recording and diphone extraction workflow — Chapter 6
Phoneme to diphone conversion model — Chapter 7
Validator and rebuild workflow — Chapter 8
Digital dictionary platform architecture — Chapter 9
Future development path for Bishnupriya Manipuri language technology — Chapter 10
System architecture diagram — Appendix F

Tables

Primary dictionary sources — Chapter 4
Audio normalization parameters — Chapter 6
Diphone inventory examples — Chapter 7
Validator coverage examples — Chapter 8
Dictionary platform components — Chapter 9
Future development areas — Chapter 10
Vowel mapping table — Appendix A
Consonant mapping table — Appendix A
Phoneme inventory table — Appendix B
Master diphone inventory table — Appendix C
Recording equipment and technical standards — Appendix D
Safe filename conversion table — Appendix E
System module table — Appendix F

Production note. This page is the combined academic volume for the Bishnupriya Manipuri Dictionary and Language Science Project. It is optimized for browser reading and print-to-PDF export using the shared print stylesheet.

Bishnupriya Manipuri Dictionary and Language Science Project

Preface

Table of Contents

Chapter 1 — Introduction to the Dictionary Project

Introduction

Source Dictionaries

Orthographic Diversity

Data Collection and Ongoing Work

Language Science Goals

A Living Dictionary

Chapter 2 — History of Bishnupriya Manipuri Dictionaries

Early Lexical Documentation

Dr. K. P. Sinha’s Bishnupriya Manipuri–English Dictionary

L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri Dictionary

Two Spelling Schools

The Challenge of Digitizing Printed Dictionaries

An Ongoing Process

Importance for Language Preservation

Chapter 3 — Orthographic Variation and Spelling Schools

Origins of Orthographic Diversity

The Two Major Spelling Schools

Examples of Spelling Variation

Why the Digital Dictionary Preserves Both Traditions

Role of Computational Analysis

Implications for Dictionary Design

Future Development of Orthographic Standards

A Descriptive Approach to Language Documentation

Chapter 4 — Data Collection and Lexical Corpus Building

Primary Dictionary Sources

Digitizing Printed Dictionaries

OCR Challenges

Manual Correction and Verification

Identifying Duplicate Entries

Building the Lexical Database

A Year of Continuous Work

Toward a Digital Language Corpus

Chapter 5 — From Dictionary to Language Technology

1. Dictionary as a Linguistic Database

2. Generating Pronunciation (BPM → IPA)

3. From IPA to Phonemes

4. Building Diphones

5. Dictionary-Driven Speech Synthesis

6. Why Dictionaries Matter for Speech Technology

7. Toward Integrated Language Infrastructure

Chapter 6 — Recording the Language: Building the Audio Corpus

1. Why an Audio Corpus is Necessary

2. Recording Dictionary Words

3. Audio Quality and Recording Conditions

4. Audio Normalization

5. Segmenting Recordings into Diphones

6. Challenges of Automatic Segmentation

7. Building the Diphone Inventory

8. Audio Validation

9. Rebuilding the Speech Corpus

10. Toward a Sustainable Speech Corpus

Chapter 7 — Designing the Bishnupriya Manipuri Diphone System

1. Phoneme Inventory of the Language

2. What is a Diphone?

3. Determining the Required Diphones

4. Extracting Diphones from Dictionary Words

5. Safe Filename System

6. Diphone Coverage Analysis

7. Validator Workflow

8. Integration with the Dictionary

9. Advantages of the Diphone Approach

Chapter 8 — Validator and Rebuild Workflow

1. The Need for Validation

2. Diphone Validator Tool

3. Coverage Analysis

4. Synchronization Problems

5. Unifying Conversion Rules

6. Rebuild Workflow

7. Automating the Workflow

8. Importance for Future Development

Chapter 9 — The Digital Bishnupriya Manipuri Dictionary Platform

1. Database Architecture

2. Search System

3. Word Detail Pages

4. Pronunciation Playback

5. Application Programming Interface (API)