Chapter 4 — Data Collection and Lexical Corpus Building

Bishnupriya Manipuri Dictionary and Language Science Project


The foundation of any dictionary project is the collection of lexical data. For the Bishnupriya Manipuri Dictionary and Language Science Project, data collection has been one of the most challenging and time-consuming stages.

Unlike languages that already possess large digital corpora, most Bishnupriya Manipuri lexical material exists only in printed books, handwritten notes, or scattered personal collections. As a result, the first step of the project involved locating and gathering reliable lexical sources.

The goal was not merely to digitize one dictionary, but to create a unified lexical corpus that combines multiple sources while preserving their linguistic differences.

Primary Dictionary Sources

The digital dictionary project draws primarily on three major lexical sources:

Dr. K. P. Sinha’s Bishnupriya Manipuri–English dictionary, which provides bilingual lexical explanations.
L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri dictionary, which provides definitions within the language itself.
Additional lexical material collected from community usage, literary texts, and personal word lists.

Each of these sources contributes valuable information, but each also follows its own orthographic tradition and organizational structure.

Therefore, the process of building a unified digital corpus requires careful comparison and normalization of entries.

Digitizing Printed Dictionaries

The transition from printed dictionaries to a digital database is not a straightforward process.

Printed dictionaries often contain complex page layouts, including columns, abbreviations, cross-references, and specialized formatting. These features can make automated digitization difficult.

To convert the printed dictionaries into machine-readable form, several steps were required:

high-resolution scanning of printed pages
optical character recognition (OCR)
manual verification of OCR output
reconstruction of entry structures

Although OCR technology can accelerate digitization, it often introduces errors when processing Indic scripts or complex typographic layouts.
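
As a rough illustration of the entry-reconstruction step, the sketch below turns verified OCR lines back into structured entries. The line layout it expects ("headword, part-of-speech abbreviation, definition") is a hypothetical one chosen for the example, not the layout of either printed dictionary, and the sample headwords are placeholders rather than real words.

```python
import re
from typing import Dict, List

# Hypothetical layout (an assumption for this sketch):
#   "<headword> <pos>. <definition>", e.g. "alpha n. first placeholder"
ENTRY_START = re.compile(
    r"^(?P<headword>\S+)\s+(?P<pos>n|v|adj|adv)\.\s+(?P<definition>.+)$"
)

def rebuild_entries(ocr_lines: List[str]) -> List[Dict[str, str]]:
    """Group OCR lines into entries; unmatched lines continue the previous entry."""
    entries: List[Dict[str, str]] = []
    for raw in ocr_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines left over from page layout
        match = ENTRY_START.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # A line that does not start a new entry is treated as a
            # continuation of the previous definition.
            entries[-1]["definition"] += " " + line
        # Lines seen before any entry has started are ignored here.
    return entries

sample = [
    "alpha n. first placeholder",
    "gloss continues here",
    "",
    "beta v. second placeholder",
]
entries = rebuild_entries(sample)
```

A real pattern would need to be written separately for each source, since the dictionaries differ in abbreviations, cross-reference notation, and column layout.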

OCR Challenges

One of the most significant obstacles in digitizing Bishnupriya Manipuri dictionaries is the limitation of OCR technology.

OCR systems are typically optimized for widely used languages and scripts, and they may struggle with specialized orthographic features found in Bishnupriya Manipuri texts.

Common OCR problems include:

incorrect recognition of consonant conjuncts
confusion between visually similar characters
missing diacritic marks
broken words across line boundaries
misinterpretation of punctuation

As a result, every OCR-generated entry must be reviewed manually before it can be added to the lexical database.
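
Some of these problems can be reduced mechanically before manual review begins. The following is a minimal sketch, not the project's actual tooling: it applies Unicode NFC normalization so that equivalent character sequences are stored consistently, and it rejoins words that were split with a hyphen at a line boundary. The function name `clean_ocr_lines` and the hyphen convention are assumptions made for the example.

```python
import unicodedata
from typing import List

def clean_ocr_lines(lines: List[str]) -> str:
    """Normalize OCR lines and rejoin words hyphenated across line breaks."""
    pieces: List[str] = []
    for raw in lines:
        # NFC gives canonically equivalent sequences one consistent encoding.
        line = unicodedata.normalize("NFC", raw.strip())
        if not line:
            continue
        if pieces and pieces[-1].endswith("-"):
            # Assumption: a trailing hyphen marks a word broken by the line end.
            pieces[-1] = pieces[-1][:-1] + line
        else:
            pieces.append(line)
    return " ".join(pieces)

joined = clean_ocr_lines(["a bilingual dic-", "tionary entry"])
```

This heuristic also merges genuine hyphenated compounds that happen to fall at a line end, so its output still needs the manual check described above.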

Manual Correction and Verification

Because of the limitations of OCR, manual correction forms a central part of the digitization process.

Each dictionary entry must be carefully examined to ensure that:

the word is spelled correctly
the meaning is accurately captured
cross-references are preserved
typographical errors are removed

This process requires not only technical attention but also linguistic judgment. In many cases, the editor must consult multiple sources to confirm the correct spelling or meaning of a word.

Identifying Duplicate Entries

When multiple dictionaries are combined into a single database, duplicate entries naturally appear.

However, identifying duplicates is not always straightforward. Two entries may appear different at first glance because of spelling variation, yet represent the same lexical item.

For this reason, the project employs several strategies to detect potential duplicates:

comparison of normalized spellings
phonological comparison through IPA conversion
manual review of similar entries

This process helps maintain the integrity of the lexical corpus while preserving meaningful spelling variants.
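
The first strategy, comparison of normalized spellings, might look roughly like the sketch below. It assumes the digitized headwords use Bengali-script Unicode code points; the normalization rules (NFC plus removal of zero-width joiners) and the field names `headword` and `source` are illustrative choices, and the sample strings are arbitrary character sequences rather than attested headwords. The function only groups candidates for manual review and never merges or deletes a variant.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List

ZERO_WIDTH = {"\u200c", "\u200d"}  # zero-width non-joiner and joiner

def spelling_key(word: str) -> str:
    """Build a comparison key; the stored spelling itself is left untouched."""
    normalized = unicodedata.normalize("NFC", word.strip())
    return "".join(ch for ch in normalized if ch not in ZERO_WIDTH)

def candidate_duplicates(entries: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
    """Return groups of entries whose headwords share a normalized key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[spelling_key(entry["headword"])].append(entry)
    return [group for group in groups.values() if len(group) > 1]

sample = [
    {"headword": "\u09a8\u09df", "source": "A"},        # precomposed letter YYA
    {"headword": "\u09a8\u09af\u09bc", "source": "B"},  # YA followed by nukta
    {"headword": "\u0995", "source": "A"},
]
groups = candidate_duplicates(sample)
```

Here the two encodings of the same visible spelling fall into one group, while the third entry is left alone. Orthographic variants that differ in actual letters would not be caught by this key and would need the phonological comparison or manual review.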

Building the Lexical Database

Once entries have been digitized and verified, they are stored in a structured database.

Each entry typically contains several fields, including:

the original Bishnupriya Manipuri word
meaning or definition
source dictionary
part of speech
phonological representation
cross-references

This structured format allows the dictionary to support both traditional lexical lookup and advanced computational analysis.
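
One way to picture such a record is as a small typed structure. The sketch below is only an illustration of the fields listed above; the class name `LexicalEntry`, the field names, and the JSON serialization are assumptions, and the sample values are placeholders rather than real dictionary data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class LexicalEntry:
    headword: str                          # the original Bishnupriya Manipuri word
    definition: str                        # meaning or definition
    source: str                            # which dictionary or collection it came from
    part_of_speech: Optional[str] = None
    ipa: Optional[str] = None              # phonological representation
    cross_references: List[str] = field(default_factory=list)

entry = LexicalEntry(
    headword="alpha",                      # placeholder, not a real headword
    definition="first placeholder",
    source="source-A",
    part_of_speech="n",
)

# ensure_ascii=False keeps non-Latin script readable in the stored JSON.
record = json.dumps(asdict(entry), ensure_ascii=False)
```

Keeping the source on every record lets spelling variants from different dictionaries coexist in the same database instead of being overwritten.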

A Year of Continuous Work

The creation of the digital dictionary has involved more than a year of continuous data collection and correction.

During this period, thousands of entries have been reviewed, corrected, and organized. Even after extensive work, the process remains ongoing.

Language documentation is rarely finished. New words appear, existing entries require refinement, and additional sources may become available.

The project therefore treats the dictionary not as a fixed product, but as a living and evolving lexical archive.

Toward a Digital Language Corpus

Beyond the immediate goal of building a dictionary, the lexical database also serves as the foundation for a broader digital language corpus.

Such a corpus can support research in areas such as:

phonological analysis
lexical frequency studies
automatic pronunciation generation
speech synthesis
language preservation

By transforming traditional dictionaries into a structured digital corpus, the project opens new possibilities for linguistic research and technological development.

