Chapter 4 — Data Collection and Lexical Corpus Building
Bishnupriya Manipuri Dictionary and Language Science Project
The foundation of any dictionary project is the collection of lexical data. For the Bishnupriya Manipuri Dictionary and Language Science Project, data collection has been one of the most challenging and time-consuming stages.
Unlike languages that already possess large digital corpora, most Bishnupriya Manipuri lexical material exists only in printed books, handwritten notes, or scattered personal collections. As a result, the first step of the project involved locating and gathering reliable lexical sources.
The goal was not merely to digitize one dictionary, but to create a unified lexical corpus that combines multiple sources while preserving their linguistic differences.
Primary Dictionary Sources
The digital dictionary project draws primarily on three groups of lexical sources:
- Dr. K. P. Sinha’s Bishnupriya Manipuri–English dictionary, which provides bilingual lexical explanations.
- L. K. Sinha and Santosh Sinha’s Bishnupriya Manipuri dictionary, which provides definitions within the language itself.
- Additional lexical material collected from community usage, literary texts, and personal word lists.
Each of these sources contributes valuable information, but they follow different orthographic traditions and organize their entries differently.
Building a unified digital corpus therefore requires careful comparison and normalization of entries.
Digitizing Printed Dictionaries
The transition from printed dictionaries to a digital database is not a straightforward process.
Printed dictionaries often contain complex page layouts, including columns, abbreviations, cross-references, and specialized formatting. These features can make automated digitization difficult.
To convert the printed dictionaries into machine-readable form, several steps were required:
- high-resolution scanning of printed pages
- optical character recognition (OCR)
- manual verification of OCR output
- reconstruction of entry structures
Although OCR technology can accelerate digitization, it often introduces errors when processing Indic scripts or complex typographic layouts.
OCR Challenges
One of the most significant obstacles in digitizing Bishnupriya Manipuri dictionaries is the limitation of OCR technology.
OCR systems are typically optimized for widely used languages and scripts, and they may struggle with specialized orthographic features found in Bishnupriya Manipuri texts.
Common OCR problems include:
- incorrect recognition of consonant conjuncts
- confusion between visually similar characters
- missing diacritic marks
- broken words across line boundaries
- misinterpretation of punctuation
As a result, every OCR-generated entry must be reviewed manually before it can be added to the lexical database.
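Two of the problems listed above can be partly pre-screened before manual review. The following Python sketch is illustrative only and is not the project's actual tooling: the function names are hypothetical, and it assumes the OCR output is plain Unicode text in which a hyphen marks a word split at a line end. It rejoins such words and flags tokens that begin with a combining mark, which usually means a diacritic was recognized without its base character.

```python
import re
import unicodedata


def rejoin_broken_words(text: str) -> str:
    """Join a word split by a hyphen at a line end with its continuation."""
    # "\S" is used instead of "\w" because Python's "\w" does not match
    # combining marks such as the vowel signs that end many Indic syllables.
    return re.sub(r"(\S)-\n(\S)", r"\1\2", text)


def tokens_with_dangling_marks(text: str) -> list:
    """Return tokens whose first character is a combining mark."""
    flagged = []
    for token in text.split():
        # Unicode categories Mn, Mc and Me are combining marks; a token that
        # starts with one has lost the base character it should attach to.
        if unicodedata.category(token[0]).startswith("M"):
            flagged.append(token)
    return flagged


# Latin placeholder text, used only to show the behaviour.
print(rejoin_broken_words("dic-\ntionary"))  # prints "dictionary"
```

A check like this cannot tell a genuine hyphenated compound at a line end from a broken word, so its output is a list of candidates for the editor, not a replacement for manual review.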
Manual Correction and Verification
Because of the limitations of OCR, manual correction forms a central part of the digitization process.
Each dictionary entry must be carefully examined to ensure that:
- the word is spelled correctly
- the meaning is accurately captured
- cross-references are preserved
- typographical errors are removed
This process requires not only technical attention but also linguistic judgment. In many cases, the editor must consult multiple sources to confirm the correct spelling or meaning of a word.
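Of the checks above, the preservation of cross-references is the easiest to support with a script. The sketch below is a hypothetical helper, assuming each entry is held as a dictionary with a "headword" key and an optional "cross_references" list (both names are illustrative). It reports references that point to a headword missing from the collection, which often indicates an OCR or typing error in either the reference or its target.

```python
def unresolved_cross_references(entries):
    """Return (headword, target) pairs whose target is not a known headword."""
    known = {entry["headword"] for entry in entries}
    problems = []
    for entry in entries:
        for target in entry.get("cross_references", []):
            if target not in known:
                problems.append((entry["headword"], target))
    return problems


# Placeholder entries, not taken from any of the source dictionaries.
sample = [
    {"headword": "word_a", "cross_references": ["word_b"]},
    {"headword": "word_b"},
    {"headword": "word_c", "cross_references": ["word_x"]},
]
print(unresolved_cross_references(sample))  # prints [('word_c', 'word_x')]
```

Spelling and meaning still need the editor's judgment; the script only narrows down where to look.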
Identifying Duplicate Entries
When multiple dictionaries are combined into a single database, duplicate entries naturally appear.
However, identifying duplicates is not always straightforward. Two entries may appear different at first glance because of spelling variation, yet represent the same lexical item.
For this reason, the project employs several strategies to detect potential duplicates:
- comparison of normalized spellings
- phonological comparison through IPA conversion
- manual review of similar entries
This process helps maintain the integrity of the lexical corpus while preserving meaningful spelling variants.
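The first strategy, comparison of normalized spellings, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's own procedure: it assumes headwords are stored as Unicode text in Bengali-Assamese script, and the helper names are hypothetical. Each headword is reduced to a comparison key (Unicode NFC normalization, with zero-width joiners removed), and entries that share a key are grouped as candidates for manual review.

```python
import unicodedata
from collections import defaultdict

# Zero-width non-joiner and joiner, which alter rendering but are invisible.
ZERO_WIDTH = {"\u200c", "\u200d"}


def spelling_key(word: str) -> str:
    """Build a comparison key; the stored spelling itself is never changed."""
    normalized = unicodedata.normalize("NFC", word.strip())
    return "".join(ch for ch in normalized if ch not in ZERO_WIDTH)


def duplicate_candidates(entries):
    """Group (headword, source) pairs whose comparison keys are identical."""
    groups = defaultdict(list)
    for headword, source in entries:
        groups[spelling_key(headword)].append((headword, source))
    return [group for group in groups.values() if len(group) > 1]


# Arbitrary sample strings, not project data. The first uses the single code
# point U+09DF; the second encodes the same letter as U+09AF plus U+09BC.
entries = [
    ("\u09AE\u09BE\u09DF\u09BE", "source A"),
    ("\u09AE\u09BE\u09AF\u09BC\u09BE", "source B"),
    ("\u0995", "source A"),
]
for group in duplicate_candidates(entries):
    print(group)
```

The two sample strings look identical on the page but differ byte for byte, so a plain string comparison would miss the pair. Because joiners can carry real orthographic distinctions, the key is used only to propose candidates; the original spellings remain in the database so that meaningful variants are preserved.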
Building the Lexical Database
Once entries have been digitized and verified, they are stored in a structured database.
Each entry typically contains several fields, including:
- the original Bishnupriya Manipuri word
- meaning or definition
- source dictionary
- part of speech
- phonological representation
- cross-references
This structured format allows the dictionary to support both traditional lexical lookup and advanced computational analysis.
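One way to picture such an entry is as a small record type. The Python sketch below is an assumption made for illustration: the field names, the defaults, and the choice of one JSON object per line as a storage format are hypothetical and do not describe the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LexicalEntry:
    headword: str                  # the original Bishnupriya Manipuri word
    definition: str                # meaning or definition
    source: str                    # dictionary the entry was taken from
    part_of_speech: str = ""
    phonological_form: str = ""    # for example, an IPA transcription
    cross_references: List[str] = field(default_factory=list)


def to_json_line(entry: LexicalEntry) -> str:
    """Serialize one entry; ensure_ascii=False keeps the script readable."""
    return json.dumps(asdict(entry), ensure_ascii=False)


def from_json_line(line: str) -> LexicalEntry:
    """Rebuild an entry from a line written by to_json_line."""
    return LexicalEntry(**json.loads(line))


# Placeholder values only; no real dictionary content is reproduced here.
entry = LexicalEntry(
    headword="headword_placeholder",
    definition="definition placeholder",
    source="source placeholder",
    cross_references=["related_placeholder"],
)
print(to_json_line(entry))
```

Keeping the source as its own field means that entries for the same word from different dictionaries can sit side by side without one overwriting the other.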
A Year of Continuous Work
The creation of the digital dictionary has involved more than a year of continuous data collection and correction.
During this period, thousands of entries have been reviewed, corrected, and organized. Even after extensive work, the process remains ongoing.
Language documentation is rarely finished. New words appear, existing entries require refinement, and additional sources may become available.
The project therefore treats the dictionary not as a fixed product, but as a living and evolving lexical archive.
Toward a Digital Language Corpus
Beyond the immediate goal of building a dictionary, the lexical database also serves as the foundation for a broader digital language corpus.
Such a corpus can support research in areas such as:
- phonological analysis
- lexical frequency studies
- automatic pronunciation generation
- speech synthesis
- language preservation
By transforming traditional dictionaries into a structured digital corpus, the project opens new possibilities for linguistic research and technological development.