Chapter 4 — Data Collection and Lexical Corpus Building

Bishnupriya Manipuri Dictionary and Language Science Project

The foundation of any dictionary project is the collection of lexical data. For the Bishnupriya Manipuri Dictionary and Language Science Project, data collection has been one of the most challenging and time-consuming stages.

Unlike languages that already possess large digital corpora, most Bishnupriya Manipuri lexical material exists only in printed books, handwritten notes, or scattered personal collections. As a result, the first step of the project involved locating and gathering reliable lexical sources.

The goal was not merely to digitize one dictionary, but to create a unified lexical corpus that combines multiple sources while preserving their linguistic differences.

Primary Dictionary Sources

The digital dictionary project primarily draws from three major lexical sources.

Each of these sources contributes valuable information, but they also present different orthographic traditions and organizational structures.

Therefore, the process of building a unified digital corpus requires careful comparison and normalization of entries.
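
One part of the normalization described above can be sketched in code. The example below is a minimal illustration, not the project's actual procedure: it assumes headwords are stored as Unicode text in the Bengali-Assamese script, and the function name normalize_headword is hypothetical. It builds a comparison key only; the original spelling from each source should be kept unchanged alongside it.

```python
import unicodedata


def normalize_headword(text: str) -> str:
    """Return a comparison key for a headword (hypothetical helper).

    The key is meant for matching entries across sources, not for display.
    """
    # NFC gives canonically equivalent code point sequences a single form,
    # e.g. the vowel sign E followed by vowel sign AA becomes vowel sign O.
    text = unicodedata.normalize("NFC", text)
    # Assumption: zero-width joiner/non-joiner are typed inconsistently
    # across sources, so they are ignored when comparing.
    text = text.replace("\u200c", "").replace("\u200d", "")
    # Trim the ends and collapse internal runs of whitespace.
    return " ".join(text.split())


# Two encodings of the same syllable produce the same key.
decomposed = "\u0995\u09c7\u09be"  # KA + vowel sign E + vowel sign AA
precomposed = "\u0995\u09cb"       # KA + vowel sign O
print(normalize_headword(decomposed) == normalize_headword(precomposed))
```

Keeping the key separate from the stored spelling lets entries from different orthographic traditions be compared without erasing the differences between them.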

Digitizing Printed Dictionaries

The transition from printed dictionaries to a digital database is not a straightforward process.

Printed dictionaries often contain complex page layouts, including columns, abbreviations, cross-references, and specialized formatting. These features can make automated digitization difficult.

To convert the printed dictionaries into machine-readable form, several steps were required:

Although optical character recognition (OCR) can accelerate digitization, it often introduces errors when processing Indic scripts or complex typographic layouts.

OCR Challenges

One of the most significant obstacles in digitizing Bishnupriya Manipuri dictionaries is the limitation of OCR technology.

OCR systems are typically optimized for widely used languages and scripts, and they may struggle with specialized orthographic features found in Bishnupriya Manipuri texts.

Common OCR problems include:

As a result, every OCR-generated entry must be reviewed manually before it can be added to the lexical database.
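
Manual review can be organized so that the most obviously damaged entries are seen first. The sketch below is an illustrative heuristic under stated assumptions, not a description of the project's tooling: it assumes headwords are written in the Bengali-Assamese script, and the names ALLOWED_EXTRA, suspicious_characters, and needs_priority_review are hypothetical. It only orders the review queue; every entry still needs a human check.

```python
# Code points of the Unicode "Bengali" block (U+0980 to U+09FF).
BENGALI_BLOCK = range(0x0980, 0x0A00)

# Assumption: these non-Bengali characters are acceptable inside a headword.
ALLOWED_EXTRA = {" ", "-", "\u200c", "\u200d"}


def suspicious_characters(headword: str) -> list[str]:
    """Return the characters that fall outside the expected script."""
    return [
        ch
        for ch in headword
        if ord(ch) not in BENGALI_BLOCK and ch not in ALLOWED_EXTRA
    ]


def needs_priority_review(headword: str) -> bool:
    """Flag a headword that contains at least one unexpected character."""
    return bool(suspicious_characters(headword))


# A stray ASCII digit inside a headword is a typical sign of an OCR misread.
print(suspicious_characters("\u0995\u09be1"))
print(needs_priority_review("\u0995\u09be"))
```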

Manual Correction and Verification

Because of the limitations of OCR, manual correction forms a central part of the digitization process.

Each dictionary entry must be carefully examined to ensure that:

This process requires not only technical attention but also linguistic judgment. In many cases, the editor must consult multiple sources to confirm the correct spelling or meaning of a word.

Identifying Duplicate Entries

When multiple dictionaries are combined into a single database, duplicate entries naturally appear.

However, identifying duplicates is not always straightforward. Two entries may appear different at first glance because of spelling variation, yet represent the same lexical item.

For this reason, the project employs several strategies to detect potential duplicates:

This process helps maintain the integrity of the lexical corpus while preserving meaningful spelling variants.
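
As an illustration of how candidate duplicates might be surfaced, the sketch below compares normalized headwords and scores their similarity with Python's standard difflib module. This is an assumption made for the example, not a description of the project's own strategies; the function names and the 0.8 threshold are hypothetical. The output is a list of candidates for an editor to judge, so genuine spelling variants are never merged automatically.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def comparison_key(headword: str) -> str:
    """Normalize a headword for comparison only (hypothetical helper)."""
    return unicodedata.normalize("NFC", headword).strip()


def candidate_duplicates(headwords, threshold=0.8):
    """Return (a, b, score) for every pair scoring at or above the threshold."""
    keyed = [(word, comparison_key(word)) for word in headwords]
    pairs = []
    for (a, key_a), (b, key_b) in combinations(keyed, 2):
        # Identical keys score 1.0; otherwise use difflib's similarity ratio.
        if key_a == key_b:
            score = 1.0
        else:
            score = SequenceMatcher(None, key_a, key_b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


words = ["\u0995\u09c7\u09be", "\u0995\u09cb", "\u09ae"]
for a, b, score in candidate_duplicates(words):
    print(repr(a), repr(b), score)
```

Comparing every pair grows quadratically with the number of entries, so a larger corpus would likely need to group entries first, for example by initial letter, before scoring pairs within each group.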

Building the Lexical Database

Once entries have been digitized and verified, they are stored in a structured database.

Each entry typically contains several fields, including:

This structured format allows the dictionary to support both traditional lexical lookup and advanced computational analysis.
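
A structured entry of this kind could be modeled as in the sketch below. All field names here (headword, gloss, source, part_of_speech, variants, verified) are hypothetical placeholders chosen for illustration and are not the project's actual schema; the values are dummy strings rather than real dictionary data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LexicalEntry:
    """One dictionary entry; every field name is an illustrative assumption."""

    headword: str                         # spelling as printed in the source
    gloss: str                            # meaning or definition
    source: str                           # which dictionary the entry came from
    part_of_speech: Optional[str] = None
    variants: list[str] = field(default_factory=list)  # attested alternative spellings
    verified: bool = False                # set to True after manual review


def to_json(entry: LexicalEntry) -> str:
    """Serialize an entry, keeping non-ASCII characters readable."""
    return json.dumps(asdict(entry), ensure_ascii=False)


# Placeholder values only; not real lexical data.
entry = LexicalEntry(
    headword="headword-1",
    gloss="gloss-1",
    source="source-A",
    variants=["variant-1"],
)
print(to_json(entry))
```

Recording the source and the variant spellings on each entry keeps the differences between dictionaries visible after they are merged, and a flag such as verified separates reviewed entries from raw OCR output.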

A Year of Continuous Work

The creation of the digital dictionary has involved more than a year of continuous data collection and correction.

During this period, thousands of entries have been reviewed, corrected, and organized. Even after extensive work, the process remains ongoing.

Language documentation is rarely finished. New words appear, existing entries require refinement, and additional sources may become available.

The project therefore treats the dictionary not as a fixed product, but as a living and evolving lexical archive.

Toward a Digital Language Corpus

Beyond the immediate goal of building a dictionary, the lexical database also serves as the foundation for a broader digital language corpus.

Such a corpus can support research in areas such as:

By transforming traditional dictionaries into a structured digital corpus, the project opens new possibilities for linguistic research and technological development.