English → Bishnupriya Manipuri AI Translator

The first open-source machine translation model for Bishnupriya Manipuri (BPY). Built with Meta AI's NLLB-200 and fine-tuned by the community.

Try the Model on Hugging Face | Use in Your Project
BPY Speakers: 500k+
Training Pairs: 2,558
V8.4 Accuracy: 67%
Prior NLP Tools: 0

What is this project?

Bishnupriya Manipuri is spoken by over 500,000 people across Assam, Tripura, Manipur, and Bangladesh. Despite this, it has zero support in Google Translate, Microsoft Translator, or any major AI model.

This project changes that. We fine-tuned Meta's NLLB-200-distilled-600M model using LoRA to create the world's first English → BPY translator. Version 8.4 outputs pure Bishnupriya Manipuri, not Assamese or Bengali.

English: Water is important
BPY Output: পানীহান দরকারি

English: The sky is blue
BPY Output: হাগহান নীলুৱাহান

English: My name is Arunita
BPY Output: মর নাংহান অরুনিতা

Why is this important for BPY?

1. Digital Preservation

Languages without digital tools fade faster. Every app, website, and AI model that skips BPY pushes young speakers toward Hindi/English. This model puts BPY on the digital map.

2. Access to Knowledge

BPY speakers can now translate English educational content, health information, and news into their mother tongue. No more relying on Assamese or Bengali as a bridge.

3. Community Ownership

Unlike Big Tech models, this is 100% open source. The training data, model weights, and code are public. The BPY community owns and controls it.

How did we build it?

The base model, facebook/nllb-200-distilled-600M, has a strong Assamese/Bengali bias. During pretraining it saw the word "জল" for water constantly, but it never saw the BPY word "পানীহান".

The breakthrough: We repeated critical BPY vocabulary 25x in the training data. This gave the LoRA weights enough signal to override the bias of NLLB's 600M-parameter base model.

  1. Data cleaning: Removed 127 pairs contaminated with Assamese/Bengali from the initial corpus
  2. Frequency weighting: Duplicated training pairs for core words such as পানীহান, হাগহান, and মর to teach the model BPY vocabulary
  3. LoRA fine-tuning: Trained for 3 epochs on a T4 GPU, reaching a validation loss of 0.753
  4. Result: The model stopped outputting জলহান and now correctly outputs পানীহান

Technical note: NLLB has no language code for BPY, so we use asm_Beng as the target-language token. The output is in Bengali script, but the vocabulary and grammar are pure BPY.
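
Steps 1 and 2 above (data cleaning and frequency weighting) might look roughly like the sketch below. This is not the project's actual training script: the function names, the contamination marker list, and the whole-word matching rule are assumptions made for illustration. Only the 25x factor and the example words come from the description above.

```python
# Hypothetical sketch of steps 1 and 2; names and word lists are illustrative.

# Assamese/Bengali forms that should not appear on the BPY side (assumed list).
CONTAMINATION_MARKERS = ["জল"]
# Core BPY words whose pairs get repeated.
CORE_BPY_WORDS = ["পানীহান", "হাগহান", "মর"]
BOOST_FACTOR = 25  # the 25x multiplier described above


def clean_pairs(pairs):
    """Drop pairs whose BPY side contains a known Assamese/Bengali marker."""
    return [
        (en, bpy)
        for en, bpy in pairs
        if not any(marker in bpy for marker in CONTAMINATION_MARKERS)
    ]


def boost_core_vocab(pairs, factor=BOOST_FACTOR):
    """Repeat each pair that contains a core BPY word `factor` times in total."""
    boosted = []
    for en, bpy in pairs:
        # Compare whole whitespace-separated tokens so that a short word
        # such as "মর" does not match inside a longer word.
        has_core_word = any(word in bpy.split() for word in CORE_BPY_WORDS)
        boosted.extend([(en, bpy)] * (factor if has_core_word else 1))
    return boosted


raw_pairs = [
    ("Water is important", "পানীহান দরকারি"),
    ("Water is important", "জলহান দরকারি"),  # illustrative contaminated target
]
training_pairs = boost_core_vocab(clean_pairs(raw_pairs))
print(len(training_pairs))  # 25: one clean pair, repeated 25 times
```

Plain duplication is the simplest way to upweight rare vocabulary because it needs no change to the training loop. A weighted sampler would be an alternative, but nothing in the description above suggests the project uses one.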

How can you use it?

1. For Developers

The model is on the Hugging Face Hub under the Apache 2.0 license. Use it in any commercial or non-commercial project:

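from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the base NLLB model, then attach the BPY LoRA adapter on top of it.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = PeftModel.from_pretrained(base, "Emarthar/nllb-bpy-beng-v8_4")
# Load the tokenizer from the adapter repo, with English as the source language.
tokenizer = AutoTokenizer.from_pretrained("Emarthar/nllb-bpy-beng-v8_4", src_lang="eng_Latn")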
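
Loading the model is only half of the job. A minimal sketch of a full translation call follows, assuming the standard NLLB generation pattern from the Transformers library and the asm_Beng target token mentioned in the technical note. The helper names `load_translator` and `translate` and the `max_new_tokens` value are illustrative choices, not part of the published model card.

```python
TARGET_LANG = "asm_Beng"  # stand-in target token, since NLLB has no BPY code
SOURCE_LANG = "eng_Latn"  # NLLB code for English


def load_translator(adapter_id="Emarthar/nllb-bpy-beng-v8_4",
                    base_id="facebook/nllb-200-distilled-600M"):
    """Load the base model, attach the LoRA adapter, return (model, tokenizer)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    base = AutoModelForSeq2SeqLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_id)
    tokenizer = AutoTokenizer.from_pretrained(adapter_id, src_lang=SOURCE_LANG)
    model.eval()  # inference only
    return model, tokenizer


def translate(text, model, tokenizer, max_new_tokens=64):
    """Translate one English sentence, forcing the target-language token."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB selects the output language from the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TARGET_LANG),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


# Usage (downloads the model weights on first run):
# model, tokenizer = load_translator()
# print(translate("Water is important", model, tokenizer))
```

Without `forced_bos_token_id`, generation is not steered toward the Bengali-script target token that the adapter was trained with, so the output language is not guaranteed.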
View Full Code + Docs

2. For BPY Speakers

Test the model directly on Hugging Face. Type English, get BPY back instantly. No coding needed.

Try Live Demo

How can you improve it?

This is V8.4 with 67% accuracy. We need community help to reach 95%+. Here's how:

1. Submit Corrections

Found a wrong translation? Download training_data_v8_4.csv, add the correct english,bpy_beng pair, and email it to us. We duplicate it 25x and retrain.
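
If you would rather script the edit than open the file by hand, a small helper along these lines can append a corrected pair. It is a sketch that assumes the CSV has a header row with the two columns `english` and `bpy_beng` and is UTF-8 encoded; check the downloaded file before relying on it. The function name `add_correction` is made up for this example.

```python
import csv
from pathlib import Path


def add_correction(csv_path, english, bpy_beng):
    """Append one corrected pair to a CSV with `english,bpy_beng` columns."""
    path = Path(csv_path)
    is_new = not path.exists() or path.stat().st_size == 0

    # If the existing file does not end with a newline, the appended row
    # would be glued onto the last line, so check the final byte first.
    needs_newline = False
    if not is_new:
        with path.open("rb") as f:
            f.seek(-1, 2)
            needs_newline = f.read(1) not in (b"\n", b"\r")

    # utf-8 keeps the Bengali script intact; newline="" is what csv expects.
    with path.open("a", encoding="utf-8", newline="") as f:
        if needs_newline:
            f.write("\n")
        writer = csv.writer(f, lineterminator="\n")
        if is_new:
            writer.writerow(["english", "bpy_beng"])  # assumed header
        writer.writerow([english.strip(), bpy_beng.strip()])


# Usage:
# add_correction("training_data_v8_4.csv", "Water is important", "পানীহান দরকারি")
```

Using the `csv` module rather than plain string concatenation means sentences that contain commas are quoted correctly.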

2. Donate Sentences

We need 5,000+ pairs to reach 90% accuracy. Send us English-BPY sentence pairs on any topic: family, food, agriculture, daily life. Format: English sentence,BPY translation
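
Before sending a batch, it can help to check that every line matches the `English sentence,BPY translation` format. The checker below is a rough sketch: the rule that the BPY side must contain Bengali-script characters (Unicode block U+0980 to U+09FF) and the English side must not is an assumption based on the examples on this page, and `validate_pairs` is a hypothetical helper, not part of the project's tooling.

```python
import csv
import io

BENGALI_BLOCK = range(0x0980, 0x0A00)  # Unicode block for the Bengali script


def has_bengali_script(text):
    """Return True if any character falls in the Bengali Unicode block."""
    return any(ord(ch) in BENGALI_BLOCK for ch in text)


def validate_pairs(text):
    """Split `English sentence,BPY translation` lines into (valid, rejected)."""
    valid, rejected = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) != 2 or not all(fields):
            rejected.append((line_no, "expected exactly two non-empty fields"))
        elif has_bengali_script(fields[0]) or not has_bengali_script(fields[1]):
            rejected.append((line_no, "expected English first, Bengali-script BPY second"))
        else:
            valid.append((fields[0], fields[1]))
    return valid, rejected


# Usage:
# valid, rejected = validate_pairs(open("my_pairs.csv", encoding="utf-8").read())
```

English sentences that contain a comma need to be wrapped in double quotes, otherwise they are read as more than two fields and rejected.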

3. Validate Grammar

Are you a BPY teacher or scholar? Review our outputs for tense, plurals, and honorifics. The model currently handles simple SOV sentences best.

4. Fork & Fine-tune

Developers: Load V8.4, add your dialect data, train 1-2 epochs, push V8.5. All training scripts are in the repo.

Roadmap

Version | Status      | Key Improvement
V8.4    | ✅ Released | Fixed Bengali bias. Outputs পানীহান, not জলহান
V8.5    | In Progress | Add sun/hot/father/work vocab. Target 80% accuracy
V9.0    | Planned     | 5000+ pairs, r=32, handle complex sentences

Contact & Credits

Model by: Emarthar / Uttam Singha / Bishnupriya Manipuri Language Development Project
Base model: Meta AI NLLB-200
License: Apache 2.0 - Free for commercial use
Dataset: Community contributed BPY corpus

To submit corrections or volunteer, contact us through Hugging Face or by email via manipuri.com.

Download Model | View Training Data | HF Bishnupriya Manipuri AI Community page