The first open-source machine translation model for Bishnupriya Manipuri (BPY). Built with Meta AI's NLLB-200 and fine-tuned by the community.
Bishnupriya Manipuri is spoken by over 500,000 people across Assam, Tripura, Manipur, and Bangladesh. Despite this, it has zero support in Google Translate, Microsoft Translator, or any major AI model.
This project changes that. We fine-tuned Meta's NLLB-200-distilled-600M model using LoRA to create the world's first English → BPY translator. Version 8.5.3 runs on a dedicated Hugging Face inference endpoint for reliable access.
Developers can call the endpoint directly:
curl https://hcurzfqqhq3x21kg.us-east-1.aws.endpoints.huggingface.cloud \
-X POST \
-H "Authorization: Bearer hf_YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"inputs": "Fifty books", "parameters": {"src_lang": "eng_Latn", "tgt_lang": "ben_Beng"}}'
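For Python callers, the same request can be sketched with the standard library alone. This is a minimal sketch that mirrors the curl command above. The helper names `build_request` and `translate_remote` are hypothetical, and because the endpoint's response format is not documented here, the JSON is returned unchanged rather than unpacked.

```python
import json
import urllib.request

# Endpoint URL copied from the curl example above.
ENDPOINT_URL = "https://hcurzfqqhq3x21kg.us-east-1.aws.endpoints.huggingface.cloud"


def build_request(text, token, src_lang="eng_Latn", tgt_lang="ben_Beng"):
    """Build the POST request that mirrors the curl command (hypothetical helper)."""
    payload = {
        "inputs": text,
        "parameters": {"src_lang": src_lang, "tgt_lang": tgt_lang},
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def translate_remote(text, token, timeout=30):
    """Send the request and return the decoded JSON response as-is."""
    request = build_request(text, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `translate_remote("Fifty books", "hf_YOUR_TOKEN")` sends the same body as the curl command. Building the request in a separate function lets you inspect the payload without making a network call.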
Languages without digital tools fade faster. Every app, website, and AI model that skips BPY pushes young speakers toward Hindi/English. This model puts BPY on the digital map.
BPY speakers can now translate English educational content, health information, and news into their mother tongue. No more relying on Assamese or Bengali as a bridge.
Unlike Big Tech models, this is 100% open source. The training data, model weights, and code are public. The BPY community owns and controls it.
The base model facebook/nllb-200-distilled-600M has a strong Assamese/Bengali bias. It saw Bengali and Assamese text millions of times during pretraining but never saw BPY.
The breakthrough in V8.5.3: We isolated BPY numbers like য়াংখেইহান and duplicated them 1000x in training. This taught the decoder that BPY numbers are valid sentence starts, fixing the "লেরিকহান লেরিকহান" repetition bug from V8.5.2.
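The duplication step can be illustrated roughly as follows. This is a sketch under assumptions, not the project's actual training script: the function names, the `(english, bpy)` tuple layout, and the rule "target starts with a known number word" are all hypothetical. Only the number word য়াংখেইহান and the 1000x factor come from the description above.

```python
import random

# Number words to boost. Only this one word appears in the text above;
# a real list would be longer (assumption).
NUMBER_WORDS = {"য়াংখেইহান"}


def starts_with_number(pair):
    """True if the BPY side of an (english, bpy) pair begins with a number word."""
    words = pair[1].split()
    return bool(words) and words[0] in NUMBER_WORDS


def oversample(pairs, is_priority, factor=1000, seed=0):
    """Return a shuffled list in which priority pairs appear `factor` times."""
    boosted = []
    for pair in pairs:
        copies = factor if is_priority(pair) else 1
        boosted.extend([pair] * copies)
    # Shuffle so the duplicates are spread across batches instead of clustered.
    random.Random(seed).shuffle(boosted)
    return boosted
```

Passing `starts_with_number` as `is_priority` repeats every number-initial pair 1000 times while leaving the rest of the corpus untouched. The fixed seed keeps the shuffled order reproducible between runs.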
Switched from asm_Beng to ben_Beng to avoid Assamese bias. The model uses ben_Beng as the target token. The output script is Bengali, but the vocabulary and grammar are pure BPY. This was critical to beating Assamese contamination.
The model is on the Hugging Face Hub under the MIT license. Use it in any commercial or non-commercial project:
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the base NLLB model, then attach the BPY LoRA adapter on top of it.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = PeftModel.from_pretrained(base, "Emarthar/nllb-bpy-beng-v8-5-3")
tokenizer = AutoTokenizer.from_pretrained("Emarthar/nllb-bpy-beng-v8-5-3")

def translate(text):
    tokenizer.src_lang = 'eng_Latn'  # the source language is English
    inputs = tokenizer(text, return_tensors='pt')
    # Force ben_Beng as the first decoder token so the output is in Bengali script.
    out = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids('ben_Beng'))
    return tokenizer.decode(out[0], skip_special_tokens=True)
This version is not yet fully accurate. We need community help to reach 99%+. Here's how:
Found a wrong translation? Download training_data.csv, add the correct english,bpy_beng pair, and email it to us. We duplicate it 25x and retrain.
We need 5000+ pairs for 99% accuracy. Send us English-BPY sentence pairs on any topic: family, food, agriculture, daily life. Format: English sentence,BPY translation
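Sentences often contain commas, so a plain "English sentence,BPY translation" line can be ambiguous. One way to avoid this is to write the pairs with Python's `csv` module, which quotes such fields automatically. The sketch below assumes a two-column file in the english,bpy_beng order described above. The function name `append_pairs` is hypothetical and not part of the project's tooling.

```python
import csv


def append_pairs(path, pairs):
    """Append (english, bpy_beng) pairs to a CSV file, skipping incomplete ones.

    Returns the number of rows actually written.
    """
    written = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for english, bpy in pairs:
            english, bpy = english.strip(), bpy.strip()
            if not english or not bpy:
                continue  # drop pairs that are missing one side
            writer.writerow([english, bpy])
            written += 1
    return written
```

A call such as `append_pairs("training_data.csv", my_pairs)` adds new rows to the end of the file and never overwrites existing ones. Reading the file back with `csv.reader` recovers each sentence intact, including any commas.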
Are you a BPY teacher or scholar? Review our outputs for tense, plurals, and honorifics. The model now handles number+noun correctly but needs more complex sentences.
Developers: Load V8.5.3, add your dialect data, train 1-2 epochs, push V8.6. All training scripts are in the repo.
| Version | Status | Key Improvement |
|---|---|---|
| V8.5.3 | ✅ Current | Fixed number+noun repetition. Outputs য়াংখেইহান লেরিক |
| V8.5.2 | ⚠️ Deprecated | Fixed grammar but had noun repetition bug |
| V8.5.1 | ⚠️ Deprecated | 50x weight, still Assamese contamination |
| V8.4 | ⚠️ Deprecated | Fixed Bengali bias. First pure BPY output |
| V9.0 | Planned | 5000+ pairs, r=32, handle complex sentences |
Model by: Emarthar/Uttam Singha/Bishnupriya Manipuri Language Development Project
Base model: Meta AI NLLB-200
License: MIT - Free for commercial use
Dataset: Community contributed BPY corpus
To submit corrections or volunteer: contact through Hugging Face or email via manipuri.com