English → Bishnupriya Manipuri AI Translator V8.5.3 Live

The first open-source machine translation model for Bishnupriya Manipuri (BPY). Built with Meta AI's NLLB-200 and fine-tuned by the community.

✅ V8.5.3 Production: Running on a dedicated HF Inference Endpoint. Fixes all known bugs, including number+noun patterns. Response time: ~2-3 seconds.
500k+ BPY Speakers
2,558+ Training Pairs
95%+ V8.5.3 Accuracy
24/7 API Uptime

What is this project?

Bishnupriya Manipuri is spoken by over 500,000 people across Assam, Tripura, Manipur, and Bangladesh. Despite this, it has zero support in Google Translate, Microsoft Translator, or any major AI model.

This project changes that. We fine-tuned Meta's NLLB-200-distilled-600M model using LoRA to create the world's first English → BPY translator. Version 8.5.3 runs on a dedicated HF endpoint for reliable access.

English: Fifty books
BPY Output: য়াংখেইহান লেরিক

English: My father works
BPY Output: মর বাবা কাম করের

English: The sun is hot
BPY Output: বেলীগ তপ্তা ইসে

API Access

Developers can call the endpoint directly:

curl https://hcurzfqqhq3x21kg.us-east-1.aws.endpoints.huggingface.cloud \
  -X POST \
  -H "Authorization: Bearer hf_YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Fifty books", "parameters": {"src_lang": "eng_Latn", "tgt_lang": "ben_Beng"}}'
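
The same request can be made from Python. This is a minimal sketch using only the standard library, not an official client. The helper names build_request and translate_remote are illustrative, and the response is assumed to be JSON, which the curl example above does not confirm, so the parsed body is returned as-is.

```python
import json
import urllib.request

ENDPOINT = "https://hcurzfqqhq3x21kg.us-east-1.aws.endpoints.huggingface.cloud"

def build_request(text, token, src_lang="eng_Latn", tgt_lang="ben_Beng"):
    """Build a POST request that mirrors the curl call (hypothetical helper)."""
    payload = {
        "inputs": text,
        "parameters": {"src_lang": src_lang, "tgt_lang": tgt_lang},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def translate_remote(text, token, timeout=30):
    """Send the request and return the parsed body (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(text, token), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling translate_remote("Fifty books", "hf_YOUR_TOKEN") sends the request. Keeping build_request separate makes it possible to inspect the payload without touching the network.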

Why is this important for BPY?

1. Digital Preservation

Languages without digital tools fade faster. Every app, website, and AI model that skips BPY pushes young speakers toward Hindi/English. This model puts BPY on the digital map.

2. Access to Knowledge

BPY speakers can now translate English educational content, health information, and news into their mother tongue. No more relying on Assamese or Bengali as a bridge.

3. Community Ownership

Unlike Big Tech models, this is 100% open source. The training data, model weights, and code are public. The BPY community owns and controls it.

How did we build it?

The base model, facebook/nllb-200-distilled-600M, has a strong Assamese/Bengali bias. It saw Bengali and Assamese millions of times during pretraining but never saw BPY.

The breakthrough in V8.5.3: We isolated BPY numbers like য়াংখেইহান and duplicated them 1000x in training. This taught the decoder that BPY numbers are valid sentence starts, fixing the "লেরিকহান লেরিকহান" repetition bug from V8.5.2.

  1. Data cleaning: Removed 127 pairs polluted with Assamese/Bengali from the initial corpus
  2. Frequency weighting: 25x for core vocab (V8.4), 500x for phrases (V8.5.2), 1000x for numbers (V8.5.3)
  3. Token fix: Switched from asm_Beng to ben_Beng to avoid Assamese bias
  4. LoRA fine-tuning: Trained 3 epochs for V8.5.2 plus 1 epoch for V8.5.3 on a T4 GPU
  5. Result: Model outputs pure BPY with correct grammar and number patterns

Technical note: NLLB has no language token for BPY, so we use ben_Beng as the target token. The output script is Bengali, but the vocabulary and grammar are pure BPY. This was critical to beating Assamese contamination.
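
Frequency weighting by duplication can be sketched as follows. This is an illustration under assumptions, not the project's training script: the multipliers come from the list above, while the category labels, the weight_pairs helper, and the way sample pairs are tagged are hypothetical.

```python
from collections import Counter

# Multipliers from the build steps; the category names are hypothetical labels.
WEIGHTS = {"core_vocab": 25, "phrase": 500, "number": 1000}

def weight_pairs(pairs, weights=WEIGHTS):
    """Repeat each (english, bpy, category) pair by its category weight.

    Pairs whose category has no entry in `weights` are kept once.
    """
    weighted = []
    for english, bpy, category in pairs:
        weighted.extend([(english, bpy)] * weights.get(category, 1))
    return weighted

# Sample pairs taken from the examples on this page; the tags are assumed.
corpus = [
    ("Fifty books", "য়াংখেইহান লেরিক", "number"),
    ("My father works", "মর বাবা কাম করের", "other"),
]
weighted = weight_pairs(corpus)
counts = Counter(english for english, _ in weighted)
print(counts["Fifty books"], counts["My father works"])  # prints: 1000 1
```

Duplication is a blunt way to raise how often the model sees rare patterns. A real training run would normally shuffle the weighted list before batching.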

How can you use it?

1. For Developers

The model is on the Hugging Face Hub under the MIT license. Use it in any commercial or non-commercial project:

from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the base NLLB model, then attach the V8.5.3 LoRA adapter on top of it.
base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = PeftModel.from_pretrained(base, "Emarthar/nllb-bpy-beng-v8-5-3")
tokenizer = AutoTokenizer.from_pretrained("Emarthar/nllb-bpy-beng-v8-5-3")

def translate(text):
    # Source is English; the target token is ben_Beng because NLLB has no BPY token.
    tokenizer.src_lang = 'eng_Latn'
    inputs = tokenizer(text, return_tensors='pt')
    # Force the decoder to start with the ben_Beng language token.
    out = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids('ben_Beng'))
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(translate("Fifty books"))

How can you improve it?

V8.5.3 is at 95%+ accuracy. We need community help to reach 99%+. Here's how:

1. Submit Corrections

Found a wrong translation? Download training_data.csv, add the correct english,bpy_beng pair, and email it to us. We duplicate it 25x and retrain.
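
Adding a correction by script avoids quoting mistakes when a sentence contains a comma. This is a minimal sketch using Python's csv module. It assumes english and bpy_beng are the header names of training_data.csv; the append_correction helper is hypothetical and not part of the repo.

```python
import csv

def append_correction(csv_path, english, bpy):
    """Append one english,bpy_beng row, writing the header if the file is new or empty."""
    try:
        with open(csv_path, encoding="utf-8", newline="") as f:
            existing = f.read()
    except FileNotFoundError:
        existing = ""
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        # Avoid gluing the new row onto an unterminated last line.
        if existing and not existing.endswith("\n"):
            f.write("\n")
        writer = csv.writer(f)
        if not existing:
            writer.writerow(["english", "bpy_beng"])  # assumed header names
        # csv.writer quotes fields containing commas automatically.
        writer.writerow([english.strip(), bpy.strip()])
```

For example, append_correction("training_data.csv", "The sun is hot", "বেলীগ তপ্তা ইসে") adds one row to the file.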

2. Donate Sentences

We need 5000+ pairs for 99% accuracy. Send us English-BPY sentence pairs on any topic: family, food, agriculture, daily life. Format: English sentence,BPY translation
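
Before sending pairs, a quick script check can catch swapped or empty columns. The sketch below is a hypothetical pre-submission check, not a project requirement. It only tests which Unicode block the letters come from, so it cannot tell BPY apart from Bengali or Assamese text.

```python
def in_bengali_block(ch):
    """True if the character lies in the Bengali Unicode block (U+0980 to U+09FF)."""
    return "\u0980" <= ch <= "\u09ff"

def check_pair(english, bpy):
    """Return a list of problems; an empty list means the pair passes these basic checks."""
    problems = []
    english, bpy = english.strip(), bpy.strip()
    if not english:
        problems.append("empty English side")
    elif any(in_bengali_block(ch) for ch in english):
        problems.append("Bengali-script characters on English side")
    # Vowel signs and other combining marks are not letters, so they are skipped here.
    letters = [ch for ch in bpy if ch.isalpha()]
    if not letters:
        problems.append("no letters on BPY side")
    elif not all(in_bengali_block(ch) for ch in letters):
        problems.append("non-Bengali-script letters on BPY side")
    return problems

print(check_pair("My father works", "মর বাবা কাম করের"))  # prints: []
```

A pair with the two columns swapped fails both the English-side and the BPY-side check.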

3. Validate Grammar

Are you a BPY teacher or scholar? Review our outputs for tense, plurals, and honorifics. The model now handles number+noun correctly but needs more complex sentences.

4. Fork & Fine-tune

Developers: Load V8.5.3, add your dialect data, train 1-2 epochs, push V8.6. All training scripts are in the repo.

Roadmap

Version | Status | Key Improvement
V8.5.3 | ✅ Current | Fixed number+noun repetition. Outputs য়াংখেইহান লেরিক
V8.5.2 | ⚠️ Deprecated | Fixed grammar but had noun repetition bug
V8.5.1 | ⚠️ Deprecated | 50x weight, still Assamese contamination
V8.4 | ⚠️ Deprecated | Fixed Bengali bias. First pure BPY output
V9.0 | Planned | 5000+ pairs, r=32, handle complex sentences

Contact & Credits

Model by: Emarthar/Uttam Singha/Bishnupriya Manipuri Language Development Project
Base model: Meta AI NLLB-200
License: MIT - Free for commercial use
Dataset: Community contributed BPY corpus

To submit corrections or volunteer: contact through Hugging Face or email via manipuri.com
