Article 1

Abstract. A diphone-based text-to-speech engine can be implemented efficiently in a web environment using PHP on the server side and JavaScript on the client side. PHP handles dictionary lookup, IPA conversion, phoneme extraction, diphone generation, and API output, while JavaScript loads and concatenates the required diphone audio files for playback in the browser. This article describes the architecture of such a TTS engine for Bishnupriya Manipuri.

1. Introduction

A practical Bishnupriya Manipuri TTS engine must perform more than simple audio playback. It must convert a word into a sequence of diphone audio files and then play them back in the correct order.

This creates a full speech pipeline:

Dictionary word
    ↓
PHP pronunciation engine
    ↓
IPA
    ↓
Phonemes
    ↓
Diphones
    ↓
Safe filenames
    ↓
JavaScript audio playback

The system is therefore divided into two main components:

Server side (PHP): pronunciation analysis and diphone filename generation
Client side (JavaScript): audio loading, sequencing, and playback

2. Server-Side Architecture

On the server side, PHP handles the language-specific logic.

Its main tasks are:

read the Bishnupriya Manipuri (BPM) word from the dictionary
convert the word to IPA
extract phonemes
build the diphone list
convert diphones to safe filenames
return this information as JSON

Typical PHP output:

{
  "ipa": "diʃa",
  "phonemes": ["d","i","ʃ","a"],
  "diphones": ["#-d","d-i","i-ʃ","ʃ-a","a-#"],
  "diphone_files": [
    "sil-d.wav",
    "d-i.wav",
    "i-sh.wav",
    "sh-a.wav",
    "a-sil.wav"
  ]
}

3. The Role of the Analyze API

A dedicated API endpoint such as analyze_api.php is a central part of the system. This endpoint receives a word ID or BPM word and returns structured pronunciation data.

A typical request may look like:

/analyze_api.php?id=4716

The JSON response may contain:

normalized word
IPA
phoneme list
diphone list
safe filenames
trace information explaining the pronunciation
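
A client can build the request and sanity-check the response before using it. The following is a minimal JavaScript sketch, not the archive's actual code: the helper names buildAnalyzeUrl and checkAnalysis are hypothetical, the word query parameter is an assumed name (only id appears in the request above), and only the four fields shown in the sample output of Section 2 are checked, because the field names for the normalized word and the trace are not given here.

```javascript
// Hypothetical helper: build the analyze_api.php request URL.
// Only the `id` parameter appears in the article; `word` is an assumed name.
function buildAnalyzeUrl(base, { id, word } = {}) {
  const params = new URLSearchParams();
  if (id !== undefined) params.set("id", String(id));
  else if (word !== undefined) params.set("word", word);
  else throw new Error("Either id or word is required");
  return `${base}/analyze_api.php?${params.toString()}`;
}

// Hypothetical helper: verify the fields shown in the sample JSON output.
function checkAnalysis(data) {
  if (typeof data.ipa !== "string") throw new Error("ipa must be a string");
  for (const key of ["phonemes", "diphones", "diphone_files"]) {
    if (!Array.isArray(data[key])) throw new Error(`${key} must be an array`);
  }
  // Each diphone should map to exactly one audio file.
  if (data.diphones.length !== data.diphone_files.length) {
    throw new Error("diphones and diphone_files differ in length");
  }
  return data;
}
```

URLSearchParams percent-encodes the query value, so a word written in Bengali script can be passed without manual escaping.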

This API makes the system modular, because the same pronunciation engine can be reused by:

word pages
diphone batch tools
validator pages
fallback audio generators

4. IPA to Diphone Processing in PHP

Once PHP has produced the IPA string, the next step is phoneme tokenization.

Example:

IPA: diʃa
Phonemes: d i ʃ a
Diphones: #-d d-i i-ʃ ʃ-a a-#

The PHP pipeline typically includes these functions:

IPA tokenization
diphone generation
safe filename conversion

For example:

#-d   → sil-d.wav
d-i   → d-i.wav
i-ʃ   → i-sh.wav
ʃ-a   → sh-a.wav
a-#   → a-sil.wav
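
The article performs tokenization and diphone generation in PHP; the sketch below shows the same logic in JavaScript purely for illustration. The function names tokenizeIpa and buildDiphones are hypothetical, and the tokenizer assumes that the only multi-character phonemes are long vowels marked with ː (as in aː, iː, uː from the mapping table in Section 5). A production tokenizer would probably need more rules.

```javascript
// Split an IPA string into phonemes. Assumption: the only multi-character
// units are long vowels written with the length mark "ː" (aː, iː, uː).
function tokenizeIpa(ipa) {
  const phonemes = [];
  for (const ch of ipa.normalize("NFC")) {
    if (ch === "ː" && phonemes.length > 0) {
      phonemes[phonemes.length - 1] += ch; // attach length mark to the vowel
    } else {
      phonemes.push(ch);
    }
  }
  return phonemes;
}

// Pair each phoneme with its neighbour, using "#" for the word boundaries.
function buildDiphones(phonemes) {
  const padded = ["#", ...phonemes, "#"];
  const diphones = [];
  for (let i = 0; i < padded.length - 1; i++) {
    diphones.push(`${padded[i]}-${padded[i + 1]}`);
  }
  return diphones;
}
```

A word with n phonemes yields n + 1 diphones, because the two boundary markers each contribute one transition.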

5. Safe Filename Mapping

A browser audio engine cannot reliably load raw IPA filenames: IPA symbols must be percent-encoded in URLs, and file systems and servers may store or normalize them inconsistently. Therefore, the server converts each diphone into a safe ASCII filename.

IPA form	Safe form
#	sil
aː	aa
iː	ii
uː	uu
ʃ	sh
ŋ	ng
ɔ	aw
ə	schwa

Examples:

#-d   → sil-d.wav
ʃ-aː  → sh-aa.wav
aː-#  → aa-sil.wav

Once this mapping is fixed, every page in the system must use the same conversion rules.
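
The table above can be expressed as a small lookup. This is a JavaScript sketch of the rule rather than the server's PHP code; the name diphoneToFilename is hypothetical, and rejecting unmapped non-ASCII symbols is an assumption added here so that gaps in the table are reported instead of producing an unusable filename.

```javascript
// Mapping copied from the table above; symbols not listed pass through as-is.
const SAFE_PHONEME = new Map([
  ["#", "sil"], ["aː", "aa"], ["iː", "ii"], ["uː", "uu"],
  ["ʃ", "sh"], ["ŋ", "ng"], ["ɔ", "aw"], ["ə", "schwa"],
]);

// Convert one diphone such as "ʃ-aː" into a filename such as "sh-aa.wav".
function diphoneToFilename(diphone) {
  const parts = diphone.split("-");
  if (parts.length !== 2) throw new Error(`Not a diphone: ${diphone}`);
  const safe = parts.map((p) => SAFE_PHONEME.get(p) ?? p);
  // Assumption: anything still outside a-z is an unmapped IPA symbol.
  for (const s of safe) {
    if (!/^[a-z]+$/.test(s)) throw new Error(`Unmapped phoneme in ${diphone}`);
  }
  return `${safe.join("-")}.wav`;
}
```

Keeping the mapping in one place, and importing it everywhere, is one way to guarantee that all pages apply identical rules.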

6. Client-Side Playback with JavaScript

On the client side, JavaScript receives the diphone file list and plays the audio files in order.

A simplified playback logic is:

1. Request pronunciation JSON from the API
2. Read diphone filename list
3. Build audio URLs
4. Load each WAV file
5. Play them sequentially

This can be implemented using the HTML5 audio element (HTMLAudioElement) or the Web Audio API.

7. Sequential Audio Playback

The simplest approach is to play each WAV file one after another.

Example file sequence:

sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

Each file is loaded and played, and when one file ends, the next begins.

Pseudo-code:

load diphone file list
for each file:
    create audio object for the file
    play the file
    wait until it ends before moving to the next file

This approach is easy to implement but may introduce tiny gaps between segments.
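
The sequential approach can be sketched in JavaScript as follows. The function name playSequentially and the injectable createAudio parameter are assumptions made for this example; in a browser the default factory uses the standard Audio constructor.

```javascript
// Play audio URLs one after another. `createAudio` is injectable so the
// sketch can be exercised outside a browser; by default it creates an
// HTMLAudioElement via the Audio constructor.
async function playSequentially(urls, createAudio = (url) => new Audio(url)) {
  for (const url of urls) {
    const audio = createAudio(url);
    await new Promise((resolve, reject) => {
      audio.onended = resolve; // fires when this segment finishes
      audio.onerror = () => reject(new Error(`Cannot play ${url}`));
      // play() returns a promise in modern browsers; surface its rejection.
      Promise.resolve(audio.play()).catch(reject);
    });
  }
}
```

Because each element starts loading only when its turn comes, the gap between segments includes load time as well as scheduling delay.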

8. Improved Playback with Web Audio API

For smoother synthesis, the Web Audio API can be used. This allows:

preloading all diphone files
buffer-based playback
gap reduction
crossfade smoothing

A more advanced playback engine can:

fetch all diphone WAV files
decode audio buffers
schedule them back-to-back
optionally overlap by 5–10 ms

This reduces clicks and improves continuity.
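
A buffer-based scheduler along these lines might look like the sketch below. The function names and the 5 ms default overlap are assumptions for illustration. The code overlaps adjacent segments but applies no gain ramps, so it is not a true crossfade; adding one would require a GainNode per segment.

```javascript
// Compute start times (seconds) so that each segment begins `overlap`
// seconds before the previous one ends. Pure function, no audio needed.
function computeStartTimes(durations, overlap = 0.005, startAt = 0) {
  const times = [];
  let t = startAt;
  for (const d of durations) {
    times.push(t);
    t += Math.max(d - overlap, 0);
  }
  return times;
}

// Browser-only: fetch, decode, and schedule all segments on one AudioContext.
async function playScheduled(urls, overlap = 0.005) {
  const ctx = new AudioContext();
  const buffers = await Promise.all(urls.map(async (url) => {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`Missing audio: ${url}`);
    return ctx.decodeAudioData(await response.arrayBuffer());
  }));
  // Start slightly in the future so every source is scheduled in time.
  const starts = computeStartTimes(
    buffers.map((b) => b.duration), overlap, ctx.currentTime + 0.05);
  buffers.forEach((buffer, i) => {
    const source = ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(ctx.destination);
    source.start(starts[i]);
  });
}
```

Because all start times are computed up front against the audio clock, playback does not depend on JavaScript timers firing between segments.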

9. Missing Diphone Handling

A robust TTS engine must handle missing diphone files gracefully.

When a file is missing, the system may:

show an error message
display the missing filenames
skip the missing segment
fall back to whole-word audio if available

Example missing-file message:

Missing (2) — Coverage: 60%
sil-d.wav
aa-sil.wav

This feedback shows exactly which recordings are still needed, which is especially useful while the diphone inventory is being rebuilt.
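
A report of this kind can be derived by comparing the required filenames with the files that exist. The JavaScript sketch below is illustrative: the name coverageReport is hypothetical, and how the list of available files is obtained (for example, a directory listing supplied by the server) is an assumption outside the scope of this example.

```javascript
// Compare the required files with the set of files that actually exist.
function coverageReport(required, available) {
  const have = new Set(available);
  const missing = required.filter((name) => !have.has(name));
  const found = required.length - missing.length;
  // Coverage is the percentage of required files that are present.
  const coverage = required.length === 0
    ? 100
    : Math.round((found / required.length) * 100);
  return { missing, coverage };
}
```

With five required files and two of them absent, the function reports two missing names and a coverage of 60, matching the example message above.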

10. Integration into Dictionary Pages

On a dictionary word page, the TTS button can trigger a JavaScript call that:

reads the dictionary word or entry ID
calls the pronunciation API
receives diphone files
plays them automatically

Example interface flow:

Word page
   ↓
Click “Play TTS”
   ↓
Fetch pronunciation JSON
   ↓
Receive diphone_files[]
   ↓
Play WAV files in sequence
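
The flow above can be wired together with a small piece of glue code. This is a sketch under stated assumptions: speakEntry is a hypothetical name, the fetchJson and play callbacks are injected by the page, and the /audio/diphone/ base path is taken from the directory layout in Section 12.

```javascript
// Hypothetical glue code: fetch the analysis for an entry ID and play it.
// `fetchJson` and `play` are injected so each page can reuse its own
// loader and playback engine.
async function speakEntry(id, { fetchJson, play, audioBase = "/audio/diphone/" }) {
  const data = await fetchJson(`/analyze_api.php?id=${encodeURIComponent(id)}`);
  if (!Array.isArray(data.diphone_files) || data.diphone_files.length === 0) {
    throw new Error("No diphone files returned");
  }
  // Turn bare filenames into URLs under the diphone audio directory.
  const urls = data.diphone_files.map((name) => audioBase + name);
  await play(urls);
  return urls;
}
```

On a word page, a click handler on the play button would call speakEntry with the entry ID, a fetchJson callback that wraps the browser's fetch, and whichever playback function the page uses.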

11. Why Consistency Across Pages Matters

One of the biggest engineering problems in diphone TTS systems is inconsistency. If different pages generate different IPA forms or different safe filenames, the audio files will not match.

Therefore, all pages must use:

the same IPA converter
the same phoneme tokenizer
the same diphone generator
the same safe filename rules

A shared engine file, such as a common PHP module, helps enforce this consistency.

12. Recommended System Structure

A clean Bishnupriya Manipuri TTS web implementation may use the following structure:

/bpm_converter_core.php
/diphone_engine.php
/analyze_api.php
/word.php
/diphone_validator.php
/audio/diphone/

Roles:

bpm_converter_core.php → pronunciation engine
diphone_engine.php → diphone tokenization and safe filename logic
analyze_api.php → JSON pronunciation output
word.php → user interface and playback button
diphone_validator.php → inventory testing
/audio/diphone/ → actual WAV files

13. Practical Playback Example

Input word: দিশা

Server output:

IPA: diʃa
Phonemes: d i ʃ a
Diphones:
#-d
d-i
i-ʃ
ʃ-a
a-#
Safe filenames:
sil-d.wav
d-i.wav
i-sh.wav
sh-a.wav
a-sil.wav

JavaScript playback order:

1. /audio/diphone/sil-d.wav
2. /audio/diphone/d-i.wav
3. /audio/diphone/i-sh.wav
4. /audio/diphone/sh-a.wav
5. /audio/diphone/a-sil.wav

These files are played sequentially to synthesize the word.

14. Advantages of PHP + JavaScript for TTS

This implementation model has several advantages:

easy web integration
no special client installation required
server-side pronunciation control
browser-based playback
good compatibility with dictionary websites

It is also well suited to research projects and language preservation platforms.

15. Limitations

A diphone TTS engine implemented this way still has some limitations:

quality depends heavily on diphone recording consistency
missing files interrupt playback unless they are skipped or replaced by fallback audio
plain sequential playback may sound slightly segmented
rare clusters may require additional recordings

These limitations can be reduced through:

better normalization
crossfade smoothing
clean diphone inventory design
systematic validation

16. Conclusion

A Bishnupriya Manipuri TTS engine can be implemented effectively in PHP and JavaScript using a diphone-based design.

The architecture is:

PHP:
word → IPA → phonemes → diphones → filenames

JavaScript:
filenames → audio loading → playback → speech

This design makes it possible to integrate speech synthesis directly into an online dictionary and provides a practical path for language technology in an under-resourced language.

Article 9
Integrating Speech into an Online Bishnupriya Manipuri Dictionary

The next article explains how dictionary pages, word records, audio tools, and TTS playback can be combined into a unified web platform.

Bishnupriya Manipuri Research Archive