Bhashini.ai Streaming STT provides real-time speech recognition over WebSocket, with server-side Voice Activity Detection (VAD) and automatic speech segmentation.
The API supports:
Continuous microphone streaming
Interim (partial) transcripts
Final transcripts per utterance
Speech state events (speech_start, speech_pause, speech_resume, speech_end)
Configurable VAD behavior
The Streaming STT WebSocket API accepts JSON control messages and binary audio frames over a single WebSocket connection. Clients must send a start event before streaming audio. Audio frames are decoded and optionally segmented using server-side VAD. Transcripts are streamed back asynchronously as interim or final results. Clients may finalize or stop the stream at any time.
Real-time Audio Source
(Client Mic, SIP/PBX (Asterisk), WebRTC (LiveKit))
↓
WebSocket (binary audio frames)
↓
Server-side Audio Decoder
↓
Voice Activity Detection (VAD)
↓
Speech Segmentation FSM
↓
ASR Engine
↓
Interim / Final Transcripts
Client → Server: WebSocket CONNECT /stt/stream
Server → Client: Connection accepted
The client must send a start event before streaming audio.
{
"event": "start",
"language": "Kannada",
"useVad": true,
"inputEncoding": {
"encoding": "linear16",
"samplingRate": 16000,
"bitsPerSample": 16,
"numChannels": 1
},
"vadConfig": {
"pStart": 0.6,
"pauseMs": 400
},
"interimIntervalMs": 200
}
Any omitted fields automatically use default values.
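As a concrete illustration, the Python sketch below opens the connection and sends the start event shown above. It assumes the third-party websockets package; the host name is a placeholder, and only the /stt/stream path and the start-event fields come from this documentation.

import asyncio
import json
import websockets  # third-party "websockets" package

async def open_stream():
    # Placeholder URL: only the /stt/stream path is documented; the host and any
    # authentication are deployment-specific.
    ws = await websockets.connect("wss://example.bhashini.ai/stt/stream")
    start_event = {
        "event": "start",
        "language": "Kannada",
        "useVad": True,
        "inputEncoding": {
            "encoding": "linear16",
            "samplingRate": 16000,
            "bitsPerSample": 16,
            "numChannels": 1,
        },
        "vadConfig": {"pStart": 0.6, "pauseMs": 400},
        "interimIntervalMs": 200,
    }
    await ws.send(json.dumps(start_event))  # JSON control message, sent as a text frame
    return ws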
Audio is sent as binary WebSocket messages
Each message contains raw audio bytes
Chunk size is flexible (recommended: 100–200 ms of audio)
Client → Server: binary(audio_chunk)
Client → Server: binary(audio_chunk)
...
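Continuing the same sketch, each binary frame carries raw linear16 bytes. At 16 kHz, 16-bit mono, 200 ms of audio is 16000 × 0.2 × 2 = 6400 bytes; pcm_source is a stand-in for whatever audio capture is used.

CHUNK_BYTES = 6400  # 200 ms of 16 kHz, 16-bit, mono linear16 audio

async def stream_audio(ws, pcm_source):
    while True:
        chunk = pcm_source.read(CHUNK_BYTES)  # raw little-endian PCM bytes
        if not chunk:
            break
        await ws.send(chunk)      # bytes payload -> one binary WebSocket frame
        await asyncio.sleep(0.2)  # pace roughly in real time when reading from a file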
On the server side, each chunk is processed as follows:
Decode Audio
→ Run VAD per frame
→ Update Speech FSM
→ Emit speech events
→ Accumulate audio
→ Run ASR
Speech Events
{ "event": "speech_start" }
{ "event": "speech_pause" }
{ "event": "speech_resume" }
{ "event": "speech_end" }
Interim Transcript
{
"transcript": "ನಮಸ್ಕಾರ",
"isFinal": false
}
Final Transcript
{
"transcript": "ನಮಸ್ಕಾರ ನನ್ನ ಹೆಸರು ಶಿವಕುಮಾರ",
"isFinal": true
}
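A client receive loop can dispatch these messages as sketched below; the field names (event, transcript, isFinal) are the ones shown above.

async def receive_messages(ws):
    async for message in ws:
        data = json.loads(message)
        if "event" in data:
            # speech_start / speech_pause / speech_resume / speech_end
            print("speech event:", data["event"])
        elif data.get("isFinal"):
            print("final:", data["transcript"])
        else:
            print("interim:", data["transcript"])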
{ "event": "finalize" }
or
{ "event": "stop" }
Bhashini.ai uses a VAD-driven finite state machine (FSM) to automatically segment speech.
This state machine continuously analyzes incoming audio and automatically decides when speech starts, pauses, resumes, and ends, based on Voice Activity Detection (VAD) probabilities and timing thresholds.
Think of the FSM as a careful listener:
It does not react instantly to noise
It waits for confirmation before declaring speech
It tolerates short pauses
It automatically finalizes speech when silence lasts too long
IDLE
Default state when no speech is detected
The system is listening for signs of speech
Audio is ignored unless it looks like speech
Transition
→ STARTING when speech probability rises above pStart
STARTING
Speech is suspected but not yet confirmed
Audio is temporarily buffered (pre-roll) but not sent to ASR yet
Prevents false triggers from noise or short bursts
Outcomes
→ SPEAKING
When speech probability stays high for startConfirmMs
→ Speech is officially started
→ Buffered audio is replayed to ASR
→ IDLE
If probability drops below pSilent
→ Buffered audio is discarded
SPEAKING
Audio is streamed to ASR continuously
Speech is considered active
Possible Transitions
→ PAUSED
If speech probability drops below pContinue for pauseMs
→ Temporary silence detected
→ ENDING
If speech duration exceeds maxUtteranceMs
→ Prevents excessively long utterances
PAUSED
Silence detected, but not long enough to end speech
Audio is still collected
Enables natural pauses during speaking
Possible Transitions
→ SPEAKING
If speech probability rises above pContinue
→ Speech resumes
→ ENDING
If silence lasts longer than endMs
→ Speech is considered finished
ENDING
The utterance is finalized
Final audio is sent to ASR
A speech_end event is emitted
FSM resets
Transition
→ IDLE
Ready for the next utterance
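To make the transitions concrete, the sketch below re-implements this state machine in Python. It follows the states and thresholds described above, with parameter names taken from the configuration listed later in this document; pre-roll buffering, ASR calls, and event emission are reduced to comments, so this is an illustration rather than the server's actual implementation.

FRAME_MS = 32  # frameDurationMs: one VAD frame

class SpeechFSM:
    def __init__(self, cfg):
        self.cfg = cfg          # pStart, pContinue, pSilent, startConfirmMs, pauseMs, endMs, maxUtteranceMs
        self.state = "IDLE"
        self.ms_in_state = 0    # time spent in the current state
        self.ms_speaking = 0    # total utterance duration
        self.ms_silent = 0      # consecutive low-probability time

    def on_frame(self, p):
        """Advance the FSM by one frame, given smoothed speech probability p."""
        c = self.cfg
        self.ms_in_state += FRAME_MS

        if self.state == "IDLE":
            if p > c["pStart"]:
                self._go("STARTING")              # speech suspected: begin pre-roll buffering
        elif self.state == "STARTING":
            if p < c["pSilent"]:
                self._go("IDLE")                  # false trigger: discard buffered audio
            elif self.ms_in_state >= c["startConfirmMs"]:
                self._go("SPEAKING")              # confirmed: replay buffer to ASR, emit speech_start
        elif self.state == "SPEAKING":
            self.ms_speaking += FRAME_MS
            self.ms_silent = (self.ms_silent + FRAME_MS) if p < c["pContinue"] else 0
            if self.ms_speaking >= c["maxUtteranceMs"]:
                self._go("ENDING")                # safety limit on utterance length
            elif self.ms_silent >= c["pauseMs"]:
                self._go("PAUSED")                # emit speech_pause
        elif self.state == "PAUSED":
            self.ms_silent += FRAME_MS
            if p > c["pContinue"]:
                self._go("SPEAKING")              # emit speech_resume
            elif self.ms_silent >= c["endMs"]:
                self._go("ENDING")                # emit speech_end, finalize utterance
        elif self.state == "ENDING":
            self._go("IDLE")                      # reset, ready for the next utterance

    def _go(self, new_state):
        self.state = new_state
        self.ms_in_state = 0
        if new_state in ("SPEAKING", "IDLE"):
            self.ms_silent = 0
        if new_state == "IDLE":
            self.ms_speaking = 0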
Speech is automatically finalized when:
Silence exceeds endMs
Utterance exceeds maxUtteranceMs
Client stops sending audio
Client sends finalize
Bhashini.ai sends interim transcripts:
During speech, every interimIntervalMs
On speech pause
Only if new audio has arrived
This avoids:
Repeating identical transcripts
Spamming during silence
Dependence on client-side silence frames
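A minimal sketch of that emission rule is shown below; the interval and the new-audio check come from the list above, and run_asr is a hypothetical partial-decode hook.

import time

class InterimEmitter:
    def __init__(self, interim_interval_ms=200):
        self.interval = interim_interval_ms / 1000.0
        self.last_emit = 0.0
        self.new_audio = False   # set whenever fresh audio is accumulated

    def add_audio(self):
        self.new_audio = True

    def maybe_emit(self, run_asr, paused=False):
        now = time.monotonic()
        due = paused or (now - self.last_emit) >= self.interval
        if due and self.new_audio:               # skip if nothing new has arrived
            result = {"transcript": run_asr(), "isFinal": False}
            self.last_emit = now
            self.new_audio = False
            return result
        return None                              # nothing to send: avoids repeats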
Clients should:
Send a start event before audio
Stream audio continuously during speech
Send finalize or stop at end
Do not rely solely on client-side VAD
Expect server-side timeouts to finalize speech
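Putting the earlier sketches together, a minimal end-to-end client could look like the following; file playback stands in for a live microphone, and all helper names come from the sketches above rather than an official SDK.

async def main():
    ws = await open_stream()                           # connect and send the start event
    consumer = asyncio.create_task(receive_messages(ws))
    with open("utterance.raw", "rb") as pcm_source:    # hypothetical raw linear16 file
        await stream_audio(ws, pcm_source)
    await end_stream(ws)                               # finalize, then stop
    await ws.close()
    await consumer                                     # the receive loop ends once the connection closes

asyncio.run(main())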
This VAD-driven design:
Avoids false speech detection
Handles natural pauses smoothly
Supports continuous streaming
Automatically finalizes speech
Works even if the client does not send silence audio
In summary, the FSM listens cautiously, confirms speech before acting, tolerates short pauses, and automatically finalizes speech when silence lasts too long.
VAD configuration (vadConfig) parameters:
{
"pStart": 0.60, // Strong confidence needed to start speech
"pContinue": 0.45, // Moderate confidence to continue speech
"pSilent": 0.20, // Very low confidence = silence
"startConfirmMs": 120, // Debounce before speech start
"pauseMs": 400, // Short silence = pause
"endMs": 1200, // Long silence = end speech
"maxUtteranceMs": 20000 // Safety limit
"prerollMs": 240 // Audio buffered before speech start
"frameDurationMs": 32 // Duration of one VAD frame
"expMovingAverageAlpha": 0.6 //VAD probability smoothing
}
Recommended audio input settings:
Setting      Value
Encoding     linear16
Sample Rate  16000 Hz
Bits/Sample  16
Channels     1 (mono)
Chunk Size   100–200 ms
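Most audio capture libraries deliver float samples, so converting to this format mainly means scaling to signed 16-bit little-endian PCM. A sketch using numpy, assuming resampling to 16 kHz and down-mixing to mono happen upstream:

import numpy as np

def to_linear16(float_samples):
    # Scale float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes.
    clipped = np.clip(np.asarray(float_samples, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()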