Unlike Automatic Speech Recognition (ASR), where a small number of highly accurate engines can effectively serve global needs, Text-to-Speech (TTS) is inherently human-facing and experience-driven. TTS output is not just “understood” by machines—it is heard, felt, and experienced by people.
If every application, device, and service across the world were to speak using the same handful of voices from a few global providers, the experience would quickly become monotonous and impersonal. Just as brands differentiate themselves through visual identity and tone, voice has emerged as a powerful dimension of brand expression.
Enterprises, governments, and digital platforms increasingly seek distinct, culturally resonant, and language-specific voices that reflect their identity and connect authentically with their audience. This creates a natural and healthy ecosystem where multiple high-quality TTS voices are not only desirable but essential.
By creating over 100 unique TTS voices across 22 Indian languages, Bhashini.ai enables rich diversity, inclusivity, and personalization—ensuring that human–AI interactions sound natural, engaging, and truly representative of India’s linguistic and cultural plurality.
Building a diverse and high-quality portfolio of TTS voices requires a carefully designed, end-to-end process that balances human artistry, linguistic accuracy, and technical rigor. To create over 100 unique voices across India’s languages, Bhashini.ai followed a structured, multi-stage approach.
The process begins with identifying professional voice artists who demonstrate strong vocal quality and native-level language proficiency. Raw, unedited voice samples are collected directly from artists using the Vaak Sangraahak mobile application, ensuring authenticity and consistency. Shortlisted artists then undergo a reading proficiency assessment to evaluate pronunciation, fluency, and expressive clarity, followed by formal onboarding through a Speech Recording Consent Agreement.
Once voice artists are selected, they undergo a detailed orientation that explains how Text-to-Speech models are built and outlines best-practice recording guidelines that directly contribute to the naturalness and expressiveness of the trained TTS output.
Following this orientation, speech recordings are carried out in professional studios equipped with state-of-the-art audio recording infrastructure, ensuring clean, high-fidelity source data.
Throughout the recording process, purpose-built productivity-enhancing tools are used to streamline speech capture and simplify real-time validation, enabling consistent, high-quality recordings at scale.
The first step in creating high-quality TTS voices is the careful identification of professional voice artists who demonstrate both excellent vocal characteristics and strong proficiency in their native language. To enable a fair, consistent, and scalable selection process, Bhashini.ai uses a structured, multi-stage approach.
Collection of Raw Voice Samples Using the Purpose-built “Vaak Sangraahak” Mobile App
To ensure an unbiased evaluation of voice talent, Bhashini.ai developed a custom mobile application called “Vaak Sangraahak” for direct collection of raw, unedited voice samples from artists. For each language, a curated set of 50 phonetically rich sentences—covering multiple speaking styles such as conversational, news-style, storytelling, and emotional tones—is uploaded to the application’s backend system.
Prospective voice artists install the Vaak Sangraahak app from the Google Play Store and register by providing basic details such as name, contact information, native language, and prior voice-over experience. After selecting the language for which they wish to submit samples, artists are guided through a simple recording workflow. The app displays one sentence at a time and allows the artist to record the sentence in a quiet environment. Each recording is saved locally, after which the next sentence is presented. Artists may re-record any sentence if required, ensuring accuracy and clarity. Once all 50 sentences are recorded, the artist uploads the recordings and profile details to the backend system for review.
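The sentence-by-sentence workflow described above can be sketched as a small state machine: the app serves the next unrecorded sentence, lets the artist overwrite any take, and enables upload only once all 50 sentences are captured. This is an illustrative Python sketch, not the actual Vaak Sangraahak implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RecordingSession:
    """Tracks one artist's progress through the curated sentence set."""
    sentences: list                                  # curated sentences for the chosen language
    recordings: dict = field(default_factory=dict)   # sentence index -> saved audio path

    def next_sentence(self):
        """Return (index, text) of the first unrecorded sentence, or None when done."""
        for i, text in enumerate(self.sentences):
            if i not in self.recordings:
                return i, text
        return None

    def save_recording(self, index, audio_path):
        """Store a take; calling again for the same index models a re-record."""
        self.recordings[index] = audio_path

    def ready_to_upload(self):
        """Upload is enabled only after every sentence has a recording."""
        return len(self.recordings) == len(self.sentences)
```

A session would then loop on `next_sentence()` until `ready_to_upload()` returns true, at which point the recordings and profile details are submitted to the backend for review.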
Review and Assessment of Voice Samples
The submitted voice samples are reviewed by the Bhashini.ai team to shortlist artists who meet the required professional voice quality standards. Shortlisted candidates then undergo a reading proficiency assessment, conducted online, in which they are asked to read impromptu sentences aloud. This assessment evaluates their ability to read accurately and fluently on the first attempt. Artists are scored on a scale of 1 to 10, and only those achieving a minimum threshold score proceed further. In addition, candidates are interviewed to assess their prior experience in proofreading and their commitment to their native language.
Final Selection and Consent
For artists who successfully clear both the voice quality review and reading proficiency assessment, a senior leadership review is conducted. A Director at Bhashini.ai evaluates the submitted recordings and assessment videos to finalize the top voice artists for each language and gender. Selected artists are personally contacted, their queries are addressed, and they are formally onboarded through the signing of a Speech Recording Consent Agreement, along with a mutually agreed recording schedule.
[App screenshots: information to be provided by voice artists before submitting their voice samples, language selection, recording workflow, and uploading of voice samples]
Each voice artist selected for the TTS program undergoes a structured orientation session designed to align artistic performance with the technical requirements of Text-to-Speech model training. During this session, artists are introduced to the fundamentals of how TTS models are built and trained, and how aspects such as pronunciation accuracy, natural pacing, and appropriate pauses directly influence the naturalness and quality of the synthesized voice.
As part of this orientation, clear and practical recording guidelines (as given below) are shared with the artists.
Recording Guidelines for Voice Artists
Read the script in a clear, natural, and well-paced manner.
Maintain a consistent speaking pace throughout the entire recording session.
If a pronunciation error occurs at any point in a sentence (for example, incorrect articulation of phonetic features such as Maha Prana or Anunasika), re-record the entire sentence from the beginning.
Avoid common recording issues, including:
Movement-related noise: Minimize any sounds caused by body movement, clothing, jewellery, rustling papers, or contact with furniture while recording.
Plosives: Avoid strong bursts of air hitting the microphone, which can cause popping sounds, especially on words beginning with sounds such as “P”. If a plosive occurs, re-record the sentence after adjusting microphone position and/or vocal force.
Continuity: Maintain a consistent vocal tone and style across all recordings.
Reading too fast or too slow: Read at a pace that is comfortable and easily understandable for listeners. Reading too fast may reduce clarity and cause skipped words, while reading too slowly may sound unnatural and reduce phonetic coverage.
All speech recordings are carried out in Bhashini.ai’s professional recording studio, designed to meet the stringent quality requirements of large-scale Text-to-Speech model training. The studio comprises a dedicated voice booth and a separate control room, both treated with state-of-the-art acoustic materials to ensure effective soundproofing and to eliminate unwanted reverberation and echo.
To preserve signal integrity and minimize interference, the voice booth contains no electrical equipment other than essential audio hardware. Speech is captured using a professional-grade analog cardioid condenser microphone (Neumann TLM-103), connected via high-quality audio cabling to a high-fidelity audio interface (UAD Apollo Twin X) located in the control room. Voice artists listen to instructions from the sound engineer through studio-grade headphones (Audio Technica ATH-M50X), also connected to the control room’s audio interface.
Recording scripts are provided as printed material, enabling artists to focus fully on performance while also marking any minor textual issues, which are later reviewed and corrected by the validation team. For consistency across sessions, the optimal microphone position for each voice artist is identified, marked, and reproduced whenever the artist returns for subsequent recordings.
The audio interface is connected to a high-end workstation (Mac) to ensure stable, glitch-free recording at professional sampling rates. At the start of each session, the sound engineer calibrates input levels to prevent signal clipping and to ensure clean, high-dynamic-range audio capture.
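The calibration step described above amounts to checking that the peak level of a test take leaves enough headroom below full scale. The following Python sketch shows one way to measure peak level in dBFS for 16-bit PCM samples; the 6 dB headroom target is a hypothetical value, as the source does not state the engineer's actual calibration thresholds.

```python
import math

def peak_dbfs(samples, full_scale=32768):
    """Peak level of 16-bit PCM samples in dBFS (0 dBFS = digital full scale)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / full_scale)

def levels_ok(samples, headroom_db=6.0):
    """True if the take's peak leaves at least `headroom_db` of headroom.

    The 6 dB default is an illustrative assumption, not Bhashini.ai's value.
    """
    return peak_dbfs(samples) <= -headroom_db
```

A take whose peak approaches 0 dBFS would fail this check, prompting the engineer to lower the input gain before the session proceeds.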
To support large-scale speech data collection while maintaining consistently high quality, Bhashini.ai employs purpose-built tools and well-defined workflows that enhance productivity and simplify validation.
Purpose-Built Audio Recording Software
A high-performance, Java-based desktop application is used for speech recording, specifically designed to meet the requirements of TTS dataset creation. The application runs seamlessly across macOS, Windows, and Linux platforms.
The recording interface simultaneously displays the text transcript and the live audio waveform, enabling effective coordination between the sound engineer and the voice artist. It supports instantaneous, sentence-level recording and storage, with each audio file saved together with its corresponding, corrected text transcript. In addition, the system allows language validators to review and correct minor transcription errors in real time, significantly reducing post-recording validation effort.
Centralized Storage and Version Control Using an On-Premise Git Server
All audio recordings and corresponding transcripts are managed using an on-premise Git server hosted at https://gitlab.bhashini.ai. A dedicated private repository is created for each language and gender combination to ensure clear separation and organization of datasets.
Daily recordings are uploaded to the repository in date-specific folders, each containing structured subdirectories for audio files and text transcripts. Standardized file naming conventions are followed to maintain consistency and traceability. The use of Git enables transparent version control, allowing efficient tracking, review, and delivery of changes to both audio recordings and transcripts over time.
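The date-specific folders and standardized file names described above can be generated programmatically. The source does not give the exact naming scheme, so the `language_gender_date_index` pattern below is a hypothetical illustration of such a convention.

```python
from datetime import date
from pathlib import PurePosixPath

def utterance_paths(language, gender, session_date, index):
    """Build the date-specific folder and standardized file names for one utterance.

    Hypothetical scheme: <date>/wav|txt/<lang>_<gender>_<date>_<index>.<ext>.
    """
    day = session_date.isoformat()
    stem = f"{language}_{gender}_{day}_{index:04d}"
    root = PurePosixPath(day)
    return root / "wav" / f"{stem}.wav", root / "txt" / f"{stem}.txt"
```

Deterministic names like these are what make Git-based tracking effective: each day's upload is a self-contained commit, and any change to an audio file or transcript is attributable to a specific session and utterance.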