Funded by MeitY’s Bhashini initiative, the dataset, called IndicVoices, spans 22 Indian languages and 7,348 hours of audio from 16,237 speakers
IndicVoices said that it plans to capture nearly 17,000 hours of voice data from more than 400 districts across the country in the near future
IndicVoices aims to build the country’s first automatic speech recognition model that encompasses all the 22 languages listed in the Constitution’s eighth schedule
AI4Bharat, a research lab at IIT-Madras, unveiled a comprehensive, open-source speech dataset, called IndicVoices, on Wednesday (March 6).
Funded by the Ministry of Electronics and Information Technology’s (MeitY) Bhashini initiative and other non-profits, the dataset spans 22 Indian languages and 7,348 hours of audio from 16,237 speakers.
Of the total 7,348 hours of audio, a majority (74%) is extempore speech, while the rest is read (9%) and conversational audio. AI4Bharat also said that 1,639 hours have already been transcribed under the initiative.
In a blog on its website, IndicVoices said that it plans to capture nearly 17,000 hours of voice data from more than 400 districts across the country in the near future.
“It’s a step towards collecting spontaneous speech data across the rich tapestry of Indian languages, while honouring the vast linguistic, cultural, and demographic diversity! With this, we release 7,348 hours of speech data! Let’s push the boundaries of Indic speech technologies!” said AI4Bharat on X.
The project claims to have employed more than 1,893 individuals, including language experts, local mobilisers, coordinators, quality control experts, transcribers and language leads.
With this, IndicVoices aims to build the country’s first automatic speech recognition (ASR) model that encompasses all 22 languages listed in the Eighth Schedule of the Indian Constitution. ASR models employ artificial intelligence (AI) or machine learning (ML) to convert human speech into readable text.
While most ASR models are primarily trained on the English language, the initiative could enable the training of such models to transcribe Indian languages. Once trained, they could be deployed for various use cases, including governance delivery and making government websites available to the general public in their language of choice.
The push for IndicVoices is part of the Centre’s larger programme aimed at spurring AI-led innovation in the country. A case in point is Bhashini, an AI-led language translation system, which was recently used by Prime Minister Narendra Modi to translate his speech into Tamil in real time.
The state-backed Bhashini project has reportedly contributed $5-$6 Mn to AI4Bharat for the purpose of data collection for its AI models. Bhashini has also reportedly funded more than 70 research institutes, including IIT Bombay, IISc Bengaluru and IIT Mandi.
While Bhashini eventually aims to leverage the datasets to build a National Public Digital Platform for languages to offer services for Indian citizens, the open source audio repository can also be used by the general public for research and development of AI products.