IIT-Madras’ AI4Bharat Unveils IndicVoices, Offers Access To 7,300 Hours Of Speech Datasets


SUMMARY

Funded by MeitY’s Bhashini initiative, the dataset, called IndicVoices, spans 22 Indian languages and 7,348 hours of audio from 16,237 speakers

AI4Bharat said that it plans to capture nearly 17,000 hours of voice data from more than 400 districts across the country in the near future

IndicVoices aims to build the country’s first automatic speech recognition model that encompasses all the 22 languages listed in the Constitution’s eighth schedule

AI4Bharat, a research lab at IIT-Madras, unveiled a comprehensive, open-source speech dataset, called IndicVoices, on Wednesday (March 6). 

Funded by the Ministry of Electronics and Information Technology’s (MeitY) Bhashini initiative and other non-profits, the dataset spans 22 Indian languages and 7,348 hours of audio from 16,237 speakers.

Of the total 7,348 hours of audio, a majority (74%) is extempore, while the rest is read (9%) and conversational audio. AI4Bharat also said that 1,639 hours have already been transcribed under the initiative.

In a blog post on its website, AI4Bharat said that it plans to capture nearly 17,000 hours of voice data from more than 400 districts across the country in the near future.

“It’s a step towards collecting spontaneous speech data across the rich tapestry of Indian languages, while honouring the vast linguistic, cultural, and demographic diversity! With this, we release 7,348 hours of speech data! Let’s push the boundaries of Indic speech technologies!” said AI4Bharat on X.

The project claims to have employed more than 1,893 individuals, including language experts, local mobilisers, coordinators, quality control experts, transcribers and language leads, among others.

With this, the IndicVoices project aims to build the country’s first automatic speech recognition (ASR) model covering all 22 languages listed in the Eighth Schedule of the Indian Constitution. ASR models use artificial intelligence (AI) or machine learning (ML) to convert human speech into readable text.

While most ASR models are trained primarily on English, the dataset could enable the training of models that transcribe Indian languages. Once trained, such models could be deployed for various purposes, including governance delivery and making government websites available to the general public in their language of choice.

The push for IndicVoices is part of the Centre’s larger programme aimed at spurring AI-led innovation in the country. A case in point is Bhashini, an AI-led language translation system that Prime Minister Narendra Modi recently used to translate his speech into Tamil in real time.

The state-backed Bhashini project has reportedly contributed $5-$6 Mn to AI4Bharat for the purpose of data collection for its AI models. Bhashini has also reportedly funded more than 70 research institutes, including IIT Bombay, IISc Bengaluru, and IIT Mandi.

While Bhashini eventually aims to leverage the datasets to build a National Public Digital Platform for languages that offers services to Indian citizens, the open-source audio repository can also be used by the general public for research and development of AI products.

