AI4Bharat Collecting 10 Tn Tokens To Build Next Generation Of AI Services

SUMMARY

Cofounder Mitesh Khapra claimed that AI4Bharat has “gone to almost every district in the country” and tried to cover almost all the 22 official languages in the past three years

Khapra added that several startups and academic institutes are using AI4Bharat’s data to build their own models to accelerate the “adoption of language technologies”

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions

IIT Madras-incubated artificial intelligence (AI) lab, AI4Bharat, is reportedly collecting 10 Tn tokens of language data to build the “next generation of AI services”.

For context, tokens are the basic units of input and output for large language models (LLMs). A token is a unit of text that can be a word, a character or a subword.
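To make the three granularities concrete, here is a minimal Python sketch of word, character and subword splitting. It is an illustration only and does not reflect how AI4Bharat or any production LLM tokenizes text. The function names and the tiny `TOY_VOCAB` vocabulary are assumptions made up for this example, and the subword step uses a simple greedy longest-match rule rather than a trained tokenizer.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per character, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocab):
    """Greedy longest-match split of a single word into subword pieces.

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, chosen only for this illustration.
TOY_VOCAB = {"token", "iz", "ation", "s"}

sentence = "tokenization counts tokens"
print(word_tokens(sentence))
print(char_tokens("token"))
print(subword_tokens("tokenization", TOY_VOCAB))
print(subword_tokens("tokens", TOY_VOCAB))
```

The same sentence yields a different token count under each scheme, which is why a figure such as 10 Tn tokens depends on the tokenizer used to count it.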

According to The Economic Times, AI4Bharat cofounder Mitesh Khapra claimed that the platform has “gone to almost every district in the country” and “tried to cover almost all the 22 official languages” in the past three years.

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions. 

Noting that the platform has built the tools required for data collection from scratch, Khapra added that several startups, academic institutes and deeptech institutes are using the lab’s data to build their own models to accelerate the “adoption of language technologies”.

“Our data, models and scripts are open sourced. You can build on top of that,” he said.

Khapra added that the data collected over the past three years will be fed into the “Ten Trillion Token” project.

“This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect 10 Tn tokens in Indian languages that would be synthetic data that would be language information and cultural information,” he added. 

He also noted that the data collected as part of the project will have use cases spanning farmers, children, digital payments and agriculture.

The comments came on the sidelines of an event organised by People+ai, which is backed by Aadhaar architect Nandan Nilekani and has also undertaken a project to collect 10 Tn language tokens scraped from sources ranging from formal government documents to conversations.

People+ai’s project aims to build datasets, which are fundamental to training AI foundation models. While there is plenty of content online in English (nearly 55% of all internet data), the paucity of content in local vernacular languages makes it difficult to train LLMs in them.

AI4Bharat and People+ai are looking to solve this problem by building datasets from the ground up that can capture cultural context, script and grammatical rules.

Khapra’s comments come a year after AI4Bharat launched its open-source speech dataset, called IndicVoices. Funded by the electronics and IT ministry’s Bhashini initiative and other non-profits, the dataset spans 22 Indian languages.
