Sarvam-1 is optimised for 10 Indian languages, including Hindi, Bengali, Tamil, and Telugu, besides English.
The model aims to tackle two key challenges – token inefficiency and poor data quality for Indic languages.
Sarvam AI also announced a partnership with Yotta Data Services for the Indic language model.
Sarvam AI has launched Sarvam-1, a 2 Bn parameter large language model built specifically for Indian languages.
In a blogpost, the startup said that the model is optimised for 10 Indian languages, including Hindi, Bengali, Tamil, and Telugu, besides English.
The model aims to tackle two key challenges – token inefficiency and poor data quality for Indic languages.
Token inefficiency arises when a model's tokenizer has to break a word into many pieces (tokens) in order to process it. For instance, in English, a word like “apple” might be processed as one token. But in some Indian languages, a single word might get split into 4-8 tokens. The more tokens a text needs, the slower and costlier it is to process.
Sarvam AI claims that Sarvam-1's tokenizer needs only 1.4-2.1 tokens per word, versus 4-8 in existing models. It said that the LLM is trained on Sarvam-2T, a 2-trillion-token dataset curated specifically for Indian languages. According to the startup, this improves performance in areas like cross-lingual translation and question-answering.
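To make the tokens-per-word figure concrete, here is a minimal Python sketch of how such a ratio could be measured. It is not Sarvam AI's evaluation code: the function `tokens_per_word` and the two toy tokenizers are hypothetical stand-ins, and the sketch tokenizes each whitespace-separated word on its own, which is a simplification of how real subword tokenizers handle running text.

```python
from typing import Callable, List


def tokens_per_word(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


# Hypothetical tokenizers, for illustration only (not real model tokenizers).
def char_tokenizer(word: str) -> List[str]:
    """Worst case: every character (code point) becomes its own token."""
    return list(word)


def chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """A coarser vocabulary: split the word into chunks of up to `size` characters."""
    return [word[i:i + size] for i in range(0, len(word), size)]


if __name__ == "__main__":
    sample = "नमस्ते दुनिया"  # "hello world" in Hindi
    print("character-level:", tokens_per_word(sample, char_tokenizer))
    print("chunked:", tokens_per_word(sample, chunk_tokenizer))
```

A lower ratio means fewer tokens for the same text, which is the sense in which the startup describes 1.4-2.1 tokens per word as more efficient than 4-8. To measure a real model, the toy functions would be swapped for that model's own tokenizer.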
Despite Sarvam-1 being smaller than models like Meta’s Llama-3.2-3B, Sarvam AI claims that it outperforms them on several industry benchmarks.
Sarvam-1 is now available for download on Hugging Face.
Earlier on Thursday (October 25), chip giant Nvidia’s CEO Jensen Huang said that a Hindi language model is the hardest to develop.
Meanwhile, Sarvam AI also announced its partnership with Yotta Data Services. The Sarvam-1 model has been trained on Yotta’s Shakti Cloud infrastructure, the startup said.
Earlier this year, the startup launched its full-stack GenAI platform comprising multiple products — Sarvam Agents, Sarvam 2B, Shuka 1.0, Sarvam Models, and A1.
The startup raised $41 Mn (around INR 342 Cr) in its Series A funding round in December last year. The round was led by Lightspeed Venture Partners, with participation from Peak XV Partners and Khosla Ventures.
At the heart of all these is the growing Indian GenAI market, which is expected to clock a CAGR of 48% between 2023 and 2030 to become an over $17 Bn opportunity.
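For readers who want to sanity-check the growth projection, the short Python sketch below works backwards from the figures quoted above. It rests on two assumptions that are not stated in the projection itself: that the market reaches exactly $17 Bn in 2030, and that growth compounds annually over the seven years from 2023. The function name `implied_start_value` is illustrative.

```python
def implied_start_value(end_value: float, cagr: float, years: int) -> float:
    """Starting value implied by an end value and a compound annual growth rate."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return end_value / (1 + cagr) ** years


if __name__ == "__main__":
    # Assumed inputs: $17 Bn in 2030, 48% CAGR, 2023 as the base year.
    base_2023 = implied_start_value(end_value=17.0, cagr=0.48, years=2030 - 2023)
    print(f"Implied 2023 market size: ${base_2023:.2f} Bn")
```

Under these assumptions, the projection implies a 2023 market of roughly $1.1 Bn. Since the source figure is "over $17 Bn", this is best read as a lower bound.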