In what is being hailed as a milestone for the country, Indian AI startup Sarvam AI's newly introduced Bulbul V3 and Sarvam Vision models have outperformed offerings from global AI giants, including Google Gemini and ChatGPT, in text-to-speech and OCR-based document reading for Indic languages, covering 11 languages for speech and 22 scripts for documents.
An Indian AI startup has achieved what many thought improbable - building artificial intelligence models that outperform Google's Gemini and OpenAI's ChatGPT in critical areas of Indian language processing, marking a significant milestone for the country's homegrown technology sector.
Sarvam AI, founded in 2023 by Pratyush Kumar and Vivek Raghavan, has launched two breakthrough models - Bulbul V3 for text-to-speech and Sarvam Vision for optical character recognition - that have beaten global leaders including Gemini Pro and ElevenLabs in processing Indic languages. The achievement has drawn attention from the highest levels of Indian government, with a Union Minister publicly praising the company's work.
Blind Study Victory
Bulbul V3, Sarvam AI's latest text-to-speech model, topped a blind study that drew over 20,000 votes. The model covers 11 Indian languages, with natural-sounding voices and remarkably low error rates. It has been particularly praised for capturing the nuances and regional variations that characterise India's linguistic diversity.
The achievement is especially noteworthy given that major global AI labs have typically struggled with Indian languages, often treating them as secondary priorities in their development roadmaps.
Document Reading Breakthrough
Sarvam Vision, the company's optical character recognition model, has reached 84.3 percent accuracy on messy, real-world documents across 22 Indic scripts, a notable result given the complexity and variation in Indian writing systems. The model's ability to handle poorly scanned documents, handwritten text, and mixed-language content has drawn particular attention from developers and enterprises.
While global models like Google's Gemini and OpenAI's GPT-4 Vision have made strides in document processing, they have historically underperformed on Indian scripts, particularly when dealing with less common languages or degraded document quality.
The success story hasn't been without its challenges. When the company first launched, it faced criticism for its initial direction of training small Indic language models. Industry observers questioned whether focusing on a niche market would prove viable against well-funded global competitors.
However, Sarvam AI made a strategic pivot, backed by $53 million in funding and access to government-provided GPUs, refocusing its efforts on specific use cases where Indian languages presented unique technical challenges that global models weren't adequately addressing.
"When I wrote about them a year ago, I felt like the direction to train small 'indic' language models was wrong. But boy, have they turned it around," noted one industry observer. "They have the best text-to-speech, speech-to-text, and OCR models for Indic languages, and that's actually really valuable."
While Sarvam AI's success is being celebrated as an Indian achievement, the implications extend beyond national pride. The company's performance demonstrates that focused, use-case-specific AI models can compete with and even surpass general-purpose models from better-funded competitors when applied to specialized domains.
The government's support through GPU access has been crucial, addressing one of the major barriers Indian AI startups face - the high cost of compute resources needed to train large language models.
What is Sarvam AI?
Sarvam AI is a Bengaluru-based AI startup founded in 2023 by Pratyush Kumar and Vivek Raghavan. It focuses on building AI models optimised for Indian languages. The company has so far raised $53 million in funding and has received government support through GPU access. Its voice generation model handles 11 languages and won a blind study with over 20,000 votes for its natural-sounding speech synthesis. Its document processing model supports 22 Indic scripts and achieves 84.3 percent accuracy on real-world documents.