Today’s AI models are slow, expensive, and lack the depth of human intelligence. Their high computational costs mean only the largest companies can afford to develop and deploy them. “The next generation of AI requires a phase shift in how we think about model architectures and machine learning,” says Karan Goel, co-founder and CEO of Cartesia. For centuries, human voices have carried the weight of communication, from intimate conversations to speeches that shaped history. But as technology advances, the way we engage with voice itself is undergoing a transformation that companies like Cartesia are determined to push forward.
Cartesia, a real-time AI-driven voice technology company, has secured $64 million in a Series A funding round led by Kleiner Perkins. The funding, also backed by Index Ventures, Lightspeed, A*Capital, Factory, Greycroft, Dell Technologies Capital, and Samsung Ventures, brings the startup’s total investment to $91 million. With over 10,000 customers already leveraging its technology, including Quora, Cresta, and Rasa, Cartesia is poised to shape the future of voice AI.
Cartesia (@cartesia_ai), March 11, 2025:
“We've raised a $64M Series A led by @kleinerperkins to build the platform for real-time voice AI. We'll use this funding to expand our team, and to build the next generation of models, infrastructure, and products for voice, starting with Sonic 2.0, available today.”
The Drive Behind Sonic 2.0
Cartesia’s latest voice model, Sonic 2.0, is designed to deliver ultra-realistic, low-latency speech, making it well suited for applications in conversational AI, content production, and real-time communication. Built on a state space model (SSM) architecture, Sonic 2.0 is twice the size of its predecessor while achieving higher speed and efficiency. It offers 90-millisecond latency for full models and an even faster 40 milliseconds for real-time applications, performance metrics that surpass those of industry competitors.
Vapi (@Vapi_AI), March 12, 2025:
“Latency is the biggest giveaway that an AI voice isn’t real. Even if it sounds human, a delay in response breaks the flow of conversation and makes it feel unnatural. That’s why we integrated @Cartesia_ai’s Sonic 2.0 into Vapi. – Ultra-low latency (40ms) delivering responses…”
Beyond speed, Cartesia has focused on refining voice cloning technology. Sonic 2.0 can capture subtle nuances, accents, and tonal variations, making it particularly valuable for customer service, content localization, and accessibility tools. Additionally, the company has introduced Sonic Turbo, an optimized version that further improves synthesis speed, reducing latency to 40 milliseconds.
Infrastructure Built for Enterprise-Grade AI
Cartesia isn’t just focused on AI performance; it’s also ensuring reliability. The company boasts 99.9% uptime, SOC 2 and HIPAA compliance, and an API designed for seamless developer integration. The ability to deploy Sonic on-premises or on-device makes AI-driven voice applications more accessible across industries.
CEO Karan Goel sees voice AI as a fundamental shift in communication. “This is the year of voice AI, and it’s going to be everywhere,” he said during the funding announcement.
To further refine its technology, Cartesia is adding new features such as a voice changer and infill editing, alongside advancements in streaming architectures and on-device inference. These developments aim to give users more control over their AI-generated speech, improving both customization and realism.
Karan Goel’s journey into AI started far from Silicon Valley’s tech hubs. His great-great-grandfather founded a manufacturing business in 1897, producing lab equipment such as beakers and microscopes. Growing up near his family’s factory, Goel was immersed in an environment of precision and logistics.
“The best—and certainly biggest—times used to be shipping days, when all the stuff used to get put into trucks and shipped, because they were exporting to Europe and the U.S.,” Goel recalled. “That was actually pretty fun, because you had all these boxes, then you put them all in trucks, very neatly stacked up.”
This early exposure to optimization and efficiency became a core part of his mindset. He pursued his PhD at Stanford’s AI Lab, working under renowned computer scientist Christopher Ré. There, he met co-founders Arjun Desai, Brandon Yang, and Albert Gu, each driven by the same intellectual curiosity. Gu, now an assistant professor at Carnegie Mellon, has played a critical role in the development of state space models (SSMs), which power Cartesia’s AI.
“I feel like a lot of my decisions are driven by the people that I work with, not necessarily by the work they’re doing,” Goel said. “I figure, if you’re around really smart, interesting people, you generally end up doing pretty amazing things.”
The Science Behind Sonic
Cartesia’s models are built on derivatives of Mamba, a widely used state space model pioneered by Gu and Princeton professor Tri Dao. SSMs give AI a form of working memory, making computations more efficient and improving model performance.
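To make the "working memory" idea concrete, here is a minimal sketch of a plain linear state space recurrence in Python. It is an illustration only, not Mamba and not Cartesia's implementation: Mamba adds input-dependent parameters and hardware-aware scanning that are omitted here. The function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, chosen for this example.

```python
from typing import List, Sequence


def ssm_scan(
    xs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state space recurrence over a 1-D input stream.

    h_t = a * h_{t-1} + b * x_t   (elementwise, one entry per state dimension)
    y_t = sum(c * h_t)

    All names here are illustrative assumptions, not a real library API.
    """
    # The state has a fixed size no matter how long the input gets;
    # this is the "working memory" carried from one step to the next.
    state = [0.0] * len(a)
    ys: List[float] = []
    for x in xs:
        # Decay the old state and mix in the new input sample.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        # Read the output out of the current state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return ys


if __name__ == "__main__":
    # An impulse followed by silence: the output fades as the state decays.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0]))
```

Each step touches only the fixed-size state, so the cost per input sample stays constant however long the stream runs. That property is what makes this family of models a natural fit for streaming audio.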
Unlike traditional AI systems that struggle with long-range dependencies, Cartesia’s approach enables real-time processing at scale. This efficiency is particularly relevant as AI adoption grows and computational demands increase. “Moving forward, voice is going to be such an important medium of communication,” Goel emphasized. “That’s how you communicate with businesses. That’s how you will communicate with computers. That’s how you communicate with robots eventually.”
Voice AI is a delicate balance of science and art. While technical advancements drive performance, the human ear is exceptionally sensitive to imperfections in speech. Small inconsistencies can create an unsettling effect known as the uncanny valley, where voices sound almost but not quite human.
“You can have a video, but if the voice doesn’t sound authentic and natural, the whole thing feels robotic,” Goel explained. “A single word can communicate a lot of meaning.”
This precision is what Cartesia aims to perfect. By focusing on speed, realism, and control, the company is positioning itself as a key player in the evolving AI landscape. With voice AI becoming more prevalent, from call centers to virtual assistants to content creation, the demand for natural-sounding, highly responsive AI will only grow.
Cartesia’s name itself is a nod to precision and mathematical foundations, drawn from René Descartes and Cartesian coordinates. “A lot of our work is very deeply mathematical,” Goel noted. But while the foundations are technical, the end goal is deeply human: making AI voices as seamless and natural as possible.