Voice AI That Feels Like a Conversation: Cartesia’s Sonic Transforms Real-Time Interactions

By Anshika Mathews
Published on November 11, 2024



Two Stanford AI Lab alumni, Karan Goel and Albert Gu, founded Cartesia in 2023 to build real-time, multimodal intelligence on the company’s proprietary State Space Model (SSM) architecture. Cartesia’s mission is to push the boundaries of AI applications through efficiency and low latency, with a particular focus on real-time text-to-speech and voice transformation.

Revolutionizing Voice AI with Sonic

Cartesia’s flagship product, Sonic, is a text-to-speech engine that delivers natural-sounding voice output in under 100 milliseconds, fast enough for interactive applications. That latency, combined with its nuanced audio quality, makes it well suited to industries where immediacy and engagement are critical. “Sonic is a really fast text-to-speech engine,” Goel explains, underscoring its appeal to sectors as diverse as gaming, virtual assistants, and high-interactivity customer support. Sonic opens new possibilities for AI-driven conversations that closely mirror human interaction.

“With Sonic,” Goel continues, “we’re not just creating voices; we’re designing experiences that feel alive. The dream is to have millions of users interacting in real-time with these models, where every response is instant and lifelike.”

Expanding Sonic’s Capabilities

Cartesia’s ambition extends beyond just voice generation. Sonic incorporates emotionally responsive elements, enabling voice output that goes beyond words to capture the subtleties of human expression. Goel candidly shares the company’s objective to bridge the gap between synthetic and human voices. “Most text-to-speech systems fail the audio Turing test. We’re constantly refining Sonic so that it captures a range of human emotions and nuances. That’s what will make interactions truly meaningful.”

Albert Gu highlights the foundation of this technology in Cartesia’s proprietary State Space Model architecture. “SSM enables us to process perceptual data like audio and video more efficiently,” Gu explains, “while scaling linearly, unlike Transformer models which scale quadratically. This allows us to create a real-time experience without sacrificing quality or efficiency.” This architecture not only boosts performance but also makes Sonic versatile for diverse applications, from gaming and customer service to new frontiers in interactive media.
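The scaling contrast Gu describes can be illustrated with a toy example. The sketch below is not Cartesia's architecture: it assumes a one-dimensional linear recurrence (h_t = a*h_(t-1) + b*x_t, y_t = c*h_t) with made-up coefficients, and it compares the number of recurrence steps against the number of query-key scores that full self-attention would compute for the same sequence length. The function names and parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 0.1,
             c: float = 1.0) -> Tuple[List[float], int]:
    """Toy 1-D linear state space recurrence (illustrative coefficients).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    Returns the outputs and the number of state updates performed.
    """
    h = 0.0
    ys: List[float] = []
    steps = 0
    for x in xs:
        h = a * h + b * x   # one constant-cost state update per input
        ys.append(c * h)
        steps += 1
    return ys, steps


def attention_pair_count(n: int) -> int:
    """Query-key scores computed by full self-attention over n tokens."""
    return n * n


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        _, steps = ssm_scan([1.0] * n)
        print(f"n={n}: recurrence steps={steps}, "
              f"attention scores={attention_pair_count(n)}")
```

Doubling the sequence length doubles the recurrence steps but quadruples the attention scores, which is the linear versus quadratic behavior referred to in the quote.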

Forging Partnerships to Showcase Innovation

Cartesia’s technology has already attracted partners eager to push boundaries alongside it. In collaboration with Cerebrium, Cartesia showcased Sonic’s capabilities by integrating it into a demo with an AI avatar. A Mistral-7B model running on Cerebrium provided the interactive dialogue, Tavus contributed the animated avatar, and Cartesia’s Sonic API generated the voice, creating an AI-driven avatar capable of engaging with users in real time.
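The three-stage flow in that demo (language model, then speech synthesis, then avatar) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the demo's actual code: every function here is a hypothetical local stub, none of them calls the Cerebrium, Cartesia, or Tavus APIs, and the "audio" is just encoded text. The sketch also assumes that speech is streamed in chunks and that responsiveness is measured as the time until the first chunk arrives.

```python
import time
from typing import Iterator, Optional, Tuple


def generate_reply(user_text: str) -> str:
    """Hypothetical stand-in for the language model step."""
    return f"You said: {user_text}"


def synthesize_speech(text: str, chunk_chars: int = 8) -> Iterator[bytes]:
    """Hypothetical stand-in for streaming text-to-speech.

    Yields fake audio chunks (encoded text slices) one at a time so the
    consumer can start work before the whole reply is synthesized.
    """
    for i in range(0, len(text), chunk_chars):
        yield text[i:i + chunk_chars].encode("utf-8")


def animate_avatar(audio_chunk: bytes) -> int:
    """Hypothetical stand-in for the avatar step; returns a fake frame count."""
    return len(audio_chunk)


def run_turn(user_text: str) -> Tuple[str, int, Optional[float]]:
    """Run one conversational turn and time the arrival of the first chunk."""
    start = time.perf_counter()
    reply = generate_reply(user_text)
    first_audio: Optional[float] = None
    frames = 0
    for chunk in synthesize_speech(reply):
        if first_audio is None:
            # Seconds elapsed from the start of the turn to the first chunk.
            first_audio = time.perf_counter() - start
        frames += animate_avatar(chunk)
    return reply, frames, first_audio


if __name__ == "__main__":
    reply, frames, first_audio = run_turn("hello")
    print(reply, frames, first_audio)
```

Streaming matters in a design like this because the avatar can begin animating on the first chunk instead of waiting for the full utterance.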

The impact of Sonic’s speed and realism was immediately evident. Artificial Analysis, a third-party platform, tested Sonic’s capabilities, affirming that it offers unparalleled voice quality and precision. “The feedback we received was amazing,” Goel notes. “Sonic’s latency of less than 100 milliseconds and our control over voice elements like speed, emotion, and regional accents put it in a league of its own.”

Another notable partnership is with Ego, a creator of AI-driven simulations, where Sonic is being applied in a live game through Thrall, a mod for Valheim. The collaboration allows Sonic to power realistic non-player characters (NPCs), each with its own dynamic personality and vocal expression. Goel explains, “Our partnership with Ego underscores Sonic’s ability to immerse players by adding complex emotional layers to in-game characters. Imagine NPCs who sound genuinely distressed, excited, or fearful. That’s the kind of realism we’re bringing to gaming.”

A Global Reach Through Multilingual Support

Cartesia has recently extended Sonic with Sonic Multilingual, which offers voice generation in 14 languages, including Hindi, Italian, Korean, and Russian. This step is part of the company’s commitment to making voice AI accessible on a global scale. Sonic Multilingual delivers the same low-latency, high-quality voice output in these languages, so users worldwide can interact with the AI in their native language. Cartesia’s SSM architecture supports this reach, allowing for efficient voice processing that doesn’t compromise quality.

Voice Changer: Adding a New Dimension to Audio Transformation

In addition to Sonic, Cartesia recently launched its Voice Changer, which lets users transform the voice in any audio clip while preserving essential attributes like prosody, expressiveness, and emotional depth. This feature, aimed at content creators, game developers, and businesses, reflects Cartesia’s vision for expanding voice AI’s versatility. Creators can use Voice Changer to add nuanced characters to narratives or audio presentations, while developers and businesses can use it to personalize user experiences with brand-consistent voices that feel authentic and expressive.

Impacting Business Interactions Through Voice Quality and Speed

The value Cartesia brings to business interactions is perhaps best exemplified by companies like Goodcall, which shifted from another provider to fully adopt Sonic for automated phone services. Goodcall’s CEO, Bob Summers, emphasizes Sonic’s edge over competitors, noting its “sub-100 ms latency” and unmatched conversational quality. “Sonic’s low latency has been a game-changer for us, keeping customers engaged and responding at a level of quality that other providers couldn’t match. It’s a leap forward in how we approach customer interactions.”

Building an Interactive Future

Looking ahead, Cartesia aims to drive even greater efficiency, exploring ways to make voice AI more expressive, accessible, and real-time across an array of devices. Goel sees Cartesia’s work as part of a broader movement toward creating AI that doesn’t just respond but truly engages. “The second wave of AI is about efficiency and making intelligence more accessible on the edge,” he explains. “Our vision is to bring high-powered models to devices, allowing them to interact and respond with genuine immediacy and a touch of humanity.”

Ultimately, Cartesia’s innovation with Sonic and Voice Changer is about more than technological advancement—it’s about reimagining the way we communicate with machines. In Goel’s words, “At Cartesia, we believe voice AI should feel as natural as speaking with a friend. When technology fades into the background and just lets you be in the moment, that’s when we know we’ve done our job right.”


Anshika Mathews

Anshika is the Senior Content Strategist for AIM Research. She holds a keen interest in technology and related policy-making and its impact on society. She can be reached at anshika.mathews@aimresearch.co
