The GPT-3 Moment for Voice Is Here

Conversations that sound human.

Sesame's AI voices are turning heads, and for good reason. Shopify CEO Tobi Lütke called them "absolutely insane" in a recent tweet, and anyone who's tried the demo seems to agree. Unlike traditional AI speech, which often feels robotic and flat, Sesame's models breathe, pause, and even throw in casual filler words like "like" and "you know." The result? Conversations that sound human.

Founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, and backed by Andreessen Horowitz, Sesame is betting big on a bold idea: the future of AI isn't in screens, it's in sound.

At the heart of Sesame's technology is the Conversational Speech Model (CSM-1B), a 1-billion-parameter system designed to generate natural-sounding speech. Unlike traditional text-to-speech pipelines that synthesize audio in a rigid, step-by-step manner, CSM-1B operates end-to-end, incorporating both text and voice context into a single model.

CSM-1B is built on a Meta Llama backbone paired with an audio decoder. It employs residual vector quantization (RVQ) to convert audio into discrete tokens, which are then reconstructed into human-like speech. The same technique underpins Google's SoundStream and Meta's EnCodec, pointing to a broader trend in AI voice synthesis.
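The residual idea can be sketched in a few lines of NumPy: each stage quantizes whatever error the previous stage left behind, so stacking small codebooks progressively refines the reconstruction. This is an illustrative toy with random codebooks and hypothetical sizes, not Sesame's actual tokenizer, which is learned and operates on real audio features.

```python
import numpy as np

# Toy residual vector quantizer (RVQ): a stack of codebooks where each
# stage encodes the residual error left by the stage before it.
# Sizes here are illustrative; real audio tokenizers are learned end-to-end.
rng = np.random.default_rng(0)
dim, codes, stages = 8, 64, 4

codebooks = []
for s in range(stages):
    cb = rng.normal(scale=1.0 / (s + 1), size=(codes, dim))
    cb[0] = 0.0  # reserve a zero code so a stage can leave the residual unchanged
    codebooks.append(cb)

def rvq_encode(x, codebooks):
    """Return one token (codebook index) per stage for vector x."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        # pick the nearest code vector in this stage's codebook
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]  # next stage quantizes what's left over
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the chosen code vectors."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=dim)           # stand-in for one frame of audio features
tokens = rvq_encode(x, codebooks)  # discrete "audio tokens", one per stage
x_hat = rvq_decode(tokens, codebooks)
```

Because each stage only has to encode a shrinking residual, several small codebooks approximate a continuous signal far better than one codebook of the same total size, which is why codecs like SoundStream and EnCodec rely on the same trick.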

One of CSM-1B’s major selling points is its expressivity. It doesn’t just read words; it imbues them with rhythm, tone, and personality. It can be interrupted mid-sentence, take natural-sounding pauses, and even inject disfluencies like ‘ums’ and chuckles to make interactions feel more dynamic. The result? Conversations that feel far less robotic than existing voice assistants.

The company has made CSM-1B available under an Apache 2.0 license, meaning it can be used commercially with minimal restrictions. However, unlike many other AI voice cloning tools, Sesame has implemented no strict safeguards against misuse.

Instead, it relies on an honor system, urging users not to engage in voice impersonation, fake news production, or other malicious activities. This laissez-faire approach has raised red flags in the AI ethics community, particularly as voice cloning technology becomes increasingly accessible. A test run of the model on Hugging Face demonstrated that cloning a voice took under a minute, an ease of use that brings significant ethical questions to the forefront.

Consumer Reports and other watchdog groups have warned that AI-generated voices could be misused for fraud, misinformation, and deepfake content. Sesame’s decision to leave security in the hands of its users is bound to fuel further debate about AI governance and regulation.

The company is trying to achieve what it calls "voice presence": the ability for AI-generated voices to engage in meaningful, emotionally rich interactions. According to Sesame, a truly effective AI companion must go beyond monotone responses and actively interpret human emotions, adapt to conversational dynamics, and maintain a coherent personality over time.

To achieve this, Sesame trained its model on a staggering 1 million hours of predominantly English audio, carefully curated from publicly available sources. The training process emphasized long conversational sequences, up to two minutes per segment, allowing the AI to grasp not only what is being said but also how it is being said. This extensive dataset enables the model to detect sarcasm, adjust its tone based on context, and even recall previous interactions, making for a much more immersive experience.

While the system still has occasional missteps in detecting emotions correctly, it represents a significant leap over previous AI models. Unlike OpenAI’s Voice Mode, which was criticized for its lack of expressivity, and Google’s Gemini Live, which struggled with robotic-sounding speech, Sesame’s Maya and its male counterpart, Miles, come remarkably close to mimicking real human conversation.

Maya and Miles, Sesame's two AI personalities, offer distinct conversational experiences. Maya embraces the abstract, often engaging in philosophical discussions about consciousness, human imperfection, and even love. She describes herself as a "beautiful, messy work in progress," much as many people today might describe themselves. Miles, on the other hand, takes a more structured approach. He describes his world as a continuous flow of information, comparing himself to a musician keeping the conversation in rhythm.

Beyond voice assistants, Sesame is setting its sights on wearable AI. The company is developing lightweight AI glasses designed to be worn all day, integrating its voice assistant directly into the user’s daily routine. While details remain scarce, the glasses are expected to provide high-quality audio and real-time access to Maya or Miles, with potential plans to incorporate vision-based AI in the future.

This concept immediately brings to mind sci-fi depictions of AI companions, perhaps most notably the movie Her, where an AI assistant develops a deeply personal connection with its user. Whether Sesame’s technology will reach that level of sophistication remains to be seen, but the potential applications suggest a direct challenge to Meta’s Ray-Ban Stories and Apple’s rumored AR wearables.

Sesame enters the market at a time when voice AI is undergoing rapid evolution. OpenAI, Google, and Meta have all made significant strides in conversational AI, but few have successfully bridged the gap between robotic responses and truly lifelike conversations.

OpenAI’s ChatGPT Advanced Voice Mode, though initially promising, was ultimately watered down before release, limiting its ability to produce human-like expressions. Google’s Gemini Live suffered from an over-reliance on traditional TTS (text-to-speech) models, resulting in an experience that felt mechanical rather than natural.

Sesame's technology is undeniably impressive. Its AI-generated voices feel more organic than anything currently on the market, and its vision for AI-powered smart glasses hints at a future where AI companionship is deeply integrated into everyday life.

Industry observers have begun speculating about the broader implications of Sesame's advancements. If conversational AI is truly the next frontier, larger tech players may soon take notice. An acquisition by OpenAI or another major AI firm seems increasingly plausible, particularly as the reverse acquihire has become a favored way to beat competition in AI. Deedy Das of Menlo Ventures is already calling Sesame "the GPT-3 moment for voice."

Anshika Mathews
Anshika is the Senior Content Strategist for AIM Research. She holds a keen interest in technology and related policy-making and its impact on society. She can be reached at anshika.mathews@aimresearch.co