David AI Becomes the Go-To Voice Data Partner for Leading AI Labs

By Anshika Mathews
Published on May 22, 2025

Voice is how AI enters the real world.

A year ago, Tomer Cohen and Ben Wiley barely made the deadline to submit their Y Combinator application. Today, their startup David AI is emerging as one of the most important data providers in the fast-developing field of audio-based artificial intelligence.

The company announced a $25 million Series A funding round led by Alt Capital and Amplify Partners, with participation from First Round Capital, Y Combinator, BoxGroup, and a network of angel investors specializing in frontier audio research. The round values the company at north of $100 million, less than twelve months after it was founded.

David AI’s mission is rooted in a straightforward but underserved problem in AI development: the shortage of high-quality training data for voice-based systems. “We started to think that the next phase of AI, the final evolution of AI, is where the AI moves out of the laptop and keyboard interface and into the real world,” Cohen, a former McKinsey analyst and chief of staff at Scale AI, told Forbes.

That idea laid the foundation for David AI, which today provides curated, multilingual, and metadata-rich audio datasets for AI labs developing voice models. Its core work is paying people to read scripts or hold conversations, then meticulously recording, refining, and annotating that audio.

“If you’re an AI lab, you probably want to be focused on algorithms and model development and not just this very low-level operational, technical, niche work,” Cohen said.

Building a Business on Audio

The market need is acute. In the past few years, AI has been dominated by text-based models like ChatGPT. But voice is quickly becoming the next frontier, driven by demand for AI-powered phone agents, wearables, humanoid robots, and embedded assistants. All of these applications rely on naturalistic, multilingual voice data that does not yet exist in sufficient quantity or quality.

David AI has accumulated over 100,000 hours of audio across 15+ languages, complete with annotations on accents and dialects. That makes it one of the most comprehensive sources of audio training data available to AI labs, many of which now count on David AI for production-scale datasets.

One reason for the excitement: the simplicity of the company’s model meets a sharp, unsolved pain point in the market. “Companies are just voracious for data nowadays,” said Sarah Catanzaro, general partner at Amplify Partners. “The beauty of [David AI] is it solves this urgent need that voice AI developers face today…but it’s also a relatively simple solution. If they need data, sell them data, you don’t need to overcomplicate it.”

First Round Capital’s Liz Wessel, who led the company’s $5 million seed round earlier this year, sees the same tailwinds. “It makes sense,” she said. “Everyone knows that it’s been text-based AI for the last couple of years with ChatGPT, and now everyone is starting to figure out how to bring AI to voice.”

A Data Lab for the Voice Revolution

David AI refers to itself not just as a data provider, but as an “audio data research lab.” It’s a distinction that Cohen believes is essential. “We build audio datasets with the same rigor that researchers apply to model development,” he wrote in the company’s Series A announcement. “Designing, evaluating, iterating, and scaling datasets with precision.”

This research-driven mindset gives David AI an edge in meeting the complex and evolving needs of top AI developers. The company tailors datasets for specific model architectures and use cases, including real-time, full-duplex voice systems, which require synchronous, channel-separated dialogue data at a scale far beyond what exists in the public domain.

A 2024 research paper by Meta AI underscores this need. It noted that even when all major public spoken dialogue datasets are combined, only about 3,000 hours of usable audio exist. Meta called for “millions of hours” of more structured, context-rich recordings to effectively train future end-to-end speech models.

David AI is directly addressing that gap. The startup claims its datasets are already powering cutting-edge voice production systems and research across some of the most advanced labs in the world. Though the company declined to name specific clients, it confirmed it now works with most of the so-called “Magnificent Seven” tech giants, understood to include Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla.

In terms of business growth, David AI has exceeded expectations. In less than a year, it has surpassed eight figures in annualized revenue run rate. And with new funding in place, the team plans to expand across research, engineering, product, and operations. “Because of our focus, we can invest deeply in audio products, infrastructure, operations, and models which, in turn, allows us to build the best audio datasets,” Cohen wrote.

What’s Next

The Series A funding will support David AI’s goal of becoming the world’s leading audio data research institution. The startup’s vision is built on a core conviction: “Voice is how AI enters the real world.” That belief is now backed not just by venture dollars, but by a growing roster of customers who see voice as the next major user interface shift and who need better data to get there.

Anshika Mathews

Anshika is the Senior Content Strategist for AIM Research. She holds a keen interest in technology and related policy-making and its impact on society. She can be reached at anshika.mathews@aimresearch.co