Unstructured’s Bet on Transforming Data Prep for AI with $0 in the Bank and a Vision for LLMs

“I made a bet early on with Unstructured. I was like, look, we’re not going to build anything in the first year of this defensible except for resolution on what the market wants, and the fastest way to achieve that is by building open-source.”

Over the past decade, Brian Raymond and the founding engineers at Unstructured Technologies kept running into the same challenge in natural language processing (NLP): clients were eager to adopt AI/ML solutions but were held back because their data was trapped in unusable formats. This common pain point became the driving force behind Unstructured, a company focused on the data preprocessing problem that has plagued the industry for years.

Data Bottlenecks in NLP

The world of data science has long been held back by a persistent challenge in NLP: preparing unstructured data for machine learning models. Data scientists were continuously forced to build bespoke, one-off data connectors and preprocessing pipelines to make natural language data compatible with AI algorithms. But these solutions were not scalable. Data scientists dreaded these bottlenecks, and with good reason: there was, and still is, a lack of tooling to effectively automate this crucial step. Without efficient ways to convert raw, unstructured data into usable formats for machine learning, the full potential of AI and large language models (LLMs) could never be realized.

The team at Unstructured recognized early that to unlock the true capabilities of LLMs, a core problem had to be solved first: how to connect, transform, and stage any form of unstructured natural language data at speed and at scale. This led to the founding of Unstructured Technologies.

Making Data Ready for LLMs

Unstructured began its journey in July 2022 with a bold vision: to provide an open-source toolkit that would transform raw, unstructured data into formats ready for use by LLMs. Initially, the company’s focus was on NLP workflows such as Named Entity Recognition (NER) and relation extraction models. They developed a suite of cleaning functions to ensure high-quality input, integrations with labeling tools like Argilla, and staging code that simplified the process of feeding data to model libraries such as Hugging Face.

However, the world of NLP was about to change dramatically. Within a few months of the launch of the Unstructured open-source library in September 2022, OpenAI’s ChatGPT (released on November 30, 2022) exploded onto the scene, fundamentally transforming the landscape of generative AI. With the advent of ChatGPT and similar tools, developers worldwide were eager to interact with their data in entirely new ways. Unstructured’s platform, designed to automate the processing of raw data for LLM training, quickly found itself at the heart of this shift.

In response to the growing demand for LLM-related tools, Unstructured pivoted to focus on integrating its technology with the LLM stack. This included integrations with vector databases like Weaviate and LLM orchestration frameworks such as LangChain. By doing so, Unstructured positioned itself as a critical player in the emerging LLM tech stack. Within just a few months, the company saw impressive growth, with over 700,000 downloads of its library on PyPI and usage across more than 100 companies and 2,400 GitHub repositories.

Data Transformation at Scale

Unstructured’s core value proposition lies in its ability to bridge the data preparation gap. Their platform simplifies the process of ingesting and transforming any type of unstructured data—whether from PDFs, images, audio, video, or raw text—into machine-readable formats suitable for LLMs. This enables enterprises to process vast amounts of unstructured data quickly, automating the tedious and time-consuming steps that were once manual and error-prone.

The company provides several entry points for users:

  1. Open-Source Python Library: This is the cornerstone of Unstructured’s technology. It enables developers to easily integrate the platform into their existing workflows and use it for a wide range of data processing tasks.
  2. Containers: Pre-configured containers provide an out-of-the-box solution for enterprises, making it easy to deploy and scale Unstructured’s tools.
  3. Cloud-Hosted API: The API offers an enterprise-grade solution that can process over 20 different types of natural language files, including raw data and LLM-ready files. This is particularly valuable for large organizations with diverse data sources, as the API seamlessly integrates with storage services like AWS S3, Google Cloud Storage, and Microsoft Azure Blob.
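To make "transforming unstructured data into machine-readable formats" concrete, here is a minimal sketch in plain Python. It is not Unstructured's library or API: the `Element` schema, the `partition_text` and `stage_as_json` helpers, and the title heuristic are all hypothetical, chosen only to show the general idea of splitting raw text into typed elements and staging them as JSON.

```python
import json
import re
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """One typed piece of a document (hypothetical schema for illustration)."""
    type: str
    text: str


# Matches a leading bullet ("-", "*", "•") or a number like "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def partition_text(raw: str) -> list[Element]:
    """Split raw text on blank lines and label each block with a rough type."""
    elements = []
    for block in re.split(r"\n\s*\n", raw):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if all(BULLET.match(ln) for ln in lines):
            # Every line starts with a bullet or number: one item per line.
            for ln in lines:
                elements.append(Element("ListItem", clean(BULLET.sub("", ln))))
            continue
        text = clean(" ".join(lines))
        # Assumed heuristic: a short single line without end punctuation is a title.
        is_title = len(lines) == 1 and len(text) < 60 and text[-1] not in ".!?:"
        elements.append(Element("Title" if is_title else "NarrativeText", text))
    return elements


def stage_as_json(elements: list[Element]) -> str:
    """Serialize elements to JSON for a downstream model or index."""
    return json.dumps([asdict(e) for e in elements], indent=2)


sample = """Quarterly  Report

Revenue grew   in the third quarter.
Costs were flat.

- Hire two engineers
- Ship the API
"""

elements = partition_text(sample)
print(stage_as_json(elements))
```

Real documents such as PDFs or images need layout analysis and OCR before a step like this, which is the harder part that a production tool has to handle.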

Unstructured’s Impact on the LLM Ecosystem

The rise of generative AI models like GPT-3, ChatGPT, and Google Gemini has created an increasing demand for clean, structured data. However, unstructured data, which by commonly cited estimates accounts for over 80% of all enterprise data, remains a significant challenge. While structured data has long been manageable through modern data stacks, unstructured data still represents a massive hurdle in AI development. This is where Unstructured has found its niche.

By automating the transformation of unstructured data into LLM-compatible formats, Unstructured plays a pivotal role in making generative AI more accessible and efficient. Their technology is particularly valuable in the context of retrieval-augmented generation (RAG), where pretrained LLMs can access external data to augment their knowledge and improve the accuracy of responses. With RAG becoming a critical component of modern AI applications, Unstructured’s tools ensure that the underlying data is continually updated and transformed into formats that support these advanced workflows.
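The retrieval step in RAG can be sketched in a few lines. The example below is illustrative only and assumes a toy setup: `embed` is a bag-of-words counter standing in for a real embedding model, the chunks are hard-coded rather than produced by a document processor, and `build_prompt` is a hypothetical helper that shows where retrieved text would be handed to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 1) -> str:
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "The refund policy allows returns within 30 days of purchase.",
    "Support is available by email on weekdays.",
    "Shipping takes five business days within the country.",
]
query = "How many days do I have to request a refund?"
print(build_prompt(query, chunks))
```

The quality of the retrieved context depends on how cleanly the source documents were split into chunks, which is why the preprocessing stage matters for RAG.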

As businesses worldwide ramp up their investments in generative AI, Unstructured’s ability to streamline data preparation has become increasingly valuable. A commonly cited estimate holds that data scientists spend upwards of 80% of their time preparing data, which creates a significant bottleneck in AI development. Unstructured’s platform aims to cut this time by offering continuous, real-time access to unstructured data that can be fed directly into LLMs, enabling more efficient and scalable AI development.

Series B Funding and Future Growth

Unstructured’s groundbreaking work has not gone unnoticed. In 2024, the company raised $40 million in a Series B funding round led by Menlo Ventures, with participation from notable investors such as Nvidia Corp.’s venture capital arm, IBM Ventures, Databricks Ventures, and angel investors including Vivek Ranadivé (Chairman of the Sacramento Kings), Chet Kapoor (CEO of DataStax), and Allison Pickens (New Normal Fund). This round brought Unstructured’s total funding to over $65 million.

Unstructured has formed a significant partnership with DataStax, focusing on enhancing AI and data processing capabilities. Their collaboration aims to simplify data preparation for AI applications, particularly in the realm of retrieval-augmented generation (RAG). Unstructured’s technology is now natively integrated with Langflow and Astra DB, streamlining the process of converting unstructured data into AI-ready formats. This integration enables developers to easily import and process unstructured data like PDFs, emails, and other document types, making them suitable for use in RAG applications. 

Additionally, by leveraging DataStax Vectorize, the integration allows for generating vector embeddings that significantly improve query relevancy in AI applications. DataStax has also updated their Astra Data Loader to support PDF files, incorporating Unstructured’s document-processing capabilities directly into the DataStax AI platform. Furthermore, a new Unstructured component has been introduced in Langflow, DataStax’s low-code development platform, allowing for flexible document ingestion across various file types.

Brian Raymond, CEO of Unstructured, emphasized the significance of this integration: “With our new, native integration with Langflow and Astra DB, we’re allowing AI developers to easily import and process unstructured data like PDFs, emails, and more. This enhanced capability sharpens query results and centralizes unstructured data handling within DataStax’s AI PaaS.”

Unstructured Platform

Unstructured has introduced the Unstructured Platform, an enterprise ETL (Extract, Transform, Load) solution tailored for the GenAI tech stack. It addresses the data bottleneck faced by companies implementing generative AI workflows by continuously processing unstructured data into LLM-compatible formats. According to the company, GenAI ETL pipelines can be set up in under five minutes through either a no-code UI or an API. The platform automatically transforms complex, unstructured data into clean, structured formats, using dynamic transformation and enrichment pipelines to keep output quality high.

With over 50 source and destination connectors, automatic detection of new data, and simplified API integration, the platform is designed to keep the management of multiple API connections cost-effective. It also offers enterprise-grade features such as SOC 2 Type 2, HIPAA, and GDPR compliance, advanced admin controls for data access, and options for in-VPC deployment for enhanced security. The platform’s “control plane” and “data plane” architecture is intended to improve data management and security. As Unstructured puts it, “Unstructured delivers fast, high-quality data transformations, empowering organizations to deploy GenAI solutions with greater speed and reliability.”
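The connector-based flow described above (source, transform, destination, with only new data processed on each run) can be sketched as follows. This is a simplified illustration, not the Unstructured Platform's implementation: `InMemorySource`, `InMemoryDestination`, `Pipeline`, and `to_record` are hypothetical names, and change detection by content hash is an assumed technique chosen for the example.

```python
import hashlib
from typing import Callable


class InMemorySource:
    """Hypothetical source connector: exposes (doc_id, raw_text) pairs."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def read(self) -> list[tuple[str, str]]:
        return list(self.docs.items())


class InMemoryDestination:
    """Hypothetical destination connector: stores one record per document."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def write(self, doc_id: str, record: dict) -> None:
        self.records[doc_id] = record


class Pipeline:
    """Extract from a source, transform, and load into a destination."""

    def __init__(self, source: InMemorySource, destination: InMemoryDestination,
                 transform: Callable[[str], dict]):
        self.source = source
        self.destination = destination
        self.transform = transform
        self._seen: dict[str, str] = {}  # doc_id -> content hash from the last run

    def run(self) -> int:
        """Process only new or changed documents; return how many were processed."""
        processed = 0
        for doc_id, raw in self.source.read():
            digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
            if self._seen.get(doc_id) == digest:
                continue  # unchanged since the last run, skip it
            self.destination.write(doc_id, self.transform(raw))
            self._seen[doc_id] = digest
            processed += 1
        return processed


def to_record(raw: str) -> dict:
    """Example transform: normalize whitespace and attach a word count."""
    text = " ".join(raw.split())
    return {"text": text, "n_words": len(text.split())}


source = InMemorySource({"a.txt": "Hello   world", "b.txt": "Data prep for LLMs"})
dest = InMemoryDestination()
pipeline = Pipeline(source, dest, to_record)

first = pipeline.run()   # both documents are new
second = pipeline.run()  # nothing changed, so nothing is reprocessed
source.docs["b.txt"] = "Data prep for LLMs, updated"
third = pipeline.run()   # only b.txt changed
print(first, second, third)  # prints: 2 0 1
```

Keeping the source and destination behind small interfaces is what lets a pipeline like this swap storage systems without touching the transform step.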

The platform is currently available to sign up for and try for free, with a user-friendly interface for building data processing pipelines without complex scripting. Drawing on its experience with enterprise customers, Unstructured adds, “We have provided a number of enterprise features to ensure security and compliance.” For GenAI applications, the platform pairs ETL automation with those security and compliance controls. The company’s invitation to prospective users is direct: “We can’t wait for you to get going with the Unstructured Platform!”

Transforming the Future of LLMs

Unstructured’s journey is just beginning. As generative AI continues to evolve, the need for high-quality, structured data will only grow. By providing a seamless way to transform unstructured data into formats that LLMs can use, Unstructured is positioning itself as a key player in the generative AI space. With new advancements in data processing and continued integration with major AI tools, Unstructured is poised to become the go-to solution for enterprises looking to leverage the power of LLMs.

Brian Raymond, CEO of Unstructured, summed it up best: “For the first time, developers are able to interact with all of their data through large foundation models. The ability to ingest and preprocess human-generated data is a critical bottleneck in realizing the value of LLMs, and Unstructured is here to help organizations overcome it.”

Reflecting on his journey, Raymond shared, “We went from $0 to five in the bank off of a slide deck right and some great references. I had no designs on doing it but I was kind of pushed out of the nest so to speak. I made a bet early on with Unstructured, I was like look we’re not going to build anything in the first year of this defensible except for resolution on what the market wants, and the fastest way to achieve that is by building open-source.”

Brian also offered a personal insight into his approach to entrepreneurship: “I gave myself 90 days and enrolled us in like the benefit California healthcare plan. I was like okay, I’m going to give myself to the end of June and see if I can make this thing work; otherwise, I’m taking any job I can get.”

As 2024 unfolds, Unstructured is leading the charge in moving LLM prototypes into production, enabling organizations to scale their AI solutions and deploy them with greater speed and efficiency. The company’s technology is not only a crucial enabler for AI development but also a key catalyst for the continued evolution of the LLM landscape. The potential for Unstructured to serve as the critical enabling scaffolding between human-generated data and foundation models is incredibly motivating for the team.

The pace of adoption is accelerating at a rate that is sure to surprise many over the next 12 to 18 months, and Unstructured is proud to be a part of that momentum. Positioned at the intersection of generative AI and large-scale enterprise adoption, Unstructured is ready to shape the future of AI, transforming the landscape along the way.

Raymond credits much of his thinking to his background at the CIA, saying, “Our core competencies were can you write, can you brief, and can you think critically. Those were the three things—promotion panels were all around those three things.”

Anshika Mathews
Anshika is the Senior Content Strategist for AIM Research. She holds a keen interest in technology and related policy-making and its impact on society. She can be reached at anshika.mathews@aimresearch.co