Tonic.ai, a pioneer in data synthesis solutions for software and AI developers, has announced the debut of Tonic Textual, the world’s first secure data lakehouse designed specifically for large language models (LLMs). The platform lets AI developers securely and efficiently use unstructured data for retrieval-augmented generation (RAG) systems and LLM fine-tuning, addressing two major obstacles to enterprise AI adoption: data integration and data privacy.
🚀 It’s Launch Day for Tonic Textual! Say goodbye to the integration and privacy headaches of #generativeAI. #Textual is the world’s first Secure Data Lakehouse for LLMs. Extract, govern, enrich, and deploy your #unstructureddata for #AI. https://t.co/WcCsAq5WUB
— Tonic.ai (@tonicfakedata) May 28, 2024
The Untapped Value of Unstructured Data
Enterprises are investing heavily in generative AI, motivated by its transformative potential. Implementing the technology effectively, however, requires proprietary data, which is frequently scattered across many file types in unstructured formats and contains sensitive information about customers, employees, and trade secrets. According to IDC, unstructured data accounts for about 90% of enterprise data, with organisations estimated to have created up to 73,000 exabytes in 2023 alone. Before it can be used in AI projects, this data must be pulled from siloed systems and standardised, which consumes a significant share of development effort.
“We’ve heard time and again from our enterprise customers that building scalable, secure unstructured data pipelines is a major blocker to releasing generative AI applications into production,” said Adam Kamor, Co-Founder and Head of Engineering at Tonic.ai. “Textual is specifically architected to meet the complexity, scale, and privacy demands of enterprise unstructured data and allows developers to spend more time on data science and less on data preparation, securely.”
The Importance of Privacy in AI
Data privacy is critical for businesses, especially when using third-party model services. According to the same IDC research, 46% of firms consider data privacy compliance a major barrier to leveraging proprietary unstructured data in AI systems. Protecting sensitive data from model memorization and accidental exfiltration is essential to avoiding costly compliance violations.
“AI data privacy is a challenge the Tonic.ai team is uniquely positioned to solve due to their deep experience building privacy-preserving synthetic data solutions,” said George Mathew, Managing Director at Insight Partners. “As enterprises make inroads implementing AI systems as the backbone of their operations, Tonic.ai has built an innovative product in Textual to supply secured data that protects customer information and enables organizations to leverage AI responsibly.”
Introducing the Secure Data Lakehouse for LLMs
Tonic Textual is the first data lakehouse for generative AI, built to extract, govern, enrich, and deploy unstructured data for AI development. Its key capabilities include:
- Automated Data Pipelines: Create, schedule, and automate unstructured data pipelines that extract and transform data into standardised formats for embedding, vector database ingestion, or LLM fine-tuning. Textual supports the most popular unstructured free-text formats, including TXT, PDF, CSV, TIFF, JPG, PNG, JSON, DOCX, and XLSX.
- Sensitive Data Protection: Automatically detect, classify, and redact sensitive information in unstructured data, with the option to reseed redactions with synthetic data that preserves semantic meaning. Textual employs proprietary named entity recognition (NER) models trained on a variety of datasets to provide robust protection.
- Enhanced Data Enrichment: Use document metadata and contextual entity tags to increase retrieval speed and context relevance in RAG systems.
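Tonic’s NER models are proprietary, but the redact-and-reseed idea behind Sensitive Data Protection can be sketched in a few lines. The illustration below uses toy regex patterns standing in for a trained NER model and assigns each detected entity a consistent synthetic token; every name and pattern here is hypothetical, not Tonic’s API.

```python
import re

# Toy patterns standing in for a real NER model; a production system
# would use trained entity recognition, not regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str, seed_store: dict) -> str:
    """Replace each detected entity with a synthetic token, reusing the
    same token for repeated occurrences of the same original value."""
    for label, pattern in PATTERNS.items():
        def reseed(match, label=label):
            original = match.group(0)
            if original not in seed_store:
                seed_store[original] = f"{label}_{len(seed_store) + 1}"
            return f"[{seed_store[original]}]"
        text = pattern.sub(reseed, text)
    return text

store = {}
doc = "Contact jane@acme.com or 555-867-5309. CC jane@acme.com."
print(redact(doc, store))
# Prints: Contact [EMAIL_1] or [PHONE_2]. CC [EMAIL_1].
```

Because the synthetic token is keyed on the original value, both mentions of the same email map to the same token, so downstream RAG or fine-tuning workloads retain the text’s semantic shape without seeing the raw PII.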
Looking ahead, Tonic.ai plans to make it even easier to build generative AI systems on private data without sacrificing privacy, with upcoming capabilities including:
- Native SDK Integrations: Use popular embedding models, vector databases, and AI developer platforms to build completely automated, end-to-end data pipelines.
- Increased Data Management Capabilities: New capabilities for data cataloguing, classification, quality management, privacy compliance reporting, and identity and access management.
- Data Connector Library: Native integrations with cloud data lakes, object stores, cloud storage, file-sharing platforms, and enterprise SaaS applications will connect AI systems to data across the organisation.
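The pipeline and SDK roadmap describes a familiar RAG preparation flow: extract text from files, chunk it, embed the chunks, and load them into a vector store. The sketch below mocks each stage with stdlib-only stand-ins (a hash-based “embedding” and an in-memory “vector store”); every function and class name is illustrative, and none of this is Tonic’s actual SDK.

```python
import hashlib
import math

def extract(raw: bytes) -> str:
    """Stand-in for format-aware extraction (PDF, DOCX, ...)."""
    return raw.decode("utf-8")

def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-width chunking; real pipelines split on structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding derived from a hash, NOT a real model."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for a real vector database."""
    def __init__(self):
        self.rows = []

    def add(self, vec, payload):
        self.rows.append((vec, payload))

    def search(self, vec, k=1):
        # Rank stored rows by dot-product similarity to the query vector.
        score = lambda row: sum(a * b for a, b in zip(vec, row[0]))
        return [p for _, p in sorted(self.rows, key=score, reverse=True)[:k]]

store = VectorStore()
for piece in chunk(extract(b"Unstructured notes about invoices and audits.")):
    store.add(embed(piece), piece)
print(store.search(embed("invoices"), k=1))
```

In a real deployment, `extract` would be a document parser, `embed` a hosted or local embedding model, and `VectorStore` a managed vector database; the point is only the shape of the end-to-end pipeline that native SDK integrations aim to automate.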
“Companies have amassed a staggering amount of unstructured data in the cloud over the last two decades; unfortunately, its complexity and the nascency of analytical methods have prevented its use,” said Oren Yunger, Managing Partner at Notable Capital. “Generative AI has finally unlocked the use case for that data, and Tonic.ai has stepped in to solve the complexity problem in a way that reflects its core mission to transform how businesses handle and leverage sensitive data while still enabling developers to do their best work.”