As enterprises increasingly adopt generative AI, managing sensitive data, especially first-party consumer data akin to PII, becomes crucial. Ensuring data privacy and security within cloud platforms is the first step before diving into predictive or prescriptive analytics.
In a recent roundtable, Aravind Peddiboyina, AI Innovation & Global Analytics Delivery Leader at Kimberly-Clark; a Senior Director at a Healthcare Company; Avijit Chatterjee, Head of AI/ML at Memorial Sloan Kettering Cancer Center; and Vinod Malhotra, CTO at Perceptyx, shared their insights.
They discussed the challenges of data storage, security policies, customer data usage rights, and compliance with regulations such as HIPAA. Emphasis was placed on transparency, ethical considerations, and the need for strict data management protocols in the evolving landscape of generative AI.
Managing Sensitive Data: Storage, Security, and Analytics in Cloud Platforms
We handle a large amount of sensitive first-party consumer data, which is effectively personally identifiable information (PII). This data is stored on cloud platforms, and analytics is the next step after storage; security policies are discussed before any analytics are performed. Currently we use Azure as our primary platform, but all of our first-party data from retailers is stored on Google Cloud. Each platform comes with its own regulations, security requirements, and cyber guidelines.
Depending on where your data is hosted and how it is treated, those regulations can vary by region. This is a natural progression as you move toward AI or generative AI, especially when using OpenAI models; in certain regions, the use of OpenAI services for modeling is restricted outright.
Within our internal team, we have built a council that includes our legal and security teams, which constantly monitors the regulations we need to apply for proper AI ethics and governance. Because we deal with a lot of sensitive and confidential data, it’s crucial to understand where this data is stored and how it is managed. All of our generative AI data runs strictly through a private-endpoint RAG setup, which keeps it compliant with our AI governance and ethics requirements while supporting all of our data classifications: public, internal, sensitive, and confidential.
– Aravind Peddiboyina, AI Innovation & Global Analytics Delivery Leader at Kimberly-Clark
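Peddiboyina’s description of a private-endpoint RAG that honors data classifications suggests a retrieval layer that filters by sensitivity before anything reaches the model. The sketch below is illustrative only, not Kimberly-Clark’s implementation: the classification levels mirror the ones he names, while the corpus, scoring logic, and function names are hypothetical stand-ins for an embedding service running behind a private endpoint.

```python
from dataclasses import dataclass
from enum import IntEnum

# Classification levels named in the quote, ordered least to most restricted.
class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2
    CONFIDENTIAL = 3

@dataclass
class Document:
    text: str
    classification: Classification

def retrieve(query: str, corpus: list[Document],
             clearance: Classification) -> list[Document]:
    """Return candidate context documents the caller is cleared to see.

    Classification filtering happens *before* relevance scoring, so
    documents above the caller's clearance never reach the model prompt.
    """
    allowed = [d for d in corpus if d.classification <= clearance]
    # Toy relevance score: term overlap. A real deployment would call an
    # embedding model served from a private endpoint instead.
    terms = set(query.lower().split())
    return sorted(allowed,
                  key=lambda d: len(terms & set(d.text.lower().split())),
                  reverse=True)[:3]

corpus = [
    Document("Public product FAQ content.", Classification.PUBLIC),
    Document("Internal retailer sales summary.", Classification.INTERNAL),
    Document("Consumer PII-linked purchase history.", Classification.CONFIDENTIAL),
]

# An internal analyst sees only PUBLIC and INTERNAL context.
for doc in retrieve("retailer sales", corpus, Classification.INTERNAL):
    print(doc.classification.name, "-", doc.text)
```

Filtering on classification before scoring means even an over-permissive prompt can never surface documents above the caller’s clearance, which is the property a governance council would want to audit.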
The Cornerstone of Data Usage Rights in AI and Sensitive Customer Data
One key aspect that carries over, not specific to just large language models or generative AI, is the usage rights obtained from customers. This has been in place for many years, especially when dealing with sensitive customer data. These usage rights become the cornerstone of how we can utilize that data. We often have to deal with customers who are very specific about the type of models we build and the functionality powered by their data. This gets very granular, which works well in a B2B scenario. However, when it comes to end customers, the approach becomes different, emphasizing transparency in data usage.
Transparency and visibility for consumers are paramount. The fear, to some extent, is rooted in how large language models (LLMs) can sometimes memorize training data. Even though a model is trained to be generic, if its training data contains any customer-specific information, that information could be inadvertently revealed during deployment.
In industries like finance and healthcare, where we deal with large clients rather than individuals, those clients are very aware of industry developments and the risks posed by nascent generative AI technologies. They are curious and want to understand exactly how the technology will be used. Transparency is challenging, however, because we often fine-tune large language models rather than build them from scratch; that element of the unknown on our side translates into their concerns.
When presenting a solution to individual customers outside of large enterprises, the environment is different. The primary thing we can do is offer transparency, because there is no explicit consent when someone interacts with a chatbot powered by large language models. Customers may choose to use it or not, without any contractual agreement, which impacts overall adoption.
– A Senior Director at a Healthcare Company
Strict Regulations and Ethical Considerations in Healthcare Data Usage
In healthcare, the US government implemented a very strict policy regarding the use of patient data back in 1996: HIPAA (the Health Insurance Portability and Accountability Act). Essentially, HIPAA mandates that patient data be handled with the utmost seriousness. Consequently, whenever we engage with contractors or vendors, a business associate agreement is always established to ensure they take their responsibility for handling data seriously. Researchers have to complete training administered by the Collaborative Institutional Training Initiative (CITI) and demonstrate a need to access specific patient data fields before they are granted access by an institutional review board (IRB).
The active use of data for research often involves identifying and then de-identifying protected information, a process that can be challenging with unstructured data. There are vendors, such as John Snow Labs, that assist with this. Serious consideration must still be given to ensuring that the privacy of human subjects is upheld and that data is handled properly: patient name, address, date of birth, phone, email, and other personally identifiable and protected health information are all subject to HIPAA requirements.
For regulatory purposes, there is a relatively new regulatory framework, dating from 2018, often referred to as “software as a medical device” (SaMD), which governs AI models in healthcare. Many of these SaMDs are radiology models, reflecting the high usage of AI in radiology today. The FDA requires approval for algorithms used in patient screening. However, if a provider uses the AI model internally for patient treatment without giving the patient direct access to the algorithm, FDA compliance is not enforced.
In healthcare, it is critical to follow strict regulations and extra guardrails whenever AI is being trained or used. This ensures ethical considerations for fairness, explainability and compliance with regulatory standards are upheld.
– Avijit Chatterjee, Head of AI/ML and NextGen Analytics at Memorial Sloan Kettering Cancer Center
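Chatterjee’s point about de-identifying unstructured data is concrete enough to sketch. The toy example below redacts a few of the HIPAA identifiers he lists (dates, phone, email, plus a hypothetical record-number pattern) with regular expressions. It is a minimal illustration, not a compliant pipeline: production de-identification relies on trained clinical NER models, such as the John Snow Labs tooling he mentions, because regexes alone cannot catch names, addresses, or context-dependent identifiers.

```python
import re

# Illustrative patterns for a few HIPAA identifiers; MRN is a hypothetical
# medical-record-number format added for the example.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b"),
}

def deidentify(note: str) -> str:
    """Replace each matched identifier with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

note = ("Pt seen 03/14/2024, MRN: 889214. "
        "Call 212-555-0142 or mail jdoe@example.com to follow up.")
print(deidentify(note))
# Pt seen [DATE], [MRN]. Call [PHONE] or mail [EMAIL] to follow up.
```

Typed placeholders (rather than deletion) preserve the structure of the note for downstream research use while removing the protected values themselves.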
Customer Consent and Data Usage in AI Models
Even though there isn’t much government regulation in the US regarding the use of customer data for training large language models and other AI systems, enterprise SaaS customers have become highly sensitive to data privacy. They are also realizing that their data is a gold mine. As a result, many of them insist that vendors obtain explicit consent before using it.
As an enterprise SaaS vendor collecting data, we must be very specific about what data we intend to use and obtain consent before using it. However, a concerning trend is emerging in which many customers decline to give any consent at all. This creates a catch-22: without access to the data, how do we build robust models? Everyone wants to use AI-powered systems, yet they are very hesitant to share their data, even for development projects. This is a challenge many in the industry are facing.
– Vinod Malhotra, Chief Technology Officer at Perceptyx
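Malhotra’s catch-22 is ultimately enforced at the data pipeline: records without purpose-specific consent must never reach a training job. Below is a minimal sketch of that gating, with hypothetical field names and consent purposes, not Perceptyx’s actual system.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerRecord:
    customer_id: str
    payload: dict
    # Hypothetical per-purpose consent flags captured at collection time,
    # e.g. {"analytics", "model_training"}.
    consents: set = field(default_factory=set)

def training_corpus(records, purpose="model_training"):
    """Yield only records whose owner consented to this specific purpose.

    Gating at the pipeline boundary means a withdrawn or missing consent
    simply removes the record from every downstream training job.
    """
    for record in records:
        if purpose in record.consents:
            yield record.payload

records = [
    CustomerRecord("acme", {"survey": "..."}, {"analytics", "model_training"}),
    CustomerRecord("globex", {"survey": "..."}, {"analytics"}),  # declined training use
]

print(sum(1 for _ in training_corpus(records)))  # 1 -> only acme's data is usable
```

Keeping consent as a first-class field on each record, rather than a side agreement, is what makes the granular, per-purpose B2B consent described earlier in the discussion enforceable in code.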