
Council Post: Data Privacy and Vendor Trust in the Age of Gen AI

As generative AI technologies like large language models (LLMs) become more prevalent, organizations across industries are grappling with how to leverage their potential while mitigating risks around data privacy, compliance, and intellectual property. The rise of powerful AI systems that can generate human-like text, code, and other content has opened up new opportunities but also raised critical concerns.

To explore strategies for responsibly adopting generative AI while protecting sensitive data and proprietary models, we convened a roundtable of distinguished experts and industry leaders. Our panelists included Sreenivas Gadhar, Vice President, Global Data and Analytics – Engineering and Delivery at Marsh McLennan; Arjun Srinivasan, Director – Data Science at Wesco; Elanchezhiyan (Elan) Panneerselvam, Sr. Director Data & Adv. Business Analytics at Ally; Anil Prasad, Head of Engineering – Software Products, Data and ML at Ambry Genetics; and Ashwin Sridhar, Chief Product & Technology Officer at Common Sense Media.

The key topics discussed included data sharing challenges, data minimization and anonymization approaches, cloud adoption considerations, integrating on-premises and cloud data assets, and balancing the opportunities and risks of generative AI deployments.


Addressing Data Sharing Challenges in Predictive Modeling

These data sharing challenges are not new. Concerns about how to share data have always existed. Should it be shared in encrypted form, or should it be masked? It’s not just about sharing data outside your organization; even within your organization, when moving data from production to lower environments, these challenges exist.

Some of the things we have done, on smaller datasets, include encrypting the data. Another option is stripping out any PII (Personally Identifiable Information) completely and anonymizing the data using different techniques.

When providing data for model execution, it is important to understand the purpose for which these models are built and how the models will use your data. This helps you share the appropriate data, and share it with confidence, without running into contractual, copyright, or privacy issues.

If you don’t share the right information needed for the model, the productivity of the models will be impacted. However, sharing information that isn’t relevant for the models is also counterproductive. 

– Sreenivas Gadhar, Vice President, Global Data and Analytics – Engineering and Delivery at Marsh McLennan
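The techniques Gadhar describes, stripping out PII and anonymizing what remains, can be sketched briefly. The snippet below is a minimal, hypothetical illustration (the field names, the salt, and the `anonymize` helper are our own, not anything the panelists described): direct identifiers are dropped, and the record key is replaced with a salted hash so rows can still be joined internally without exposing the original ID.

```python
import hashlib

# Hypothetical sketch: drop direct identifiers before sharing records
# with an external model. Field names are illustrative, not a standard.
PII_FIELDS = {"name", "email", "ssn", "phone"}

def anonymize(record: dict, salt: str = "org-secret-salt") -> dict:
    """Remove PII fields and pseudonymize the record key with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    # A stable pseudonymous ID lets cleaned rows still be joined internally.
    raw_id = str(record.get("customer_id", ""))
    cleaned["customer_id"] = hashlib.sha256((salt + raw_id).encode()).hexdigest()[:16]
    return cleaned

record = {"customer_id": 42, "name": "Jane Doe", "email": "jane@example.com",
          "tenure_years": 7, "claims": 2}
print(anonymize(record))
```

Note that salted hashing is pseudonymization, not full anonymization: anyone holding the salt can re-link records, so the salt must be protected, and quasi-identifiers left in the data can still allow re-identification.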

Strategies for Data Minimization and Anonymization in Predictive Modeling

Implementing data minimization and anonymization is essential. For example, if you’re doing predictive modeling for an HR function, it is crucial to strip away all unnecessary information before feeding your data into a model. This process exemplifies data minimization and anonymization. If the data contains Personally Identifiable Information (PII) or Protected Health Information (PHI), exclude it if the model only requires the data for prediction or analysis to support decision-making. Focus on providing the minimum viable data needed for the vendor, especially given current regulatory restrictions. Anonymization is critical, and using synthetic data can also be an effective approach.

On synthetic data for anonymization: you can generate representative data and use it for your predictive modeling needs (after removing any potentially re-identifiable information) in sensitive applications like healthcare analytics and financial risk assessment. This is even more critical if you’re using a vendor for your modeling needs or managed services for your GenAI needs.

– Arjun Srinivasan, Director – Data Science at Wesco
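One simple way to read Srinivasan's synthetic-data suggestion is to fit summary statistics on the real records and sample new rows from them, so a vendor-hosted model never sees the originals. The sketch below is a deliberately naive illustration with made-up column names, not a recommendation of a specific tool: it preserves only per-column mean and spread.

```python
import random
import statistics

# Hypothetical example: the real dataset stays in-house; only synthetic
# rows with similar per-column statistics would be shared externally.
real_rows = [
    {"age": 34, "annual_claims": 2}, {"age": 51, "annual_claims": 0},
    {"age": 29, "annual_claims": 1}, {"age": 45, "annual_claims": 3},
]

def synthesize(rows, n, seed=0):
    """Sample n synthetic rows from per-column normal fits of the real data."""
    rng = random.Random(seed)
    fits = {c: (statistics.mean(r[c] for r in rows),
                statistics.stdev(r[c] for r in rows)) for c in rows[0]}
    return [{c: round(rng.gauss(mu, sigma), 1) for c, (mu, sigma) in fits.items()}
            for _ in range(n)]

print(synthesize(real_rows, 3))
```

Because each column is sampled independently, this sketch discards correlations between fields; production synthetic-data tools model the joint distribution and add formal re-identification checks, which is what makes them suitable for the sensitive domains mentioned above.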

Addressing Cloud Adoption and Data Privacy Concerns in the Financial Sector

Everybody wants to know how we’re going to proceed. If you think about when cloud technology first emerged, banks were not interested at all. I remember in 2010 or 2009 when cloud technology came out, I was working at a big bank, and in an executive meeting, they said there was zero chance of adopting it. But if you look at it now, everything is on the cloud. It took us four to five years just to consider whether to adopt it.

I’m seeing a similar trend, especially in the banking sector and even the insurance and financial sectors. Data privacy is the biggest concern. How are we going to protect consumers’ data? There are a lot of regulations out there, but I’m not sure we have anything specific for AI yet. The biggest concern is data privacy, followed by intellectual property.

Every company has its own models, especially in risk management and consumer underwriting, and they don’t share those. With the emergence of large language models (LLMs), we don’t know how things will evolve. Now, every company wants to build its own AI models internally and work with vendors, but nobody is ready to fully commit yet. They’re still just thinking about summarizing the data.

– Elanchezhiyan (Elan) Panneerselvam, Sr. Director Data & Adv. Business Analytics at Ally 

Integrating On-Prem and Cloud Data: Compliance and Implementation Strategies

There is a good amount of PII and generative data on-premises, and then there is also the workload or data load in the cloud. So, how do you bring both together as a single asset and aggregate the information when you build a model? That’s one part. I’ll come to the third part where I talk about implementation strategy.

The first part is about how we navigate and make the data available, whether it’s sitting in the cloud or on-premises. The second part broadly covers compliance, which includes legal and security aspects. When we implemented LLMs at my previous company, we did not have an AI governance policy. So, what happens if there’s a breach or data leak when people use LLMs or AI? We had our own implementation of generative AI and enterprise solutions.

Without a written policy, managing the risk of a data leak was a significant concern. There were also questions from security, ethics, and legal perspectives about how to protect customers’ data, patient data, medical data, and financial data. Compliance with regulations like RPCA is essential, but it doesn’t mean we are fully certified to use client data for leveraging generative capabilities. The third part involves the overall implementation, which comes with many challenges.


– Anil Prasad, Head of Engineering – Software Products, Data and ML at Ambry Genetics 

Balancing Opportunities and Risks of Generative AI in Education and Publishing

Much of our work involves providing information through our website. This includes media ratings and reviews, as well as in-class curriculum for students (focused on digital citizenship and AI literacy) and professional development for educators.

The main challenge with generative AI in its current form is its lack of inherent intelligence; it only appears smart because it’s trained on vast amounts of data, creating a misleading perception of accuracy. Since we can’t be sure how our data, once ingested by AI bots, will be used, we’ve opted to make our content inaccessible to these bots for now.

For instance, we see potential in leveraging our structured and unstructured data, such as the reports and analyses we’ve produced over the years, to deliver insights using generative AI. We’ve experimented with this approach and, while the results were generally good, there were edge cases where the ‘insights’ generated by AI weren’t statistically sound. Relying on such information can be problematic.

In education, I believe generative AI can be a valuable tool for educators. Through our work with schools, we’ve identified significant potential for AI to enhance educator efficiency and support. However, I don’t think generative AI is ready for in-class use yet. Students might struggle to distinguish right from wrong, as not all generative AI output is correct all the time. Generative AI output is best suited for discerning individuals who can think critically about the information presented to them. Unfortunately, that’s not the mindset most students are in when learning, and as such, AI could reinforce incorrect answers. While I think we will eventually reach a point where AI is reliable for classroom use, we’re not there yet.

– Ashwin Sridhar, Chief Product & Technology Officer at Common Sense Media

Disclaimer: All views expressed by the leaders are personal and should not be attributed to their employers.

Mansi Singh
Mansi is an Associate Content Strategist. Her interests center on the use of Gen AI in enhancing daily life, and she is dedicated to exploring the latest trends and tools in AI. She can be reached at mansi.singh@aimresearch.co