In the dynamic landscape of Gen AI, the role of data management stands as a linchpin, dictating the success and scalability of applications in their nascent stages. As organizations grapple with diverse data formats, privacy concerns, and the quest for quality data, the framework for an effective data management system becomes paramount.
To give us more insights on this, AIM Research interviewed Willem Koenders, a seasoned Global Leader in Data Strategy at ZS Associates. Willem brings over 10 years of experience advising top-tier organizations on leveraging data for a competitive edge. Certified in AWS, GCP, and DAMA-DMBOK2, he has advised clients across various regions, including Europe, Asia, the United States, and Latin America. At ZS Associates, he champions “data governance by design” in driving data-driven transformations. Previously, Willem held key roles at Deloitte, leading country operations in Nicaragua and Honduras and serving as Data Strategy & Data Management Capability Lead for Spanish LATAM. His extensive consulting background also includes five years at Monitor Deloitte, offering strategic insights into Corporate & Business Unit Strategy, Customer & Marketing Strategy, and (Digital) Innovation Strategy. Earlier experiences at the Dutch Ministry of Finance and Van Lanschot Bankiers highlight his versatile expertise.
In this discussion, we delve into the intricacies of data management in the realm of Gen AI, exploring the challenges, methodologies, and future trends that shape the foundation for robust and scalable applications in the ever-evolving world of artificial intelligence.
AIM: With your considerable experience in the data science industry, how do you perceive its evolution toward achieving AGI? Additionally, given your emphasis on data management, how has its role evolved in the building of this technology?
Willem Koenders: Generative AI, in various ways, represents the next wave of technology. While future iterations may introduce innovations beyond text-based applications, such as the visual interpretation of images, the underlying reliance on data remains a constant. However, there is an elevated risk in Generative AI compared to predictive quantitative models. A predictive model may simply yield an incorrect prediction, but Generative AI has an inherent tendency to make its responses appear elegant, which can mask underlying biases or incompleteness in the data, and that is not an ideal foundation for these models.
The complexity of Generative AI extends to its data requirements, which often call for large, diverse, and high-quality datasets. Diversity is particularly important: a biased dataset can skew the model’s outcomes toward the characteristics of specific demographics.
Despite the rapid adoption of Generative AI for everyday tasks like generating ideas or drafting emails, companies face challenges in taking foundation models and training them further. The primary struggle lies not in the training process itself but in acquiring the right data from appropriate sources. Privacy and security concerns further compound the complexity of sourcing that data.
Moreover, the computational and storage power required to train Generative AI models is substantial. Generating a single image, for instance, demands energy equivalent to charging a phone completely. Notably, initiatives like Google AI’s work require electricity on a scale capable of powering an entire country like Ireland. These considerations underscore the significant computational demands associated with supporting generative AI use cases.
In recent months, as companies have recognized these challenges, there has been a tendency to make hasty promises. In many instances, we now see seasoned data governance professionals reiterating their earlier stance: “We told you so, you need the expertise in handling data.” It highlights the critical role data governance professionals play in addressing the challenges of Generative AI implementation.
AIM: In building Gen AI capabilities, data availability and client skepticism about data sharing are challenges. How do you address these concerns, and what practical solutions have you implemented to foster innovation when clients are reluctant to share data?
Willem Koenders: There’s no quick way around any of this, and in that respect there are clear similarities between Generative AI and other AI or data-driven applications and use cases. Still, if I think about it now, three things come to mind.
One is to proactively address the privacy and security component. Two is to buy the data if, depending on what you’re trying to do, it is already collected and managed by someone else. And three is to use synthetic data.
There are a lot of cases where you simply need the data. If, for example, you want to personalize the communication that goes back in an email or through the call centre, you’re going to have to get access to some bits of information that help you do that, and there are many other use cases like this.
You can’t just ignore the issue, do it anyway, and hide behind long, scroll-through policies and standards that people don’t trust anymore. It only has to go wrong once for you to suffer reputational damage that won’t go away for years. So be proactive here: have transparent policies, clear and short ones rather than very detailed, legal-sounding ones, figure out your data security measures, and be proactive in your messaging. I don’t think it works to ignore it or deflect to other things, but you can clearly communicate why you need this data and how it would benefit people, and then they can make their own considerations about whether it’s helpful or not. I’ve dealt for years with these generic chatbots, and I can’t stand them; I’d love for them to have some of my data so I could just shortcut the whole thing.
The second option was to buy the data. Depending on what you’re trying to do, there is an enormous amount of data that other parties gather and sell: geolocations, behaviours and other types of data. Instead of trying to collect it from your customers or clients, consider whether you can use one of those external datasets to train your model.
The final one, which I think is the most interesting one specifically for Generative AI because I have used it myself, is synthetic data. All you really need to train your generative AI model or your large language model is data that is representative of what it could look like in real life.
We built a solution ourselves to assess the maturity of companies in their data capabilities, which to a very large extent depends on deep-dive interviews. Instead of going back to companies we had worked with and using their private information, which is hypersensitive because of data security and regulatory concerns, we used Generative AI, somewhat ironically, to create synthetic interview data, including grammar mistakes and other imperfections, as thousands of inputs that we then used to train a separate, independent LLM. So there are at least three ways to be creative and address these privacy, security and comfort-level considerations.
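As a rough illustration of that kind of synthetic-data approach (a minimal sketch, not ZS’s actual pipeline), the snippet below uses the OpenAI Python SDK to generate deliberately imperfect, transcript-like answers; the model name, topics, prompt wording, and maturity scale are all assumptions made for the example.

```python
# Illustrative sketch: generating synthetic "deep-dive interview" answers to train a
# downstream model, instead of using sensitive client transcripts.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name, topics, and prompt are hypothetical choices for this example.
from openai import OpenAI

client = OpenAI()

TOPICS = ["data quality", "metadata management", "data architecture"]  # example capability areas

def synthetic_interview_answer(topic: str, maturity_level: int) -> str:
    """Ask the model for one realistic, imperfect interview answer."""
    prompt = (
        f"Write a 3-5 sentence answer from a fictional employee describing how their "
        f"company handles {topic}. The company's maturity is {maturity_level} on a 1-5 scale. "
        f"Make it sound like a real transcript: informal, with a minor grammar mistake."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model would do
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # higher temperature for more varied synthetic samples
    )
    return response.choices[0].message.content

# Generate a small labeled corpus (in practice this would be thousands of samples).
corpus = [
    {"topic": t, "label": level, "text": synthetic_interview_answer(t, level)}
    for t in TOPICS
    for level in range(1, 6)
]
```

Because the labels are assigned at generation time, the synthetic corpus arrives already annotated, which is part of what makes this route attractive when real transcripts are too sensitive to use.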
AIM: When buying or generating synthetic data for Gen AI applications, does it diminish the customization for a company’s specific use case? What are the consequences and challenges of using alternatives rather than company-specific data for customization?
Willem Koenders: I don’t think there’s a general answer. If you have a highly specific product that only you sell, you need your own client interactions and you need to be able to train on them; there are a couple of examples out there where there is simply no alternative. One that isn’t quite in the same space, but if you think about how Tesla is training its cars, they can’t get that visual data from anywhere else. It doesn’t work; there isn’t anything else.
But I think you absolutely could use the example I just mentioned of the solution we built. With synthetic data you can ensure it is not biased. If we had based it on only five, six, or seven client experiences, that would still be biased, and it might still come from one sector, but now we are able to create a much larger fact base to work from. It depends on the use case, though. I don’t believe there is one answer that covers them all.
AIM: Have you developed a framework to guide your team in defining a healthy data management system, including the pillars and key characteristics of a robust data architecture?
Willem Koenders: It’s a timely question and a personal passion of mine. I’ve worked for over a decade building frameworks, approaches, and methodologies to execute data maturity assessments. On one hand, there are capability areas like data architecture, data quality, metadata management, storage, and operation — all components that make up data management. On the other side, for each of these, there are dimensions I typically use: strategy, people, process, technology, and adoption.
DCAM and DAMA are examples available in the marketplace. At ZS, we built one specific to Generative AI, interpreting each capability area for it. This framework delves into what’s needed to activate Generative AI appropriately in a company. You can use ours or build your own. Either way, you can create a checklist against the framework with two categories of capabilities: enterprise and use case-specific.
Enterprise capabilities are needed generally, not for each use case. Examples include roles and responsibilities of the COE, foundation model access, and platform capacity. Use case-specific capabilities, on the other hand, are tailored for specific scenarios. For instance, Bloomberg trained a foundation model specifically on financial data, while a copilot tool for employees might be used out of the box.
Then there are a few purely use case-specific things, such as the modeling experience and data considerations like volume, diversity, annotation, historical data, and data quality capabilities. These can be captured in a framework of 8 to 15 capability areas, producing a report or checklist. If all of these are green, then you’re good to go and can build your Generative AI use cases. That’s how we’ve approached it.
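To make the checklist idea concrete, here is a minimal sketch of what such a readiness report could look like in code; the capability names, categories, and statuses are invented placeholders, not ZS’s actual framework.

```python
# Minimal sketch of a Gen AI readiness checklist: capability areas are grouped into
# enterprise-level and use-case-specific categories, and a use case is "good to go"
# only when every relevant area is green. Names and statuses are illustrative.
from dataclasses import dataclass

@dataclass
class Capability:
    name: str
    category: str  # "enterprise" or "use_case"
    status: str    # "green", "amber", or "red"

checklist = [
    Capability("CoE roles and responsibilities", "enterprise", "green"),
    Capability("Foundation model access", "enterprise", "green"),
    Capability("Platform capacity", "enterprise", "amber"),
    Capability("Data volume and diversity", "use_case", "green"),
    Capability("Annotation and historical data", "use_case", "red"),
    Capability("Data quality capabilities", "use_case", "green"),
]

def ready_to_build(capabilities: list[Capability]) -> bool:
    """The use case is ready only if every capability area is green."""
    return all(c.status == "green" for c in capabilities)

blockers = [c.name for c in checklist if c.status != "green"]
print("Ready:", ready_to_build(checklist), "| Blockers:", blockers)
```

The same structure works whether you track 8 or 15 capability areas; only the entries in the checklist change.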
AIM: Is the quality of data a significant challenge in building a healthy data management system for Gen AI applications, even with a well-designed architecture and system in place? Do you believe that in today’s world, there is readily available data that meets the requirements for constructing high-quality Gen AI applications?
Willem Koenders: Firstly, it’s no different from any other process, not just AI or data-driven applications but any sort of business process. You need to start by sitting down and figuring out what you’re trying to do: what kind of data, structured or unstructured, would I need to be able to rely on to train my model? Then take very traditional data quality dimensions like completeness, validity, accuracy, and timeliness, and define them as best you can in business rules or other requirements that you need to be able to rely on. Ideally, you can then take the data and quantitatively check it against those rules, and that step, in my view, is no different than it would be for pretty much any other business process or AI model.
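As a hedged illustration of what those quantitative checks could look like for a tabular training dataset, here is a small pandas sketch; the column names and thresholds are assumptions made for the example, not rules from the interview.

```python
# Minimal sketch: checking a training dataset against classic data quality dimensions
# (completeness, validity, timeliness) expressed as simple business rules.
# Column names ("customer_id", "email", "updated_at") and thresholds are assumptions.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    now = pd.Timestamp.now(tz="UTC")
    return {
        # Completeness: share of non-null values in a required column
        "customer_id_completeness": 1.0 - df["customer_id"].isna().mean(),
        # Validity: share of emails matching a simple pattern
        "email_validity": df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean(),
        # Timeliness: share of records updated within the last 90 days
        "timeliness_90d": (now - pd.to_datetime(df["updated_at"], utc=True) < pd.Timedelta(days=90)).mean(),
    }

def passes(report: dict, thresholds: dict) -> bool:
    """Only feed the data into training if the rules defined up front are met."""
    return all(report[k] >= thresholds[k] for k in thresholds)

# Example thresholds, defined as business rules before any training happens.
thresholds = {"customer_id_completeness": 0.99, "email_validity": 0.95, "timeliness_90d": 0.8}
```

The point is simply that the rules are written down first and checked quantitatively against the source data, exactly as you would for any other business process.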
It’s about using the same data quality framework you hopefully already have. Do you have the right policies and the right standards? Do you have a framework? Do you have roles and responsibilities? If you have a model, a process, or a transformational project that uses Gen AI, it just needs to trigger that data quality consideration; you may simply need someone who comes in and double-checks that you thought about it. You could still measure it in a dashboard. Do you have to? I don’t know. But whatever you do, do it close to the source, because especially with Gen AI it might be very tempting to take the data and, not fudge it exactly, but almost run an LLM application over it to clean it up and then train on that. Don’t do that. Keep quality at the source so you have a reliable foundation for your model.
And then maybe one final thing that comes to mind, which I do think is different for Generative AI than for other applications: measure not just the inputs into the model but also some part of the outputs, the actual responses it gives. Where you see the quality of the responses dipping, figure out why. Is it because the model didn’t have certain data? Is it because it looked at the wrong thing? There is a feedback loop here where the outputs should feed back into your input data.
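A minimal sketch of such a feedback loop, under the assumption that each response can be scored and logged together with the input data it was grounded on; the scoring heuristic, threshold, and field names below are placeholders, not a prescribed method.

```python
# Illustrative feedback loop: score each response and keep the retrieved context
# alongside low-scoring answers, so quality dips can be traced back to the source data.
import json, time

QUALITY_THRESHOLD = 0.7  # assumed cutoff for flagging a response

def score_response(question: str, response: str) -> float:
    """Placeholder heuristic; in practice use human ratings or an LLM-as-judge."""
    if not response.strip() or "i don't know" in response.lower():
        return 0.0
    return min(1.0, len(response) / 400)  # crude proxy: fuller answers score higher

def log_interaction(question: str, response: str, retrieved_context: list[str],
                    log_path: str = "genai_quality_log.jsonl") -> None:
    score = score_response(question, response)
    record = {
        "ts": time.time(),
        "question": question,
        "score": score,
        # Keeping the context lets reviewers see whether the dip came from missing
        # or wrong source data rather than from the model itself.
        "retrieved_context": retrieved_context,
        "flagged": score < QUALITY_THRESHOLD,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Reviewing the flagged records periodically closes the loop: gaps found in the outputs become concrete fixes to the input data.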
AIM: In terms of the variety of data for Gen AI, is it mainly about different data formats coming together to generate content? If so, what methodologies and challenges are involved in streamlining this process, and what key areas should be adopted to enable seamless integration of diverse data formats for Gen AI applications?
Willem Koenders: Generative AI, in its diverse flavours, generally tends to require access to various types of data. For instance, in a call centre use case, historical information, product details, and possibly individual-specific data must seamlessly integrate based on the specific scenario. This integration may require real-time access or retrieval on the spot, which is a classic challenge for organizations, especially those beyond a minimal size.
In this domain, I acknowledge the complexity and understand why it is challenging. There is also a notable influx of companies claiming to have the perfect solution. Personally, I receive, on average, at least two inquiries per week from such companies in the data platform and integration space, asserting they’ve addressed the challenge. While many offer promising solutions, my skepticism persists, as their effectiveness often shines in demos but proves challenging in practical implementation, especially when interfacing with databases like Oracle or storage systems like S3.
However, what intrigues me is the intersection of data and AI. There are instances where AI can contribute to data integration challenges. There are promising trends where AI-powered data integration tools aid in scanning, discovering metadata, and interpreting and understanding data. While these solutions are not plug-and-play, they do work, and giving them time for further development might yield more reliable implementations.
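As a small, hedged example of the kind of AI-assisted metadata discovery described above (the table, column names, prompt, and model are invented for illustration), one could ask an LLM to draft column descriptions from a handful of sample rows as a starting point for a catalog entry.

```python
# Illustrative sketch: using an LLM to propose human-readable column descriptions
# from a few sample rows, as draft input for a data catalog entry.
# Assumes the OpenAI Python SDK; model name, table, and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

def draft_column_descriptions(table_name: str, columns: list[str], sample_rows: list[dict]) -> str:
    prompt = (
        f"Table '{table_name}' has columns {columns}. Here are sample rows: {sample_rows}. "
        "For each column, propose a one-line business description and a likely data type. "
        "Flag any column whose meaning is unclear so a data steward can review it."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # drafts to be reviewed, not auto-published

# Example call with invented data:
# draft_column_descriptions("crm_contacts", ["cust_id", "seg_cd", "lst_ord_dt"],
#                           [{"cust_id": 101, "seg_cd": "B2", "lst_ord_dt": "2023-11-02"}])
```

The drafts still need a data steward’s review, which matches the point that these tools are promising but not yet plug-and-play.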
Despite my skepticism, I find excitement in the evolving landscape. As someone passionate about data governance, the prospect of simplifying tasks that have historically been challenging, such as digging through ETL scripts and meeting recordings for definitions, is intriguing. While some aspects still need refinement, the progress in AI-powered data integration tools holds promise for the future.
AIM: Looking ahead, given the current early stages of Gen AI and the continued significance of data, what future trends do you foresee in the evolution of data management systems? How do you anticipate these trends aligning with the development of Gen AI and overall advancements in data-driven technologies?
Willem Koenders: The trends related to data management for or to support Generative AI and its activation bring several considerations to mind. Firstly, if we reflect on when the data governance discipline matured significantly, it can be traced back to regulatory requirements, such as in banking with liquidity reporting. A similar trajectory is anticipated for Generative AI as regulations come into play. In Europe, the EU AI Act exemplifies this, imposing safeguards and bans on certain AI applications, prompting companies to showcase their data management practices to comply with regulations. This regulatory landscape, though not immediate, will evolve over time, with fines for non-compliance driving an uptick in regulatory and ethical considerations.
The second aspect involves potential frustrations for companies as data limitations persist in hindering progress with Generative AI use cases. There’s no quick solution to overcome these challenges, and frustrations may arise among business leaders attempting to navigate this complex terrain. However, there’s an opportunity to observe startups and smaller companies that find innovative ways to train models and build products and services without being encumbered by legacy data issues, possibly leveraging synthetic data for rapid scaling.
As previously discussed, another significant trend is not just about data for Generative AI but how Generative AI can assist in data management. This includes tasks like metadata creation, analyzing emails and transcripts, and distilling insights, marking a trend to closely monitor for its potential impact on the broader field.
Additionally, in certain industries, such as European banking, evolving regulations demand a shift in how data is handled. Open banking standards require banks to share some data with third parties. This opens opportunities for startups to capitalize on these changes, moving quickly, getting it right, and scaling efficiently. Keeping an eye on such developments can be crucial for staying ahead in this dynamic landscape.