
Data Quality vs Quantity issue with ML and AI applications with Arvind Balasundaram

As AI and Machine Learning gain momentum, the spotlight on data quality versus quantity intensifies.

In this week’s CDO Insight series, we speak with Arvind Balasundaram, a prominent figure in commercial insights and analytics with over twenty years of experience. His extensive career has seen him specialize in global customer insights, strategic analytics, and market research. Renowned for his adept use of data and analytics, he fuels actionable insights and strategic decisions within the pharmaceutical sector.

For more than five years, Arvind has been at the helm of Regeneron Pharmaceuticals’ Commercial Insights & Analytics division, playing a pivotal role in the company’s expansion and success. Prior to joining Regeneron, he held distinguished positions, including Head of Global Customer Insights at Sanofi and Senior Director at Johnson & Johnson.

This interview sheds light on the pivotal role of data quality versus quantity in AI and ML. It examines the challenge of persuading business leaders to prioritize quality over quantity, and the delicate balance between AI’s effectiveness and its transparency. Drawing on Arvind’s expertise in Commercial Insights & Analytics, it underscores the link between high-quality data and successful AI outcomes, and explores strategies for advocating data quality despite budget constraints, with a focus on the long-term benefits of improved AI performance and efficiency gains.

AIM: Why is there a growing talk about data quality in AI and Machine Learning projects? How does this relate to changes in technology and business? What role does data quality play in today’s world of Generative AI?

Arvind Balasundaram: First, thanks to AIM for hosting a discussion around this topic. I should add that although most of my thoughts here pertain to AI and ML projects in general, they have specific relevance for analytics applications in a business context.

As ML and AI applications become more mainstream in their use, and are implemented at scale, there is a growing need to address “feasibility” in addition to model explainability and precision. A very recent article in The Economist highlighted this issue by pinpointing that the “bigger-is-better” approach to AI is running out of road, and that the notion of “better” in the forward evolution of ML and AI will need to rely on doing more with less. 

Most of the attention in early AI/ML development focused singularly on engineering elements like model structure and performance. As these applications entered the real world, much of that attention (especially with Generative AI) has expanded to include a closer look at the explainability of these models and how to minimize issues such as bias and the incidence of hallucinations. The issue of feasibility has largely been an afterthought until now.

As businesses evaluate the degree to which ML and AI initiatives should form part of what I will call their “corpus of future capabilities at scale”, they are increasingly demanding inclusion of cost-return tradeoffs in addition to business risk. This is not unexpected – all technologies go through this cycle as they migrate from earlier innovation stages (idea generation, early championing, model/platform building) to later ones (evaluation in the real world, filtering against objectives, resource assessment and deployment). For example, as business analytics applications mature to multi-layer deep learning and neural network applications, or even multimodal GAN and transformer models, data requirements are also becoming an issue for practitioners to contend with. Budgeting for such capabilities at scale is demanding more from practitioners to better articulate investment rationale from a value perspective. This is easier said than done in ML/AI today.

These demands will only continue to grow with Generative AI implementation. Several features of this capability have greatly increased its consideration in the corpus for a variety of business applications: (i) the potential to generate new artifacts without being constrained to repeating characteristics recognized only in the training data inputs; (ii) the availability of a contextualization logic to generate entirely new content; and (iii) the possibility of considering multiple modes of output generation (video, audio, and rich media in addition to numbers and text). But the relationship between the quantity of input data required (which sources are “necessary” versus “nice-to-have”) and the quality of the output eventually generated remains murky and ill-defined. This must change going forward for more accelerated business adoption and scaled implementation.

AIM: Instead of just focusing on AI results, how can we highlight the importance of having good-quality data as input? In what ways can we explain this to business leaders to justify investing in better data rather than just more data?

Arvind Balasundaram: This is a very important issue, and one that most business leaders are familiar with. Simply put, in every business context where a funding case needs to be made, investments (and AI investments at scale will be no exception) will be viewed as a “constrained optimization” problem in the organization. A persuasive case must definitively address: (a) how the enterprise can realize maximum return on that investment, and (b) why any proposed investment makes more sense than more traditional alternatives in the investment basket. This is the appropriate business context against which AI investment rationale will be considered. Consequently, highlighting the acquisition of the quality data inputs needed to optimize output quality and deliver greater business impact will need to become more front and center.

All AI models (Generative AI or otherwise) will be considered in this way, especially with respect to the decision to scale. Foundation models, for example, have extensive data requirements, and the required data is not always open source. In most industries, healthcare for example, source data is not getting any cheaper. In addition, many of these sources require linkages, or tokenization, which only increases the cost of preparing the input layer. The “bigger is better” approach, with its exclusive focus on input volume and the precision of AI-generated results, does not address the feasibility criteria articulated above and will not be persuasive for justifying AI-at-scale investments.

Instead, a clear calculus that speaks more directly to the relationship between AI inputs and outputs will help frame this conversation for more widespread adoption at scale. These models are still largely viewed as “black box” engines by decision-makers, where more data ingestion is seen as the only way to improve model performance. Practitioners need to reframe this approach so it also incorporates what they can live without. This will require more rigorous benchmarking of how frequently certain data sources contribute incrementally to output quality and inference in repeated applications. Additionally, a way to directionally document the specific contribution of individual data sources to the output generation process would go a long way. For example, do some specific data sources differentially propel some unique facet of eventual model learning? Or can data elements from some other data layer be credited with helping further refine learned input via a superior backpropagation process? This is akin to what explainable AI approaches are trying to do in supplying evidence for output accuracy, except here the focus would be more on feasibility. I realize this is very difficult with AI pattern recognition and remains aspirational, but even some directional proxies (via benchmarking) for guidance would be useful.
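As a rough illustration of the kind of benchmarking described here, the sketch below assumes a tabular modeling setup, a generic scikit-learn classifier, and hypothetical source names, none of which come from the interview itself: withhold one data source at a time, refit, and record how much cross-validated quality drops over repeated runs.

```python
# Hypothetical leave-one-source-out benchmark: a rough proxy for how much each
# data source contributes incrementally to output quality (all names illustrative).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def quality(X: pd.DataFrame, y, n_runs: int = 5) -> float:
    """Average cross-validated quality over several shuffled splits."""
    scores = []
    for run in range(n_runs):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
        model = RandomForestClassifier(n_estimators=200, random_state=run)
        scores.append(cross_val_score(model, X, y, cv=cv).mean())
    return float(np.mean(scores))

def source_contributions(frames: dict, y) -> dict:
    """frames maps a source name (e.g. 'claims', 'crm') to its feature columns."""
    baseline = quality(pd.concat(list(frames.values()), axis=1), y)
    drops = {}
    for name in frames:
        reduced = pd.concat([f for k, f in frames.items() if k != name], axis=1)
        drops[name] = baseline - quality(reduced, y)  # quality lost without this source
    return drops
```

Sources whose removal barely moves the score across repeated runs become candidates for the “nice-to-have” pile, which is the kind of feasibility evidence being asked for above.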

AIM: How can we ensure that AI’s pattern recognition, which can be hard to understand, doesn’t create transparency issues with the data it uses? Are there ways to make AI more understandable without making it less effective? How does that relate to the quality of the data?

Arvind Balasundaram: This is a central mandate of explainable AI, specifically exploring a way to always deliver outputs that users can understand and interpret. This motivates the same connectivity of the process and its layers to the outputs delivered, as I was alluding to previously. An important principle in explainable AI is the principle of knowledge limits. Essentially, any AI system should only operate under conditions for which it was designed and resist supplying outputs when there is a lack of confidence in the result. This is easier said than done in generative AI applications, since the conditions would need to hold for the training data as well as the contextualization created for enabling newly generated outputs.
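As a minimal sketch of the knowledge-limits principle, assuming a generic classifier that exposes predicted probabilities and an arbitrary confidence floor (both are assumptions made for illustration, not anything prescribed in the interview), the system declines to answer rather than returning a low-confidence result:

```python
# Illustrative "knowledge limits" guard: abstain instead of answering when the
# model's confidence in its own result is low (the threshold is an assumption).
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GuardedPrediction:
    label: Optional[int]    # None means the system declined to answer
    confidence: float

def predict_with_knowledge_limits(model, x, min_confidence: float = 0.8) -> GuardedPrediction:
    """model is any fitted estimator with predict_proba; x is a single 2-D row."""
    probs = model.predict_proba(x)[0]
    confidence = float(np.max(probs))
    if confidence < min_confidence:
        return GuardedPrediction(label=None, confidence=confidence)  # resist supplying an output
    return GuardedPrediction(label=int(np.argmax(probs)), confidence=confidence)
```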

However, the underlying idea here of restricting the process (the inputs that are fed, the layers used in the process, etc.) is refreshing, as is the introduction of confidence as an evaluative metric for the result. Herein lies a possible avenue for the important question you are asking: making AI more understandable without making it less effective. It is directly related to the data quality issue as well. How so? If the practitioner’s objective is to resist supplying nonsensical or hallucinatory outputs, one must start by reining in sourced input. Additionally, the only way to maintain the reins on a process where the practitioner has limited line of sight into a network’s hidden layers might be to filter data inputs on a quality basis. Yes, this reduces the precision of the results, especially for generative applications; but is this the tradeoff that AI needs for creating transparency without curbing effectiveness: slightly less precision in exchange for a more reined-in result-generating apparatus, defined as one where the quality of the source contributes in some measure to the understanding and interpretability of the result? For example, if one of the tenets driving input data quality is the application of data minimization filters to verify that input data sources are largely compliant with business requirements and privacy safeguards, this will help address some data transparency issues. Of course, it would also be nice to have some algorithmic transparency that addresses the purpose, structure, and action of the AI layers used, but this is easier said than done.
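To make the data minimization idea concrete, here is a small sketch assuming a hypothetical allow-list of business-approved fields and a deny-list of direct identifiers (both placeholders, not any real schema): anything outside the approved set is dropped, with an audit trail, before the data ever reaches a training pipeline.

```python
# Hypothetical data minimization filter: keep only fields approved for the business
# purpose and strip direct identifiers before data reaches the training layer.
import pandas as pd

APPROVED_FIELDS = {"age_band", "region", "therapy_area", "visit_month"}  # illustrative allow-list
DIRECT_IDENTIFIERS = {"patient_name", "email", "ssn", "phone"}           # illustrative deny-list

def minimize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy restricted to approved, non-identifying columns."""
    keep = [c for c in df.columns if c in APPROVED_FIELDS and c not in DIRECT_IDENTIFIERS]
    dropped = sorted(set(df.columns) - set(keep))
    if dropped:
        print(f"Data minimization dropped: {dropped}")  # audit trail for transparency
    return df[keep].copy()
```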

AIM: With your background in Commercial Insights & Analytics, how can companies show decision-makers the connection between high-quality data and successful AI outcomes?

Arvind Balasundaram: The idea of somehow connecting the contribution of an input feature to the performance or robustness of the realized output is not a new one in the Insights & Analytics discipline. This idea has been extant even in the small-data world, where, for example, it was incumbent on a practitioner to somehow relay how much a specific input or combination of inputs contributes to the output delivered. Take, for example, the role of the principal components in a classic multivariate analytics application. The intent there is to characterize the many dimensions or features that might be present in each observation (and therefore in the collective set of observations) in a large multivariable dataset: essentially, which directions or combinations in the available data maximally explain the variance in that data. Principal components are usually estimated using an iterative algorithm whose performance is gauged against some convergence criterion. The setup works reasonably well for exploring data, predicting data, and especially for reducing dimensionality in data.
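As a quick, self-contained illustration of that PCA intuition (run on synthetic data purely for exposition, unconnected to any dataset discussed here), each component’s explained-variance ratio quantifies how much of the data’s variation that direction captures:

```python
# PCA on synthetic data: how much variance each principal direction explains.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]  # build in some correlated structure

pca = PCA(n_components=5)
pca.fit(StandardScaler().fit_transform(X))
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.1%} of total variance")
```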

I bring up the principal component analysis (PCA) reference more as an analogy here than as a concept we should literally translate for AI decision enablement. It would be interesting to explore whether some such construct could be developed for AI-based processes and model structures. It would obviously lack the statistical rigor of PCA, but the intention to connect the input to the output using some target criterion would be preserved. For example, is there a way to examine the resilience of a data feature, or combination of features, in driving output quality over repeated applications? Alternatively, is there a way to characterize patterns that repeat more often than others, and determine which data sources might contribute most to pattern dimensionality?
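One rough proxy for that resilience idea, sketched under assumed choices (synthetic tabular data, a generic model, and scikit-learn’s permutation importance standing in for the attribution step, none of which come from the interview), is to refit on resampled data many times and see which features keep their importance, and with how much spread:

```python
# Illustrative resilience check: which features hold their importance across
# repeated refits on resampled data (all modeling choices are assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
importances = []
for run in range(20):
    Xb, yb = resample(X, y, random_state=run)  # bootstrap the training data
    model = RandomForestClassifier(n_estimators=200, random_state=run).fit(Xb, yb)
    result = permutation_importance(model, X, y, n_repeats=5, random_state=run)
    importances.append(result.importances_mean)

importances = np.array(importances)
for j in range(X.shape[1]):
    mean, std = importances[:, j].mean(), importances[:, j].std()
    print(f"feature {j}: importance {mean:.3f} +/- {std:.3f}")  # low spread = resilient feature
```

Features with a high mean and a low spread across runs are the ones a practitioner could defend as must-haves.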

Once again, these are aspirational thoughts, but if realized even in a proxy fashion, would go a long way towards helping practitioners decide on which data sources are must-haves versus nice-to-haves.

AIM: Data costs are going up while storage costs are going down. How can organizations make a case for spending resources on improving data quality, especially when budgets are tight? Can you share the long-term benefits of this investment?

Arvind Balasundaram: Organizations always have an appetite for arguments predicated on value, since this shifts the conversation from costs alone towards a return-on-investment (ROI) focus. This is especially true when budgets are tight. This line of thinking is also preferred by finance chiefs, for reasons I articulated earlier. A return-focused narrative reduces all investments in their consideration set to a common denominator, enabling an apples-to-apples comparison and a more objective evaluation going forward.

Here is where a quick pivot is needed with AI-at-scale investments. These are largely still being positioned as value investments on a pure outcomes-only basis, with little reference to the associated data sourcing costs, linkage infrastructures, platform technologies, and so on, especially those needed to keep the engine running continuously. With the rise of real-time delivery of business insights, satisfying those expectations means continuously investing in evolving AI. What is the expected rate of return in doing this?

As the granularity of data sources increases, organizations face important tradeoffs in assessing the usefulness of this improved granularity for delivering successful business outcomes. Seen in this light, AI is an enabler of augmented decision-making, capable of uncovering novel opportunity spaces that might not be evident otherwise. But the conversation must also address at what cost scale can be achieved, and how differentiated and first-to-market the associated return is. Why is the marginal dollar of enterprise investment more attractive to spend on AI-enablement initiatives versus other demands in the basket of forward investments?

Only with a flavor of feasibility can the case for more investment in AI become more compelling. When the efficiency gains associated with AI wear out (for example, task efficiency and precision with generative AI), organizations will increasingly expect effectiveness and accountability pillars to justify longer-term, continuous investments.

AIM: Can you sum up what you see as the future role of data quality in AI and Machine Learning? In just one sentence?

Arvind Balasundaram: John Ruskin said, “Quality is never an accident. It is always the result of intelligent effort.”

The same holds true for being deliberate and purposeful when selecting data inputs for ML/AI implementation.
