
How Far Can I Trust A Language Model With Anand S

Let LLMs evolve as fast as they want, but if we evolve along with the LLMs, learning how to leverage them, we have, A, a powerful tool and, B, less risk of being left behind.

Anand S, co-founder of Gramener, a pioneering data science company, is at the forefront of leading a talented team that specializes in transforming complex data into compelling visual narratives. Recognized as one of India’s top 10 data scientists, he is a sought-after speaker at TEDx events.

His educational journey is marked by excellence: he earned a gold medal at IIM Bangalore and studied at IIT Madras and London Business School, before gaining valuable experience at IBM, Infosys Consulting, Lehman Brothers, and BCG.

Apart from his professional accomplishments, Anand has a diverse range of interests and pastimes. He has manually transcribed every Calvin & Hobbes comic strip, has enthusiastically delved into Minecraft, inspired by his daughter, and nurtures a personal aspiration to watch every movie in the IMDb Top 250 (excluding The Shining).

In this week’s CDO Insights, we wanted to understand the trustworthiness of language models, especially large ones like GPT-3, which is a critical and evolving concern. While these models exhibit impressive capabilities in generating human-like text, they also raise questions about their reliability, potential biases, and ethical implications. As these models become increasingly integrated into various applications, understanding their limitations and ensuring responsible usage is paramount to building trust in their outputs.

AIM: Why is this particular topic so significant to you, and what motivated you to initiate a discussion about it? 

Anand S: The last time I was this excited by any technology was in 1998, when I tried out Google as a search engine. The transformation for me at that point was a change from “I need to learn stuff” to “I no longer need to learn stuff. I can look up practically anything.” And that is a huge leap. So I changed my perspective at that point: I don’t really need to know where stuff is, I don’t need to memorize things, I don’t need to build up a knowledge base. I can focus instead on analysis, on figuring out how to do stuff.

What I see now with the introduction of Large Language Models and their ability to think, so to speak, is that now they can do the analysis. I don’t have to figure out how to do stuff anymore. That’s something that they’re good at and they’re improving at a rate faster than almost anything that I’ve seen. Which means that my job has completely changed. I no longer need to know how to do stuff. I need to figure out what to do. 

Asking questions has become far more important now; how to ask and what questions to ask are the biggest changes. On a day-to-day basis, it has completely changed the way I approach things. I have pretty much stopped coding by myself; it’s invariably programming with a language model. I have stopped researching by myself; I instead ask a language model to summarize anything I’m researching. I’ve stopped ideating or brainstorming by myself; I invariably ask a language model to brainstorm with me. It’s almost like I’ve suddenly hired half a dozen assistants.

And they are interacting and working with me on a variety of topics. My only problem now is my imagination. So I wanted, among other things, to talk about this topic, if nothing else to prompt my imagination to go beyond what I’m used to and see how I can leverage the power of LLMs.

AIM: Why, after more than 25 years of experience, does the aspect of trust in technology particularly resonate with you and why is it so crucial?

Anand S: Technology used to be the realm of reliability. You write a piece of code and it does exactly what you tell it to do. You may be giving the computer wrong instructions, in which case you have created the bug, it’s throwing out an error, and you have to debug the mistake you made. But by and large the computer doesn’t make that kind of mistake; it’s just following instructions. Of course, things have evolved to the point where we no longer really understand the interplay in software, so it almost starts feeling like the computer is untrustworthy, that it’s making a mistake, not us.

But with deep learning, and particularly with language models, we are beginning to control systems not through something as structured as code but through something as amorphous as English. The way in which we take a lot of technology’s reliability for granted is now going for a toss. Earlier, if I searched on Google and the result came from a source like a newspaper, I might believe it. At least the credibility of the newspaper is something I know how to evaluate, and I’m not concerned about Google’s reliability as an intermediary. Today, with a large language model, I’m not sure what I can trust. What kinds of mistakes is it likely to make, and why is it likely to make those mistakes? All of that is opaque. And because this is a completely different way of thinking about computers and technology in general, I figured that’s something we should talk about.

AIM: What are the specific features or factors of large language models (LLMs) that contribute to their lack of trustworthiness or discomfort in their use, especially compared to earlier technologies? What makes LLMs, in particular, raise concerns, and how can we address these issues from the outset?

Anand S: Scale is definitely one thing. Take Google Maps, for instance: people have driven into lakes just because Google Maps was giving them the wrong direction. I have found myself wandering near the bed of a lake in a taxi, on some really bizarre path I had never taken before. The taxi driver had never been there before either, but we both trusted Google Maps and just went ahead. It’s so easy to trust something that is consistently right; that’s just a human element. How many people would believe that a parent would strangle a child? And yet some do. Is it then the fault of the child for having trusted the parent naively? Similarly, a technology that leads you in the right direction, even unexpectedly, nine times out of ten builds that trust. You say there’s no way this route is right, and Google Maps proves to be right nine times out of ten. The tenth time it proves wrong, and even if I end up in the middle of the lake, the next time I am still going to follow Google Maps. It works.

And that is the thing about scale: it works so often and so ubiquitously that the standards just need to be higher. That’s certainly one aspect of this. But the second thing that keeps me cautious is how LLMs work in general. It’s basically: take a word and try to guess what the next word might be. Then take those two words and try to guess what the third word might be. That’s literally what an LLM is doing, word after word.

In the process, it also doesn’t deterministically pick the next word. If you ask the same question again, it may take a different set of words to continue from at any point, and that’s part of the model parameters. You can choose how much randomness you want: a higher degree of randomness leads to a more creative model, a lower degree of randomness leads to a more precise model. In any case, what you can’t avoid, and don’t even want to avoid, is randomness. But if there is randomness, and it is true randomness, that means, A, less reproducibility and, B, less predictability. I can’t be sure that it’ll say the same thing again. I can’t be sure that I’ll be able to tell what it’s going to say next time. If, let’s say, a Mars rover were being driven by a large language model, I would trust it about as much as I would trust another human being. It could work. But I don’t know what could go wrong.
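To make that point concrete, here is a minimal sketch, in Python with NumPy and made-up scores rather than a real model, of how a temperature setting controls the randomness of next-word selection: a low temperature keeps the output repeatable, a high temperature makes it more varied.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick the next token given raw model scores (logits).

    A low temperature sharpens the distribution (more precise, more repeatable);
    a high temperature flattens it (more creative, less predictable).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy vocabulary and made-up scores for the word that follows "India will ..."
vocab = ["win", "lose", "bat", "declare"]
logits = [2.0, 0.5, 0.3, -1.0]

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, temperature=t)] for _ in range(8)]
    print(f"temperature={t}: {picks}")
```

At temperature 0.2 the sketch almost always picks “win”; at 2.0 it wanders across the whole vocabulary, which is exactly the trade-off between precision and reproducibility on one side and creativity on the other.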

AIM: How can we measure the level of trust an individual can have, and what framework can be established to address this significant question, especially when considering the vast scale of trust involved?

Anand S: I have no idea. Can we measure it? I hope so. But how? Absolutely no idea. And the thing is, it’s not that we don’t have frameworks of trust; it’s that our understanding of what constitutes trust is itself pretty amorphous. Take a lie detector test. If someone passes a lie detector test, would we say they are trustworthy? There are two ways of looking at it. One is to say yes, because they passed the test. The other is to say, hold on, why did they have to be subjected to a lie detector test in the first place? If somebody has been through a lie detector test, there’s enough smoke that I would worry about that person anyway. That’s one part of the grey area we are in.

If we do start building models of trust, then it’s quite likely they will be very domain specific. For example, if I have a language model that has been trained on medical journals, I could give it a test: can it correctly answer a set of abstract questions around medical technology? Models have been tested on IQ, but again it’s an open question whether IQ constitutes trust or something else. So what counts as trustworthy itself remains an open question, and I think that will lead us to more narrowly define what we mean by trust in specific instances and then solve for those problems, rather than solving for trust in a general way.
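As a rough illustration of that kind of narrow, domain-specific check, here is a minimal sketch of a scoring harness; the `medical_qa` questions and the `ask_model` function are hypothetical placeholders for a real question bank and whatever LLM call you actually use.

```python
# Hypothetical question bank; a real one would be far larger and curated by experts.
medical_qa = [
    {"question": "What vitamin deficiency causes scurvy?", "answer": "vitamin c"},
    {"question": "Which organ produces insulin?", "answer": "pancreas"},
]

def ask_model(question: str) -> str:
    """Placeholder: plug in whatever LLM call you actually use."""
    raise NotImplementedError

def domain_accuracy(qa_pairs, ask=ask_model):
    """Return the fraction of questions whose known answer appears in the reply."""
    correct = 0
    for item in qa_pairs:
        reply = ask(item["question"]).lower()
        if item["answer"] in reply:  # crude containment check; real evals need more care
            correct += 1
    return correct / len(qa_pairs)

# Usage (once ask_model is wired up): track domain_accuracy(medical_qa) over time,
# per domain, and treat the trend as one narrow, measurable signal of trust.
```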

AIM: In light of the challenges in measuring ethical values and trust in technology, particularly within the context of data professionals who thrive on measurement, how do you envision fostering greater collaboration between social scientists, technologists, and communities to enhance trust in algorithms and promote digital literacy?

Anand S: How do we build trust in a model? I think a good way of thinking about a large language model is the way you would treat a smart friend who has read all of Wikipedia.

How would one evaluate such a person? First, you would ask whether they’re able to answer a reasonable number of questions. Second, you’d see if those answers are verifiably right. And you’d also see if there are ways of asking questions that are more likely to lead to useful answers. Let’s take an example. On Monday evening, when the India-Pakistan match was going on, I asked who was likely to win. ChatGPT did a search and came back with an equivocal answer: “Given the current situation, even though India scored 356 runs and the rain has stopped play, the match could go either way.”

At which point I have an answer that is unlikely to be wrong. It’s saying, “I don’t know.” How wrong could that be? But it is not necessarily useful.

So my next statement was, “You are confident. Go ahead and make a prediction.” At which point it said, “In that case, I think there is a good chance that India will win, because according to my database there have been only three instances where a country has scored more than 350 runs and still lost the match. So India seems to be well placed.” It also offered to search for the full statistics of matches where a country scored over 350 and still lost, and found only eight instances. This is one of those cases where my lack of trust comes not from it saying something that’s wrong but from it refusing to say something that might be wrong. It goes in both directions. I do want it to stick its neck out, and to say how far it’s sticking its neck out.

The second is that I can nudge these models to behave in certain ways. I can say “you are confident” and it will answer more confidently. I can say “reason out your steps” and it will reason out those steps. So part of it is evaluation: for us to have a point of view on whether the model’s output is likely to be right or wrong. The second part is knowing how to tell the model to improve its result. The third, on top of that, is aggregation over time: seeing how often it gets something right, what it gets right, and building up a mental model from that.

One particular way this could be accelerated is by having bots with different personalities. Suppose I have a critical bot that in general distrusts things, and a researcher bot, and I get the two to talk to each other. The critical bot is likely to take the information and throw back possible errors: have you considered x and y, and so on? The researcher bot can then go back, check, find evidence, and show it to the critical bot, which can then poke further holes, which the researcher can then address. At the end of the process, you have a separate bot summarize the conversation, and you can see the set of issues that have been resolved and the set of open issues that are still unresolved, and get a more holistic perspective.

So I believe we’re not too far from a situation where we can use different bots with different personalities, conversing among themselves, to get something that is more trustworthy than any one bot with one personality can be.
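As a rough illustration of that critic-and-researcher loop, here is a minimal sketch; the `chat` function and the personality prompts are hypothetical placeholders for whichever chat-completion API and prompt wording you actually use.

```python
RESEARCHER = "You are a careful researcher. Answer with evidence and show your reasoning."
CRITIC = "You are a sceptical critic. Point out possible errors, gaps and unstated assumptions."
SUMMARISER = "Summarise the debate: list resolved issues and still-open issues separately."

def chat(personality: str, transcript: str) -> str:
    """Placeholder: send the transcript to your LLM with `personality` as the system prompt."""
    raise NotImplementedError

def debate(question: str, rounds: int = 3) -> str:
    transcript = f"Question: {question}\n"
    answer = chat(RESEARCHER, transcript)
    transcript += f"Researcher: {answer}\n"
    for _ in range(rounds):
        objection = chat(CRITIC, transcript)      # poke holes in the latest answer
        transcript += f"Critic: {objection}\n"
        answer = chat(RESEARCHER, transcript)     # address the objections with evidence
        transcript += f"Researcher: {answer}\n"
    return chat(SUMMARISER, transcript)           # resolved vs. still-open issues
```

The design choice is simply that disagreement is cheap to generate and useful to read: the final summary separates what survived the critic’s objections from what didn’t.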

AIM: How can we effectively assess the reliability of responses generated by Language Models (LLMs), considering the nuanced and often vague nature of ethical values associated with trustworthiness?

Anand S: Ethics is one of those things that I believe is actually less of a problem with such models, and the reason is that ethics still lies largely in the domain of humans. Say I program a bot to speak profanity. It will; that’s literally the instruction it was given. If I program a bot to be gender biased, it will be. Forget programming, even just instructing a bot to do it is enough, and we have seen that LLMs can be trained to follow instructions really well. So a big chunk of the ethical value, so to speak, lies entirely in how we guide these LLMs or, put the other way, what uses we put them to.

The reality is that the same set of tools we have used to evaluate the ethics of any decision-making process can be applied here. Take security screening in the US. There was a time when security screening happened by people at the airports taking a call, doing some kind of profiling: let’s do a random test of this person versus that person. How do we know that’s fair, if fairness is the aspect of ethics we’re considering right now? We look at the percentage of people pulled out for random profiling and see how well distributed that is, whether it’s even or biased towards certain groups.

Amazon did this systematically with the machine learning models that were picking resumes for hiring. They asked how often they hired women, and as a percentage it was lower than for men. That was because the data they had fed in came from past candidates, many of whom were men, and the model was following that same pattern. Once they realized it, they corrected it. But what are the kinds of biases in the model that Amazon is not aware of? That’s a blind spot; it’s unknown.

Put another way, the way we would evaluate, say, the fairness of an LLM is no different from the way we would evaluate the fairness of a machine learning model we had before, or a simple statistical model, or a human process. Look at the output and see if it holds up. And that is a powerful technique: I’m not concerned about what it’s doing behind the scenes; if the output over time is sufficiently unbiased, then it’s likely doing a good job.

If this kind of verification can be automated, and if there’s enough data to do it, nothing like it. But the reality is that if it’s a niche use, it will take a long time to figure this out. When we’re doing things at scale, there is going to be more data, and it’s going to happen faster.
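As a rough illustration of that kind of automated output check, here is a minimal sketch that computes selection rates per group from logged decisions; the groups and outcomes shown are made up, and a real audit would need far more care than this.

```python
from collections import defaultdict

def selection_rates(decisions):
    """decisions: iterable of (group, selected) pairs, e.g. ("women", True).

    Returns the selection rate per group so that large gaps can be flagged for review.
    """
    totals, picked = defaultdict(int), defaultdict(int)
    for group, selected in decisions:
        totals[group] += 1
        picked[group] += int(selected)
    return {g: picked[g] / totals[g] for g in totals}

# Made-up outcomes logged from a screening model over time
log = [("men", True), ("men", False), ("men", True),
       ("women", False), ("women", False), ("women", True)]

print(selection_rates(log))  # roughly {'men': 0.67, 'women': 0.33} -- a gap worth reviewing
```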

AIM: What are your final thoughts on the essential considerations for using LLMs in specific use cases after our discussion today?

Anand S: When we use LLMs, it’s not just the power of the LLMs that is growing; it’s our ability with and knowledge of LLMs that is growing as well. For many years, when somebody asked a question, I would just search on Google, find the first result, copy-paste the answer, and send it to them. In fact, quite often, I still do that. I find that the first result just works.

It always surprised me: why is it that so many people can’t do what seems evident? Just search, find the answer, and get the result. I suspect it’s simply a lack of familiarity; they haven’t internalized that there is a way of getting knowledge that is different from asking someone. With LLMs, we are entering an era where the answer to pretty much any question can be found this way. So, as a closing thought, I’d say: let LLMs evolve as fast as they want, but if we evolve along with the LLMs, learning how to leverage them, we have, A, a powerful tool and, B, less risk of being left behind.
