In the ever-evolving world of large language models (LLMs), a question arises about the reliability of their benchmarks. Critics argue that LLM benchmarks can be unreliable due to various factors, such as training data contamination and models overperforming on carefully crafted inputs. Avijit Chatterjee, Head of AI/ML and NextGen Analytics at Memorial Sloan Kettering Cancer Center, offers an interesting perspective on this debate. He emphasizes that widespread technology adoption often speaks louder than benchmarks.
Chatterjee draws parallels between the LLM debate and historical database benchmarks, like TPC-C for OLTP and TPC-DS for Analytics. He notes that despite the fierce competition among database vendors in the past, today’s leader in the cloud-native data warehouse market, Snowflake, no longer relies on benchmarks to maintain its dominance. Similarly, in the realm of LLMs, the top contenders, including GPT-4, Llama 2, Orca, Claude 2, and Cohere, are vying for enterprise adoption without solely depending on benchmarks.
The issues with LLM benchmarks extend beyond reliability. Critics argue that these benchmarks are often too narrow in scope and fail to represent real-world use cases. The datasets used to train LLMs may not reflect the diverse and complex data encountered in real-world applications, leading to discrepancies between benchmark performance and practical utility.
However, the pursuit of human-like intelligence in LLMs remains a prominent goal. To gauge progress, LLMs are put to the test on various multiple-choice exams, such as the USMLE and MedMCQA. Notably, the few-shot variant tends to outperform the zero-shot approach: including a handful of worked examples in the prompt anchors the model's reasoning and expected answer format, rather than leaving the model to infer both from the question alone.
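The zero-shot/few-shot distinction comes down to whether worked examples are included in the prompt. A minimal sketch of the idea follows; the question text, exemplars, and prompt format are invented for illustration and do not reflect any specific benchmark's evaluation harness:

```python
# Sketch: zero-shot vs. few-shot prompt construction for a multiple-choice
# exam question. All questions and answers below are hypothetical examples.

EXEMPLARS = [  # worked examples, included only in the few-shot prompt
    ("Which vitamin deficiency causes scurvy?\n"
     "A) Vitamin A  B) Vitamin C  C) Vitamin D  D) Vitamin K", "B"),
    ("Which organ produces insulin?\n"
     "A) Liver  B) Spleen  C) Pancreas  D) Kidney", "C"),
]

def build_prompt(question: str, shots: int = 0) -> str:
    """Build a prompt; shots=0 is zero-shot, shots>0 prepends exemplars."""
    parts = ["Answer the multiple-choice question with a single letter."]
    for q, a in EXEMPLARS[:shots]:
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")  # the actual test item
    return "\n\n".join(parts)

question = ("Which cells produce antibodies?\n"
            "A) T cells  B) B cells  C) Macrophages  D) Neutrophils")
zero_shot = build_prompt(question, shots=0)
few_shot = build_prompt(question, shots=2)  # 2 exemplars here, for brevity
print(few_shot)
```

A 5-shot evaluation like the one discussed below works the same way, just with five exemplars drawn from a held-out set.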
Microsoft Research conducted a comparison between GPT-4 and GPT-3.5 on USMLE self-assessment exams, where GPT-4 achieved a remarkable average score of 86.65% in the 5-shot variant, significantly surpassing the passing threshold of 60%. GPT-4 also outperformed GPT-3.5 and Flan-PaLM 540B by a wide margin on various medical domain benchmarks.
Assessing the reliability of LLMs is crucial for their application in clinical practice. In patient-centric scenarios, it's essential to evaluate model calibration beyond overall accuracy or F1 score. Calibration involves comparing the confidence scores or predicted probabilities of model outputs against actual outcomes, helping to gauge model trustworthiness. For instance, when GPT-4 assigned an average probability of 0.96 to its answers, it achieved higher accuracy than GPT-3.5 did at comparable confidence levels, highlighting the importance of model calibration.
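One common way to quantify the confidence-versus-outcome comparison is expected calibration error (ECE): bin predictions by confidence and measure the gap between average confidence and accuracy within each bin. A minimal sketch, using hypothetical confidence/correctness data rather than any real model's outputs:

```python
# Sketch: expected calibration error (ECE) from per-answer confidences.
# A well-calibrated model is right about as often as its confidence claims.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1];
    correct: 1 if the corresponding answer was right, else 0."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        # weight each bin's |confidence - accuracy| gap by its share of items
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: high-confidence answers are mostly right,
# low-confidence ones less so.
confs = [0.96, 0.92, 0.85, 0.60, 0.55]
right = [1, 1, 1, 1, 0]
print(expected_calibration_error(confs, right))
```

A lower ECE means the model's stated probabilities can be taken more at face value, which is exactly the property patient-facing applications need.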
Ultimately, while benchmarks serve as academic reference points, the true measure of LLM success lies in their adoption and practical applicability. LLMs aim to enhance human-machine synergy rather than pit man against machine in an adversarial manner. Trust, bias, drift, and explainability are fundamental issues that demand ongoing attention as LLMs continue to transform productivity across various domains.