Navigating the Diversity–Validity Dilemma: Machine Learning Innovations for HR with Lindsey Zuloaga

By Abhijeet Adhikari
Published on June 27, 2024


“There are things that humans excel at and tasks where algorithms are more efficient.” - Lindsey Zuloaga

Navigating the diversity-validity dilemma in machine learning for human resources is pivotal as organizations strive to enhance hiring processes without perpetuating biases. This challenge involves balancing the predictive accuracy of AI assessment tools with their fairness across diverse demographic groups. Addressing this dilemma requires innovative bias mitigation techniques, robust ethical frameworks, and interdisciplinary collaboration to ensure AI tools align with ethical standards and organizational goals. As AI continues to evolve within HR, fostering a culture that values both diversity and high-performance standards will be crucial for the ethical and effective deployment of these technologies.

This week, we had the chance to gain valuable perspectives from Lindsey Zuloaga, the Chief Data Scientist at HireVue. She is an AI thought leader renowned for her expertise in data strategy and innovation, and is regarded as a trailblazer in using AI to transform recruitment processes, enhance fairness, and promote equal opportunities. She also has a significant presence in media and public speaking, especially in discussing AI ethics, transparency, and the impact of AI on technology.

AIM Media House: To begin with, could you share some insights about HireVue and your specific role within the organization?

“We pioneered the concept of asynchronous video interviews many years ago and have since developed numerous tools that help make the hiring process more efficient.”

Lindsey Zuloaga: HireVue has been around for 20 years, but we started off quite small. Our founder was seeking a job while completing his MBA and faced difficulty securing an interview at Goldman Sachs—even though the company’s office was practically across the street from his school. The reason was that Goldman Sachs did not conduct interviews at his institution, a small liberal arts college, preferring instead to recruit from larger, more well-known universities. This experience inspired our founder to wonder why companies couldn’t broaden their hiring funnel. His idea was to enable more people to interview by sending them webcams to record their responses to set questions. This would allow companies to review these interviews more efficiently without the need to physically visit select universities. It’s a compelling story because Goldman Sachs is now a big client of ours and has expanded its recruitment to many more schools, thanks to advancements in technology. We pioneered the concept of asynchronous video interviews many years ago and have since developed numerous tools that help make the hiring process more efficient. One such tool uses AI or machine learning to assess the language candidates use in video interviews, focusing on how they handle competency-related questions or challenging situations with customers. We also offer virtual job tryouts, games, scheduling tools, and coding challenges to make the recruitment process more efficient and streamlined.

Personally, I come from a physics background, which is common among many in data science. I entered the field before dedicated data science programs were established. Many of my peers from STEM backgrounds found transitioning to data science relatively straightforward given our strong mathematical foundation. I’ve been with HireVue for about seven and a half years, and it’s been fascinating to engage with this sector as someone with a technical background.

AIM Media House: Given your role and considering the rapid advancements in technology as of 2024, what are the primary challenges you encounter while scaling AI solutions in the crucial area of hiring?

“Often, people assume that the algorithm will make hiring decisions for them or tell them what to do, or even that it might replace their jobs.”

Lindsey Zuloaga: One of the most significant challenges we face is change management. Hiring practices have been established for a long time, and many legacy systems operate in specific ways that people are accustomed to. Changing these established methods can be difficult. However, entering a well-established field comes with many benefits. I’ll discuss more about the fairness and expectations regarding the validity of these algorithms and how they are supposed to function. We have a lot of established standards, which is beneficial, something that many fields lack.

On the other hand, what people are used to, what they expect, and their understanding of technology can pose challenges. The level of tech-savviness among users of these tools varies significantly. In our case, the tools are designed to support human decision-making by providing more data, which helps in making informed decisions. Often, people assume that the algorithm will make hiring decisions for them or tell them what to do, or even that it might replace their jobs. However, our technology isn’t built for that. It’s designed to automate the mundane tasks involved in hiring—like resume reviews or initial phone screenings, which many companies automate. We focus on assessing candidates so that human recruiters can spend more time interacting with candidates and selling them on the role. They are armed with more data to make decisions, which ultimately transforms their jobs. We’re seeing this transformation across many industries. It’s changing everyone’s job, though perhaps not in the ways one might assume.

AIM Media House: How do cultural differences in expressions, intonation, and vocabulary impact the effectiveness of hiring systems in identifying suitable candidates?

“These expressions are not universal; in some cultures, certain expressions are not widely recognized, and there are obviously different levels of expression among various cultures.”

Lindsey Zuloaga: We have experimented with it in the past, and what we observed was intriguing. When hiring for a customer service role, how one expresses themselves—whether through facial expressions or tone of voice—can indeed be important. There was early research suggesting some universality in facial expressions, but then Lisa Feldman Barrett’s work came out a few years ago, compiling numerous studies that indicated more nuance, especially around certain facial expressions. These expressions are not universal; in some cultures, certain expressions are not widely recognized, and there are obviously different levels of expression among various cultures.

We have always monitored our algorithms for bias against demographic groups. However, issues also arise with video settings, such as varying lighting conditions that may affect data quality depending on skin color. We’ve consistently checked to ensure that our scoring does not vary significantly among different groups. Ultimately, though, there was considerable concern about using video or tonal data. The vast majority of valuable information we were obtaining came from the language people used. With the advances in natural language processing over the last several years, we found that we were deriving most of our value from there. Often, what people say aligns well with their facial expressions, so in most cases, the video data wasn’t providing much additional information. Due to these concerns, we decided it was more trouble than it was worth and phased out the use of this data three or four years ago.

AIM Media House: Could you explain to our audience what the diversity-validity dilemma is? Additionally, what inspired you to conduct research on this specific topic?

“It implies that tweaking an assessment to achieve certain outcomes can create a trade-off: maximizing certain traits may lead to diversity issues. Addressing this mathematically is quite intriguing.”

Lindsey Zuloaga: This paper appears in the Journal of Applied Psychology, which focuses on industrial-organizational (I/O) psychology. My team of data scientists works closely with a team of I/O psychologists, creating a synergistic relationship within our company. We are fortunate to have the resources and time to publish in academic journals, which is not very common in the industry.

The term “selection procedures,” widely used within I/O psychology, refers to methods used for hiring that are predictive of job performance but often present issues with bias, also known as adverse impact. The reasons for this can be complex. For example, certain measures of job performance might be influenced by subjective definitions of what constitutes performance, or they could relate to retention. The origins of the data can be uncertain, potentially influenced by factors such as socioeconomic status, leading to certain demographic groups scoring higher for various reasons.

Many of the most valid predictors of job performance in these assessments tend to exhibit group differences in outcomes. This issue is referred to as the “validity-diversity dilemma” in selection procedures. It implies that tweaking an assessment to achieve certain outcomes can create a trade-off: maximizing certain traits may lead to diversity issues. Addressing this mathematically is quite intriguing.

I/O psychologists have been considering these issues for a long time, but now we are applying more complex models and machine learning algorithms to what used to be simpler tasks. Our team that published this paper consists of half I/O psychologists and half data scientists, including a brother-sister duo, which adds a personal touch. The two first authors, one an I/O psychologist at HireVue and the other a data scientist there, bring together these two worlds to explore ways of optimizing to preserve both validity and diversity, aiming for the best of both worlds in this trade-off.

AIM Media House: Is it correct to say that the diversity-validity dilemma in machine learning involves a trade-off between diversity and validity due to its mathematical nature?

“It is common that we have datasets where the things that strongly predict job outcomes also show demographic differences. Mathematically, there are effective methods to mitigate this.”

Lindsey Zuloaga: It depends on the dataset, but it is common with these types of datasets to encounter an anti-correlation between validity and diversity. For example, there may be a correlation between a desirable job outcome and a split along demographic groups, or there might be some inherent bias in the data. This is particularly challenging when no additional data is available. This issue is also prevalent in IQ tests, where the reasons for score differences between racial groups are often debated. Similarly, with job performance data, complexities arise, such as higher turnover rates among women in certain jobs, possibly due to an inhospitable environment. However, we do not want our algorithms to perpetuate these patterns. Mathematically, there are effective methods to mitigate these biases.

You might have seen headlines criticizing hiring algorithms, such as those that favored only male candidates because they were trained on past data. There are simplistic approaches, for example, identifying and ignoring words in resumes that correlate with being male. In our paper, we discuss the removal of predictors as a baseline approach we have been using for years. This involves examining each feature, particularly those based on neural net language models like BERT, which are complex and engineered but integral to the model. We assess each feature’s contribution to group differences using Cohen’s d and evaluate its predictive value. We then rank these features and iteratively remove those that contribute to bias and are less predictive, continually adjusting the model to reduce bias while maintaining a degree of predictive validity.
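The predictor-removal baseline described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not HireVue's implementation: the function names, the ranking rule (absolute Cohen's d minus absolute correlation with the outcome), the least-squares scorer, and the 0.2 stopping threshold are all assumptions made for the example.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns in-sample scores."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w

def remove_biased_features(X, y, group, max_abs_d=0.2):
    """Drop features until the scores' group difference is small enough.

    Hypothetical ranking rule: remove the feature with the largest
    (group difference - predictive value). At least one feature is kept.
    """
    def removal_priority(j):
        bias = abs(cohens_d(X[group == 1, j], X[group == 0, j]))
        validity = abs(np.corrcoef(X[:, j], y)[0, 1])
        return bias - validity

    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        scores = fit_predict(X[:, kept], y)
        if abs(cohens_d(scores[group == 1], scores[group == 0])) <= max_abs_d:
            break
        kept.remove(max(kept, key=removal_priority))
    return kept

# Synthetic demo: feature 0 is predictive and group-neutral,
# feature 1 is less predictive and shifted by one SD for group 1.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)
f0 = rng.normal(size=n)
f1 = rng.normal(size=n) + group
X = np.column_stack([f0, f1])
y = 2 * f0 + f1 + rng.normal(size=n)

kept = remove_biased_features(X, y, group)
print("features kept:", kept)
```

In this toy setup the group-shifted feature is the one that gets dropped. The cost is visible too: that feature carried some real signal, which is the trade-off the interview describes.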

This paper introduces a new method that improves upon the traditional approach. Instead of manually adjusting the model, we add a term to the cost function that penalizes the model for group differences in outcomes. This allows for an automated adjustment of the penalty coefficient, enhancing both the fairness and predictive value of the model. We present side-by-side plots in the paper comparing the validity and diversity of the traditional and new methods, demonstrating that the latter, a multi-penalty optimization approach, is significantly more effective. This solution, simple yet elegant, addresses an age-old problem in data science and I/O psychology, using larger datasets than are typical in classical work.
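To make the idea of a fairness term in the cost function concrete, here is a small sketch for a linear model. It assumes one particular penalty, the squared difference in mean scores between two groups, weighted by a coefficient `lam`. The paper's exact formulation may differ, and `fit_penalized` is a hypothetical name. With this choice the minimizer has a closed form, so no iterative optimizer is needed.

```python
import numpy as np

def fit_penalized(X, y, group, lam):
    """Minimize mean squared error + lam * (mean score gap between groups)^2.

    Assumed objective, for illustration only:
        (1/n) * ||A w - y||^2 + lam * (c . w)^2
    where A is X with an intercept column and c is the difference between
    the two groups' mean feature vectors (its intercept entry is 0).
    Setting the gradient to zero gives a linear system in w.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    c = A[group == 1].mean(axis=0) - A[group == 0].mean(axis=0)
    lhs = A.T @ A / n + lam * np.outer(c, c)
    rhs = A.T @ y / n
    w = np.linalg.solve(lhs, rhs)
    scores = A @ w
    mse = float(np.mean((scores - y) ** 2))
    gap = float(abs(scores[group == 1].mean() - scores[group == 0].mean()))
    return w, mse, gap

# Synthetic data: the second feature is predictive but shifted for group 1.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)
X = np.column_stack([rng.normal(size=n), rng.normal(size=n) + group])
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=n)

# Sweep the penalty coefficient and record the validity/fairness trade-off.
lams = [0.0, 1.0, 10.0, 100.0]
mses, gaps = [], []
for lam in lams:
    _, mse, gap = fit_penalized(X, y, group, lam)
    mses.append(mse)
    gaps.append(gap)
    print(f"lam={lam:6.1f}  mse={mse:.3f}  group gap={gap:.3f}")
```

Unlike feature removal, every feature stays in the model and only the weights shift. Sweeping `lam` traces out a curve of prediction error against group gap, from which an operating point can be chosen.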

AIM Media House: As you work to understand and mitigate bias by adjusting the parameterization or features of your tools, you’re removing biases, which affects validity but enhances diversity in your processes, correct?

“Yes, we tune the feature weights in a way that results in preserving the validity of the model while making the algorithm fairer and more effective overall.”

Lindsey Zuloaga: Yes, the first method I described involves completely removing features, which is fairly straightforward. The second method is more nuanced because it retains all features but adjusts their weights. In this approach, we might identify a feature that is slightly biased in one direction, but it is counterbalanced by another feature. This method is not as straightforward to explain at first glance. However, we tune the feature weights in a way that results in preserving the validity of the model while making the algorithm fairer and more effective overall.

AIM Media House: So two questions here; firstly, does adjusting parameters to mitigate bias affect the accuracy of the model, and is there a reliable method to definitively identify the right candidate, especially when comparing equally qualified individuals? Secondly, how are the weights in the model determined?

“We introduce a penalty into this process to encourage the algorithm not only to make accurate predictions but also to do so without perpetuating group differences.”

Lindsey Zuloaga: In terms of validity, which refers to how accurately we can predict outcomes that are not included in our training set, this process is inherently challenging in any machine learning scenario. When selecting an outcome, we might consider various factors such as job turnover, sales numbers, or scores from a game or assessment. Each measurement must be taken with a grain of salt. The more we can measure someone’s performance on the job with objective metrics, the better, but acquiring such data is increasingly difficult.

The optimization process within the algorithm is automatic. It explores a parameter space that may contain hundreds or thousands of variables, searching for an optimal point. This is akin to predicting the price of a house based on attributes like square footage, number of rooms, and the presence of a garage—except there’s no manual assignment of value, like giving two points for a garage. Instead, the algorithm autonomously determines the weights it needs to make accurate predictions based on the training data.
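The house-price analogy can be illustrated with a tiny least-squares fit. The houses and prices below are invented for the example, and the prices were generated from a known rule so that the recovered weights can be checked. Nobody assigns "points for a garage"; the fit works that value out from the data.

```python
import numpy as np

# Made-up houses: square footage, number of rooms, has garage (0/1).
features = np.array([
    [1000, 2, 0],
    [1500, 3, 1],
    [2000, 3, 0],
    [1200, 2, 1],
    [1800, 4, 1],
    [2500, 5, 0],
], dtype=float)

# Prices generated from a hidden rule the model is not told about:
# 50,000 base + 100 per sq ft + 10,000 per room + 15,000 for a garage.
true_weights = np.array([50_000.0, 100.0, 10_000.0, 15_000.0])
A = np.column_stack([np.ones(len(features)), features])
prices = A @ true_weights

# Least squares finds the weights that best reproduce the prices.
weights, *_ = np.linalg.lstsq(A, prices, rcond=None)
print(dict(zip(["base", "per_sqft", "per_room", "garage"], weights.round(2))))
```

Real hiring models involve far more features and noisy outcomes, so the weights are estimates rather than exact recoveries, but the mechanism is the same.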

We introduce a penalty into this process to encourage the algorithm not only to make accurate predictions but also to do so without perpetuating group differences. By incorporating this penalty, the algorithm adjusts its weights to achieve both objectives, finding an optimal balance in the process.

AIM Media House: For instance, in the context of a machine learning algorithm used for detecting pneumonia in chest X-rays, incorrect predictions result in penalties that lead to adjustments in the algorithm. How is the penalty determined for such an algorithm in hiring scenarios, especially when candidates are equally qualified and there is no clear correct decision?

“It’s not always possible to have ground truth feedback immediately.”

Lindsey Zuloaga: That is difficult in hiring, because we often don’t receive immediate feedback on a specific candidate to feed directly back into the system, as you might in other scenarios. Often, whether a person is hired or not, we may not receive feedback on them to input into the same algorithm at a later time. We’re typically predicting outcomes like how well a person answered a question. Therefore, it’s not always possible to have ground truth feedback immediately. Over time, we may gather more training data, which we’ll use to retrain the model. We then test out of sample to see how well our model predicts, serving as an indicator of its performance on unseen data. In the scenario you described, there is no online learning or similar real-time feedback mechanism in place.
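Testing out of sample, as mentioned above, means fitting on one portion of the data and measuring predictions on a portion the model never saw. A minimal sketch with synthetic data follows; the 80/20 split, the linear model, and the use of correlation as the validity metric are assumptions for illustration.

```python
import numpy as np

def add_intercept(M):
    """Prepend a column of ones so the model can learn an offset."""
    return np.column_stack([np.ones(len(M)), M])

# Synthetic "interview features" and "rated outcome".
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# Shuffle once, then hold out the last 100 rows for evaluation only.
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]

# Fit on the training rows only.
w, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Score the held-out rows and correlate predictions with actual outcomes.
pred = add_intercept(X[test]) @ w
r_out_of_sample = float(np.corrcoef(pred, y[test])[0, 1])
print(f"out-of-sample correlation: {r_out_of_sample:.3f}")
```

When new labeled data arrives later, the same procedure is repeated on the larger dataset; nothing here updates the model in real time.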

AIM Media House: Given the subjectivity involved in assessing qualifications from AI-trained data on past interviews, how do you effectively determine if a candidate’s response genuinely indicates their suitability for the job?

“When constructing algorithms to score responses, we ensure that trained evaluators review the answers.”

Lindsey Zuloaga: In our paper, we use broad examples to illustrate how we utilize video interviews to assess job competencies, such as the ability to handle difficult customers or teamwork skills. As you pointed out, our I/O psychology team plays a crucial role in this process. They conduct a detailed job analysis to determine the necessary competencies for a customer service role, for instance. Based on their findings from interviewing current employees about their day-to-day tasks, they develop an assessment with specific questions tailored to these competencies.

When constructing algorithms to score responses, we ensure that trained evaluators review the answers. Each response is assessed by multiple reviewers using a clear rubric that defines what constitutes a one through five rating. If there is significant disagreement among the reviewers, further discussion is required. This rigorous curation of training data ensures clarity in what we are predicting and helps the user understand the assessment criteria.
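The curation step above, where several raters score each response and large disagreements go back for discussion, might look like this in code. The disagreement rule (a spread of more than one rating point) and the use of the mean rating as the training label are assumptions; the interview does not specify either.

```python
import numpy as np

def curate_labels(ratings, max_spread=1):
    """Turn multi-rater scores into training labels.

    ratings: one row per response, one column per rater (1-5 scale).
    Responses whose highest and lowest ratings differ by more than
    `max_spread` are flagged for discussion and get no label (NaN).
    """
    ratings = np.asarray(ratings, dtype=float)
    spread = ratings.max(axis=1) - ratings.min(axis=1)
    needs_discussion = spread > max_spread
    labels = np.where(needs_discussion, np.nan, ratings.mean(axis=1))
    return labels, needs_discussion

ratings = [
    [4, 4, 5],  # raters roughly agree
    [2, 5, 3],  # significant disagreement
    [1, 1, 2],
    [3, 3, 3],
]
labels, needs_discussion = curate_labels(ratings)
print("labels:", labels)
print("flagged for discussion:", np.flatnonzero(needs_discussion).tolist())
```

Only the second response is flagged here; the others receive the average of their ratings as the label the model is trained to predict.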

Indeed, the optimal way to determine job suitability would be to observe an applicant performing the job for three months. However, an interview serves as a practical proxy. We focus on predicting how well candidates perform in these interviews because structured interviews have been demonstrated in the literature as reliable predictors of job performance. This connection is also substantiated later in our findings.

It is crucial to communicate clearly with our clients about what we are measuring: we assess how candidates discuss their ability to manage challenging situations, and we provide explicit definitions for each rating level, such as what a three out of five entails. This approach, while somewhat subjective, establishes a link between interview performance and real-world job performance, acknowledging the inherent gap that exists between the two.

AIM Media House: In hiring and other critical areas, while AI can streamline processes, human oversight is crucial. Where should we draw the line in the hiring process to balance AI involvement with ethical responsibilities, and what are the future trends for using AI in hiring?

“There are things that humans excel at and tasks where algorithms are more efficient.”

Lindsey Zuloaga: Yes, I think it’s certainly true that in many areas, the combination of humans and machines yields the best results. There are things that humans excel at and tasks where algorithms are more efficient. The question is, how do we bring the best of both worlds together? Our tools are designed to augment human capabilities, providing more data and information to enhance decision-making processes.

When humans use our interface, they encounter the most promising candidates at the top of the list—those you would want to review first out of potentially hundreds or thousands of applicants. Instead of allowing 90% of these candidates to simply disappear, we start with the most promising and work our way down until a suitable candidate is found. This structured approach not only speeds up the process but also makes it more efficient.

As someone who has been a job candidate, I understand how frustrating it can be to apply for a job and never receive a response. We aim to make the entire hiring process easier and quicker. However, it is crucial to recognize that humans still play a vital role in this process. Ultimately, humans make the final hiring decision, and face-to-face or screen-to-screen interviews remain essential steps after the initial assessments and pre-recorded interviews.


Abhijeet Adhikari

Abhijeet Adhikari is a Research Associate at AIM-Research, focusing on AI and data science related research reports. Beyond his professional role, Abhijeet is an avid reader with a particular interest in historical and mythological facts. You can reach him at abhijeet.adhikari@aimresearch.co
