Reliability Study

Improving Psychometric Interview Assessments with Large Language Models

Psychometric tests are essential in hiring because of their many advantages, including equitable candidate assessment, forecasting of future work performance, construction of a character profile, identification of leadership potential, and improved onboarding and retention of candidates. LLM-based evaluators are designed to gauge personality but have not yet been established as reliable diagnostic tools. Here, we outline the procedures and findings used to assess the reliability of LLM-based hiring, which includes resume screening and video interviews and offers several important potential advantages for businesses, including increased productivity, lower costs, and higher-calibre hires. This study examines test-retest reliability and internal consistency to assess the dependability of such a system. An Intraclass Correlation Coefficient (ICC) of 0.955 was obtained from the test-retest study, demonstrating good stability over time. Cronbach's alpha scores for ten latent variables ranged from 0.902 to 0.979, indicating strong internal reliability. The results imply that LLM-based evaluators can provide standardized, objective, and bias-resistant assessments that are comparable to traditional approaches in consistency and reliability, improving the hiring process as a whole.

Methodology

Through the use of conversational AI and psycholinguistic analysis, the reliability of a large language model for evaluating personality traits was thoroughly examined using a systematic method. The purpose of this study was to determine the consistency and reliability of the model while addressing the biases and static interaction formats inherent in standard psychometric assessments.

In order to measure and statistically analyse dependability metrics across several sessions and assessors, a quantitative study design was selected. This design made it easier to evaluate the model's performance objectively. A wide sample of candidates was chosen for the study in order to guarantee that it represented a cross-section of possible job seekers. During the interviews, the model assessed each candidate's behavioural fit for a particular position. Test-retest reliability and internal consistency were the two main reliability indicators that guided the data-gathering process. Together, these measures provide a thorough assessment of the consistency and dependability of the model.

The LLM-based AI interviewer system utilizes advanced language models to interpret candidate responses.

Test-Retest

Examining test-retest reliability, a measure of the AI's consistency across time, was the first step in the investigation. The approach started with meticulous planning to specify the goals of the study and choose a representative sample of interview subjects. To make sure that variations in the model's evaluations could be ascribed to the test-retest situation rather than to variations in candidate responses, consistency in input responses was essential.

The Evaluator was used for the first round of interviews, during which each candidate's answers were assessed, and the scores or qualitative ratings were carefully documented. The second round of interviews was planned to act as the retest after a suitable break. To ensure uniformity, the same candidates were interviewed under the same conditions, and the assessments from the second session were again recorded to guarantee alignment with the original test data. The data was then prepared for analysis by organizing and cleaning it; to keep the dataset intact, all missing values were imputed with the mean value. After the prepared data was loaded into SPSS, a two-way mixed-effects model with an emphasis on absolute agreement was used to obtain the Intraclass Correlation Coefficient (ICC). This model was selected because it is appropriate when the same rater (in this case, the Evaluator) provides all measurements and the objective is to evaluate the consistency of those ratings over time.
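The computation above can also be sketched outside SPSS. The following is a minimal numpy illustration of the single-rating, absolute-agreement ICC from a two-way ANOVA decomposition (the point estimate is computed the same way under the two-way mixed-effects model described above), with the mean-imputation step included. The function name and the synthetic usage data are illustrative assumptions, not artifacts of the study:

```python
import numpy as np

def icc_absolute_agreement(ratings):
    """Single-rating, absolute-agreement ICC from a two-way ANOVA.

    ratings: (n_candidates, k_sessions) array of scores, where each
    column is one testing session. NaNs are imputed with the column
    mean, mirroring the mean-imputation step described above.
    """
    x = np.asarray(ratings, dtype=float)
    # Mean-impute any missing scores, column by column
    col_fill = np.nanmean(x, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(x))
    x[nan_rows, nan_cols] = col_fill[nan_cols]

    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-candidate means
    col_means = x.mean(axis=0)   # per-session means

    # ANOVA mean squares: rows = candidates, columns = sessions
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Absolute-agreement, single-measurement ICC
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical usage with synthetic test/retest scores:
rng = np.random.default_rng(0)
true_score = rng.uniform(40, 90, size=30)            # 30 candidates
sessions = np.column_stack([true_score + rng.normal(0, 2, 30),
                            true_score + rng.normal(0, 2, 30)])
icc = icc_absolute_agreement(sessions)               # close to 1.0
```

When the session-to-session noise is small relative to the spread between candidates, as in the synthetic data above, the ICC approaches 1, which is the pattern the study reports.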

Overview

This study examined test-retest reliability and internal consistency to assess the dependability of the LLM-based hiring system, which includes video interviews and resume screening. The test-retest analysis yielded an Intraclass Correlation Coefficient (ICC) of 0.955, demonstrating good stability over time, while Cronbach's alpha scores for the ten latent variables ranged from 0.902 to 0.979, indicating strong internal reliability.
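The internal-consistency figures above follow the standard Cronbach's alpha formula, which compares the sum of the individual item variances with the variance of the total scale score. This is a minimal numpy sketch; the function name and the synthetic data are illustrative assumptions, not from the study:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for one latent variable.

    items: (n_respondents, k_items) array of item scores, where the
    k items are intended to measure the same underlying construct.
    """
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)        # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical usage: five items driven by one latent trait
rng = np.random.default_rng(1)
latent = rng.normal(0, 1, 200)
items = np.column_stack(
    [latent + rng.normal(0, 0.3, 200) for _ in range(5)])
alpha = cronbach_alpha(items)                # high, items covary strongly
```

Items that share a common latent driver produce an alpha near 1, as in the 0.902 to 0.979 range reported here, while unrelated items push alpha toward 0.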

Test-Retest Reliability

An exceptionally high value of 0.955 for the Intraclass Correlation Coefficient (ICC) was obtained from the test-retest reliability analysis. This number represents a high degree of consistency in the LLM-powered assessments of applicants between two testing sessions held two weeks apart. Such an ICC score greatly exceeds the conventional threshold for excellent reliability, typically defined as an ICC value above 0.9.

The ICC value of 0.955 has several highly beneficial implications. Most importantly, it validates the LLM platform's robustness in evaluating individuals' credentials, abilities, and general suitability in a highly trustworthy way. This level of dependability guarantees the stability of the evaluator's assessments over time, offering a solid basis for consistent and well-informed employment decisions. Furthermore, the high degree of agreement between the test and retest results indicates that conditions or external variables that could otherwise cause inconsistent candidate assessments have little effect on the evaluator's system. The integrity and dependability of the evaluation process depend on this resilience to outside influences, which guarantees repeatable and reliable outcomes.

Fig 2 Deviation of the Openness score across multiple test-retest sessions. Scores range from 0 to 100.
Fig 3 Deviation of the Conscientiousness score across multiple test-retest sessions. Scores range from 0 to 100.
Fig 4 Deviation of the Extraversion score across multiple test-retest sessions. Scores range from 0 to 100.
Fig 5 Deviation of the Agreeableness score across multiple test-retest sessions. Scores range from 0 to 100.
Fig 6 Deviation of the Neuroticism score across multiple test-retest sessions. Scores range from 0 to 100.

With such a high ICC value, the evaluator’s process is more credible, which strengthens the technology's potential as a useful tool in a range of assessment scenarios. The LLM's judgments are stable and reliable, which emphasizes their suitability for practical applications and longitudinal studies where regular review over time is essential.

Fig 7 Intraclass correlation score for the test-retest performed on the candidates' interview data.

Conclusion

With an ICC of 0.955, the LLM-based evaluator exhibits strong reliability, reinforcing its potential as a useful tool in a variety of assessment contexts. The stability and reliability of its assessments make the technology suitable for practical applications and longitudinal studies, where consistent review over time is essential.
