In a recent study published in the journal Scientific Reports, researchers evaluated the performance of Generative Pre-trained Transformer-4 (GPT-4) and ChatGPT on United States Medical Licensing Examination (USMLE) questions assessing soft skills.
Artificial intelligence (AI) is increasingly being utilized in medical practice. Large language models (LLMs), such as GPT-4 and ChatGPT, have drawn considerable scientific attention, with multiple studies assessing their performance in medicine. Although LLMs have proven proficient in various tasks, their performance in areas that require human judgment and empathy has yet to be investigated.
The USMLE measures cognitive acuity, medical knowledge, the ability to navigate complex scenarios, patient safety, and professional, ethical, and legal judgment. The USMLE Step 2 Clinical Skills examination, the usual test for evaluating interpersonal and communication skills, was discontinued as a consequence of the coronavirus disease 2019 (COVID-19) pandemic. Nevertheless, the core clinical communication elements have been integrated into other steps of the USMLE.
USMLE Step 2 Clinical Knowledge (CK) scores predict performance across domains such as communication, professionalism, teamwork, and patient care. Artificial cognitive empathy is an emerging field of interest. Understanding the capability of AI to accurately perceive and respond to patients' emotional states is particularly relevant to patient-centered care and telemedicine.
Study: Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Image Credit: Tex vector / Shutterstock
About the study
In the current study, researchers assessed the performance of GPT-4 and ChatGPT on USMLE questions involving human judgment, empathy, and other soft skills. They used 80 questions designed to meet USMLE standards, compiled from two sources. The first source was the set of USMLE sample questions for Step 1, Step 2 CK, and Step 3 available on the official USMLE website.
The sample test questions were screened, and 21 questions requiring professionalism, interpersonal and communication skills, cultural competence, leadership, organizational behavior, and legal/ethical judgment were selected. Questions requiring medical or scientific knowledge were excluded.
Fifty-nine Step 1-, Step 2 CK-, and Step 3-type questions were identified from the second source, AMBOSS, a question bank for medical students and practitioners. The AI models were tasked with answering all the questions. The prompt structure comprised the question text and the multiple-choice answer options.
After the models responded, they were asked the follow-up question "Are you sure?" to test the stability and consistency of each model and to trigger a potential re-evaluation of its initial answer. A revised answer would indicate some uncertainty. The performance of the AI models was compared with that of humans using AMBOSS user performance statistics.
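The study queried the models through their chat interfaces, but a question-then-recheck workflow of this kind could, in principle, be scripted against the OpenAI API. The sketch below is illustrative only; the model name, the ask_with_recheck helper, and the exact prompt wording are assumptions rather than details reported in the paper.

```python
# Illustrative sketch only -- not the study's actual setup.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask_with_recheck(question_text: str, choices: list[str], model: str = "gpt-4") -> tuple[str, str]:
    """Pose a multiple-choice question, then follow up with 'Are you sure?'."""
    # Build a prompt containing the question text plus lettered answer options.
    prompt = question_text + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCDEFGH", choices)
    )
    messages = [{"role": "user", "content": prompt}]

    first = client.chat.completions.create(model=model, messages=messages)
    first_answer = first.choices[0].message.content

    # The follow-up probes whether the model sticks with or revises its initial answer.
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": "Are you sure?"},
    ]
    second = client.chat.completions.create(model=model, messages=messages)
    return first_answer, second.choices[0].message.content
```

Comparing the two returned answers would then show whether the model held firm or changed its response after the follow-up prompt.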
Findings
The overall accuracy of ChatGPT was 62.5%: it was 66.6% accurate on the USMLE sample test questions and 61% accurate on the AMBOSS questions. GPT-4 showed superior performance, achieving an overall accuracy of 90%. GPT-4 answered the USMLE sample test questions with 100% accuracy, while its accuracy on the AMBOSS questions was 86.4%. Regardless of whether its initial response was correct, GPT-4 never modified its answer when prompted to re-evaluate.
ChatGPT revised its initial response for 82.5% of the questions when prompted. When ChatGPT modified an initially incorrect response, it corrected the error and produced the right answer 53.8% of the time. AMBOSS user statistics revealed a mean rate of correct responses of 78% for the specific questions used in this study. On these questions, ChatGPT therefore performed worse than humans, whereas GPT-4 performed better, with accuracies of 61% and 86.4%, respectively.
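As a quick consistency check on the reported figures, the overall accuracies follow from the per-source results as a question-weighted average over the 21 USMLE sample questions and 59 AMBOSS questions (the weighting itself is our reconstruction, not a calculation presented in the paper):

```python
# Question counts reported in the study
n_usmle, n_amboss = 21, 59

# Per-source accuracies reported for each model (fraction correct)
chatgpt = (0.666 * n_usmle + 0.61 * n_amboss) / (n_usmle + n_amboss)
gpt4 = (1.00 * n_usmle + 0.864 * n_amboss) / (n_usmle + n_amboss)

print(f"ChatGPT overall ~ {chatgpt:.1%}")  # ~ 62.5%
print(f"GPT-4 overall   ~ {gpt4:.1%}")     # ~ 90.0%
```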
Conclusions
In summary, the researchers tested the performance of the AI models GPT-4 and ChatGPT on USMLE questions involving soft skills, including judgment, ethics, and empathy. Both models accurately answered most questions. Nonetheless, GPT-4's performance was superior to ChatGPT's, as it correctly answered 90% of the questions compared with 62.5% for ChatGPT. Unlike ChatGPT, GPT-4 showed confidence in its answers and never revised its original responses.
However, ChatGPT demonstrated confidence in only 17.5% of the questions, maintaining its initial answers. The findings show that LLMs produce impressive results on questions testing the soft skills required of physicians. They also indicate that GPT-4 is more capable of effectively tackling questions requiring professionalism, ethical judgment, and empathy. ChatGPT's inclination to revise its initial responses might suggest a design emphasis on flexibility and adaptability, favoring diverse interactions.
In contrast, the consistency of GPT-4 could indicate a robust sampling mechanism or training predisposed toward stability. Furthermore, GPT-4 also surpassed human performance. Notably, the re-evaluation mechanism applied in this study may not reflect human cognitive understanding of uncertainty, because AI models operate according to calculated probabilities rather than human-like confidence.