Assessing the Performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination

Yudai Kaneda; Ryo Takahashi; Uiri Kaneda; Shiori Akashima; Haruna Okita; Sadaya Misaki; Akimi Yamashiro; Akihiko Ozaki; Tetsuya Tanimoto

doi:10.7759/cureus.42924

Cureus. 2023 Aug 3;15(8):e42924. doi: 10.7759/cureus.42924. eCollection 2023 Aug.

Authors

Yudai Kaneda¹, Ryo Takahashi², Uiri Kaneda³, Shiori Akashima⁴, Haruna Okita⁵, Sadaya Misaki⁶, Akimi Yamashiro⁷, Akihiko Ozaki⁸, Tetsuya Tanimoto⁹

Affiliations

¹ College of Medicine, Hokkaido University, Hokkaido, JPN.
² Department of Rehabilitation Medicine, Sonodakai Joint Replacement Center Hospital, Tokyo, JPN.
³ Department of Foreign Languages, Dokkyo University, Saitama, JPN.
⁴ Department of Obstetrics and Gynecology, Shonan Kamakura General Hospital, Kanagawa, JPN.
⁵ College of Medicine, Tokyo Women's Medical University, Tokyo, JPN.
⁶ Department of Rehabilitation Medicine, Sonoda Daiichi Hospital, Tokyo, JPN.
⁷ Department of Nutrition Science, Shokei Gakuin University, Miyagi, JPN.
⁸ Department of Breast and Thyroid Surgery, Jyoban Hospital of Tokiwa Foundation, Fukushima, JPN.
⁹ Department of Internal Medicine, Navitas Clinic, Kanagawa, JPN.

Abstract

Purpose: This study evaluated how capabilities changed between the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 versions of the large language model ChatGPT within a Japanese medical context.

Methods: ChatGPT versions 3.5 and 4 answered questions from the 112th Japanese National Nursing Examination (JNNE). The study comprised three analyses: calculation of correct answer rates and score rates, comparison of GPT-3.5 with GPT-4, and comparison of correct answer rates for conversation questions.

Results: ChatGPT versions 3.5 and 4 responded to 237 of the 238 Japanese questions from the 112th JNNE. GPT-3.5 achieved an overall accuracy rate of 59.9% and failed to meet the passing standards, scoring 58.0% on compulsory questions and 58.3% on general/scenario-based questions. GPT-4 achieved an overall accuracy rate of 79.7% and satisfied the passing standards, scoring 90.0% and 77.7%, respectively. GPT-4 showed a higher accuracy rate than GPT-3.5 for every problem type. Accuracy on compulsory questions improved from 58.0% with GPT-3.5 to 90.0% with GPT-4, on general questions from 64.6% to 75.6%, and on scenario-based questions from 51.7% to 80.0%. For conversation questions, GPT-3.5 had an accuracy rate of 73.3% and GPT-4 had an accuracy rate of 93.3%.

Conclusions: The GPT-4 version of ChatGPT displayed performance sufficient to pass the JNNE, significantly improving on GPT-3.5. This suggests that, with specialized medical training, such models could be beneficial in Japanese clinical settings by aiding decision-making. However, ChatGPT's responses can be inaccurate, so user awareness and training are crucial. Responsible usage, with an understanding of the model's capabilities and limitations, is therefore vital to best support healthcare professionals and patients.
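The abstract reports two different quantities for GPT-3.5 and GPT-4: accuracy rates (share of questions answered correctly) and the scores compared against the passing standards. The sketch below illustrates how the two can diverge. It rests on assumptions that are not stated in the abstract: the per-section question counts and correct-answer counts are back-calculated so that they reproduce the reported percentages, and scenario-based questions are assumed to be worth two points each while other questions are worth one. The names `COUNTS`, `accuracy_rate`, and `score_rate` are hypothetical.

```python
# ASSUMPTION: none of these counts appear in the abstract. They are
# back-calculated so that the resulting percentages match the reported ones.
# ASSUMPTION: scenario-based questions carry 2 points, all others 1 point.
COUNTS = {
    "compulsory": {"n": 50, "points": 1, "gpt35": 29, "gpt4": 45},
    "general": {"n": 127, "points": 1, "gpt35": 82, "gpt4": 96},
    "scenario": {"n": 60, "points": 2, "gpt35": 31, "gpt4": 48},
}

ALL_SECTIONS = ("compulsory", "general", "scenario")


def accuracy_rate(model, sections=ALL_SECTIONS):
    """Percentage of questions answered correctly, unweighted."""
    correct = sum(COUNTS[s][model] for s in sections)
    total = sum(COUNTS[s]["n"] for s in sections)
    return round(100 * correct / total, 1)


def score_rate(model, sections=ALL_SECTIONS):
    """Percentage of available points earned, weighted by points per question."""
    earned = sum(COUNTS[s][model] * COUNTS[s]["points"] for s in sections)
    possible = sum(COUNTS[s]["n"] * COUNTS[s]["points"] for s in sections)
    return round(100 * earned / possible, 1)


if __name__ == "__main__":
    for model in ("gpt35", "gpt4"):
        print(model, "overall accuracy:", accuracy_rate(model))
        print(model, "general/scenario score:",
              score_rate(model, ("general", "scenario")))
```

Under these assumed counts, the weighted general/scenario-based score for GPT-3.5 (58.3%) falls between its unweighted accuracy on general questions (64.6%) and on scenario-based questions (51.7%), because the scenario-based questions count double.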

Keywords: ai & robotics healthcare; chatgpt; chatgpt-3.5; chatgpt-4; clinical applications; gpt-3.5; gpt-4; japan; national nursing examination.