Training Large Language Models (LLMs) with in-domain data can significantly enhance their performance, leading to more accurate and reliable question-answering (QA) systems essential for supporting clinical decision-making and educating patients.
This study introduces LLMs trained on well-curated, in-domain ophthalmic datasets, together with a substantial open-source ophthalmic language dataset for model training. Our LLMs (EYE-Llama) were first pre-trained on an ophthalmology-specific corpus comprising paper abstracts, textbooks, EyeWiki, and Wikipedia articles. The models were then fine-tuned on a diverse range of QA datasets. At each stage, the LLMs were compared to the baseline Llama 2, ChatDoctor, and ChatGPT (GPT-3.5) models on four distinct test sets and evaluated both quantitatively (accuracy, F1 score, and BERTScore) and qualitatively by two ophthalmologists.
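For concreteness, the sketch below outlines the two-stage adaptation described above (domain-adaptive pre-training on ophthalmic text, followed by supervised QA fine-tuning), assuming the Hugging Face Transformers and Datasets libraries; the file names, hyperparameters, and prompt template are illustrative placeholders rather than the exact EYE-Llama training configuration.

```python
# Two-stage adaptation sketch (illustrative; not the authors' exact setup).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # baseline Llama 2 checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Ophthalmic corpus text: abstracts, textbooks, EyeWiki, Wikipedia articles.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# Stage 1: continued (domain-adaptive) pre-training on the ophthalmic corpus.
# "ophthalmology_corpus.jsonl" is a hypothetical file of {"text": ...} records.
corpus = load_dataset("json", data_files="ophthalmology_corpus.jsonl")["train"]
corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

pretrain_args = TrainingArguments(
    output_dir="eye-llama-pretrained",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)
Trainer(
    model=model,
    args=pretrain_args,
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

# Stage 2: supervised fine-tuning on QA pairs, framed as causal language
# modeling over a simple question/answer prompt template.
def format_qa(batch):
    # "question"/"answer" fields are hypothetical names for the QA records.
    text = [f"Question: {q}\nAnswer: {a}"
            for q, a in zip(batch["question"], batch["answer"])]
    return tokenizer(text, truncation=True, max_length=1024)

qa = load_dataset("json", data_files="ophthalmic_qa.jsonl")["train"]
qa = qa.map(format_qa, batched=True, remove_columns=qa.column_names)

finetune_args = TrainingArguments(
    output_dir="eye-llama-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
)
Trainer(
    model=model,
    args=finetune_args,
    train_dataset=qa,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Both stages use the same causal language-modeling objective; only the data changes, which is what separates continued pre-training on raw ophthalmic text from supervised fine-tuning on QA pairs.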
When evaluated on the American Academy of Ophthalmology (AAO) test set with BERTScore as the metric, our models surpassed both Llama 2 and ChatDoctor in BERTScore F1 and performed on par with ChatGPT, a model with 175 billion parameters (EYE-Llama: 0.57, Llama 2: 0.56, ChatDoctor: 0.56, ChatGPT: 0.57). On the MedMCQA test set, the fine-tuned models achieved higher accuracy than both Llama 2 and ChatDoctor (EYE-Llama: 0.39, Llama 2: 0.33, ChatDoctor: 0.29), although ChatGPT outperformed EYE-Llama with an accuracy of 0.55. On the PubMedQA test set, the fine-tuned model improved in accuracy over Llama 2, ChatDoctor, and ChatGPT (EYE-Llama: 0.96, Llama 2: 0.90, ChatDoctor: 0.92, ChatGPT: 0.93).
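As an illustration of how such scores can be computed, the sketch below uses the scikit-learn and bert-score packages; the prediction and reference lists are hypothetical stand-ins for model outputs on the AAO, MedMCQA, and PubMedQA test sets, not actual study data.

```python
# Illustrative metric computation (hypothetical inputs, not study data).
from sklearn.metrics import accuracy_score, f1_score
from bert_score import score as bert_score

# Multiple-choice sets (e.g., MedMCQA, PubMedQA): compare predicted option labels.
pred_labels = ["A", "C", "B", "A"]   # model-selected options (hypothetical)
gold_labels = ["A", "B", "B", "A"]   # ground-truth options (hypothetical)
print("Accuracy:", accuracy_score(gold_labels, pred_labels))
print("F1 (macro):", f1_score(gold_labels, pred_labels, average="macro"))

# Free-text sets (e.g., AAO): BERTScore compares generated and reference answers
# via contextual embeddings and reports precision, recall, and F1.
candidates = ["Intraocular pressure is lowered with topical prostaglandin analogs."]
references = ["Topical prostaglandin analogs reduce intraocular pressure."]
P, R, F1 = bert_score(candidates, references, lang="en")
print("BERTScore F1:", F1.mean().item())
```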
This study shows that pre-training and fine-tuning LLMs such as EYE-Llama on domain data enhance their performance in specific medical domains. Our EYE-Llama models surpass baseline Llama 2 in all evaluations, highlighting the effectiveness of specialized LLMs in medical QA systems. (Funded by NEI R15EY035804 (MNA) and UNC Charlotte Faculty Research Grant (MNA).)