Wed Dec 28 2022

Large Language Models Encode Clinical Knowledge

Medicine
Natural Language Processing, Machine Learning
Medical question answering, clinical knowledge and reasoning, AI in medicine

The paper presents MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries, together with HealthSearchQA, a new dataset of commonly searched health questions, to evaluate the clinical knowledge of large language models (LLMs). It proposes a framework for human evaluation of model answers along multiple axes, and evaluates PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset, but human evaluation reveals key gaps in its long-form answers. To address these, the paper introduces instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars, and shows that comprehension, recall of knowledge, and medical reasoning all improve with model scale and instruction prompt tuning.
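The core idea behind prompt tuning can be illustrated with a minimal sketch: the pretrained model is kept frozen, and only a small matrix of "soft prompt" vectors, prepended to every input, is trained. All names, shapes, and values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # embedding width (toy size)
prompt_len = 4   # number of learnable soft-prompt vectors
vocab = 100      # toy vocabulary size

# Frozen pretrained embedding table (stands in for the whole frozen LLM).
embed_table = rng.normal(size=(vocab, d_model))

# The only trainable parameters: the soft prompt.
soft_prompt = rng.normal(scale=0.02, size=(prompt_len, d_model))

def build_input(token_ids):
    """Prepend the learned soft prompt to the embedded input tokens."""
    token_embs = embed_table[token_ids]               # (seq, d_model)
    return np.concatenate([soft_prompt, token_embs])  # (prompt_len + seq, d_model)

x = build_input(np.array([5, 17, 42]))
print(x.shape)  # 4 prompt vectors + 3 token embeddings -> (7, 8)
```

In actual prompt tuning, gradients flow only into `soft_prompt`, so the number of trained parameters is tiny (here 4 × 8 = 32) relative to the frozen model; the paper's variant learns such prompts from a handful of clinician-curated exemplars.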

LLMs have potential utility in medicine, but current models have limitations in clinical applications. The framework and approaches presented in this paper can help evaluate and improve LLMs' clinical knowledge and reasoning, and facilitate the responsible integration of AI into medical processes and workflows.
