Large Language Models Encode Clinical Knowledge
The paper presents MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries, to evaluate the clinical knowledge of large language models (LLMs). It proposes a framework for human evaluation of model answers along multiple axes, and evaluates PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset, but human evaluation reveals key gaps in its responses. The paper therefore introduces instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars, and shows that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
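To make the idea of instruction prompt tuning concrete, the sketch below illustrates the general soft-prompt-tuning recipe it builds on: the LLM's weights stay frozen and only a small set of prepended "soft prompt" vectors is learned from a handful of exemplars. This is a minimal illustration, not the paper's exact procedure; the stand-in model name ("gpt2", since PaLM is not publicly available), prompt length, and training loop are assumptions for demonstration.

```python
# Minimal soft-prompt-tuning sketch (assumption: Hugging Face transformers + PyTorch).
# Only the soft prompt vectors are trained; all LLM weights remain frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # stand-in for a large frozen LM (assumption)
prompt_len = 20       # number of learnable soft-prompt tokens (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():          # freeze the base model
    p.requires_grad_(False)

embed = model.get_input_embeddings()  # token-embedding layer of the frozen LM
soft_prompt = torch.nn.Parameter(     # the only trainable parameters
    torch.randn(prompt_len, embed.embedding_dim) * 0.02
)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(question: str, answer: str) -> float:
    """One gradient step on a single (question, answer) exemplar."""
    ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    tok_embeds = embed(ids)                                   # (1, T, D)
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
    # Mask the soft-prompt positions out of the loss with label -100.
    labels = torch.cat(
        [torch.full((1, prompt_len), -100, dtype=torch.long), ids], dim=1
    )
    loss = model(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In the paper's variant, the exemplars are clinician-curated instruction-and-answer examples, so the learned prompt steers the frozen model toward the medical domain while updating only a tiny fraction of the parameters.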
LLMs show potential utility in medicine, but current models remain limited for clinical applications. The framework and approaches presented in this paper can help evaluate and improve the clinical knowledge and reasoning of LLMs, and facilitate the integration of AI into medical processes and workflows.