Summary
Although existing large language models (LLMs) show promising performance in general domains, their capabilities in specialized medical domains still have significant limitations.

First, current LLMs are pre-trained on massive amounts of general text data and lack the domain-specific knowledge required for medical question answering. To tackle this issue, we can conduct instruction-tuning that incorporates medical knowledge into LLMs, allowing a model to leverage both its general language understanding and medical commonsense knowledge. Additionally, generalizable prompt engineering techniques can be developed to guide LLMs toward better multi-choice question reasoning in the medical domain.

Second, because of the limited scale of available medical data, directly training large models on medical corpora faces a data scarcity challenge. A possible solution is transfer learning, which adapts a pre-trained general LLM to the medical domain by further fine-tuning on medical datasets. This can achieve comparable or even better performance than training from scratch while greatly reducing the data requirement.

Finally, evaluating LLMs in specialized domains requires targeted evaluation benchmarks. Although some medical question answering datasets exist, comprehensive benchmarks for assessing LLMs on a variety of medical reasoning and question answering tasks are still lacking. Constructing such benchmarks with high-quality data and reliable evaluation metrics is crucial to motivate progress in this area.

In summary, developing effective medical LLMs involves knowledge injection through instruction-tuning, transfer learning from general models, and building targeted medical evaluation benchmarks. By incorporating medical knowledge into powerful yet data-efficient models and creating comprehensive evaluation frameworks, we can push the boundaries of LLMs to enhance their capabilities for multi-choice medical question answering.
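As a concrete illustration of the instruction-tuning idea, the sketch below converts a multi-choice medical QA item into the kind of prompt/response pair used to fine-tune a general LLM. The record fields, prompt wording, and the example question are assumptions chosen for illustration, not a published data format.

```python
def format_instruction_example(question, options, answer, rationale=None):
    """Convert a multi-choice medical QA item into an instruction-tuning pair.

    Hypothetical schema: the fine-tuning corpus is assumed to be a list of
    {"prompt": ..., "response": ...} dicts, one per question.
    """
    letters = "ABCDE"
    option_text = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        "You are a medical expert. Answer the multiple-choice question.\n\n"
        f"Question: {question}\n{option_text}\nAnswer:"
    )
    # The target completion is the answer letter, optionally with a rationale.
    response = f" {answer}" + (f"\nRationale: {rationale}" if rationale else "")
    return {"prompt": prompt, "response": response}


example = format_instruction_example(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "C",
    rationale="Vitamin C is required for collagen synthesis.",
)
```

Pairs in this shape can then be fed to any standard supervised fine-tuning loop; the transfer-learning step is simply continuing training of the pre-trained general model on this medical corpus.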
We explored instruction-tuning for adapting LLMs to the medical domain. Larger models incorporated domain knowledge more effectively, and medical institutions can operate such models without relying on external services.
Published By:
Issey Sukeda - arXiv.org
2023
Cited By:
0
We propose CodeApex, a benchmark to evaluate the programming abilities of large language models. GPT-4 shows the best performance but still lags behind humans.
Published By:
Lingyue Fu - arXiv.org
2023
Cited By:
3
ChatGPT, a chatbot, passed nearly half of the medical exam cases, showing the need to change clinical reasoning assessments and to add AI to medical education.
Published By:
E. Strong - medRxiv
2023
Cited By:
18
We evaluated three language models on a gastroenterology exam. GPT-4 performed better than the other models on average.
Published By:
Shuhaib Ali - medRxiv
2023
Cited By:
0
GPT-4 significantly outperformed the older GPT-3.5 chatbot and top medical residents on a medical exam; GPT-4 provided rationale for most responses.
Published By:
R. S. Huang - JMIR Medical Education
2023
Cited By:
6
The performance of three large language models and three professional populations in answering ophthalmology questions was evaluated; GPT-3.5 and PaLM2 were below the master's level, GPT-4 showed a level comparable to attending physicians.
Published By:
J. Holmes - arXiv.org
2023
Cited By:
1
Recently, large language model (LLM) based artificial intelligence (AI) systems have demonstrated remarkable capabilities in natural language understanding and generation. We start from a pre-trained general LLM (AntGLM-10B) and fine-tune it from a medical beginner into a medical expert (called AntGLM-Med-10B) via a 3-stage optimization procedure: general medical knowledge injection, medical domain instruction tuning, and specific medical task adaptation. We collect and construct large-scale medical datasets for each stage of the optimization process. These datasets encompass various data types and tasks, such as question answering, medical reasoning, multi-choice questions, and medical conversations. Specifically for multi-choice questions in the medical domain, we propose a novel Verification-of-Choice prompt engineering approach, which significantly enhances the reasoning ability of LLMs. Remarkably, by combining the above approaches, our AntGLM-Med-10B model can outperform most LLMs on PubMedQA, both general and medical, even when those LLMs have larger model sizes.
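The excerpt does not spell out how Verification-of-Choice works internally. The sketch below is a hypothetical reconstruction of a verification-style prompting loop, in which each candidate choice is checked independently by the model and the most consistently verified option is selected; `ask_model` is a stand-in for a real LLM API, and the prompt wording is an assumption.

```python
def verification_of_choice(question, options, ask_model):
    """Pick the option the model verifies as correct.

    Hypothetical reconstruction of verification-style prompting: each
    candidate is posed to the model as a yes/no verification query, and
    the choice with the most affirmative verdicts wins (ties go to the
    earliest option).
    """
    votes = {}
    for letter, option in zip("ABCDE", options):
        verify_prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {option}\n"
            "Is this candidate correct? Reply YES or NO."
        )
        reply = ask_model(verify_prompt)
        votes[letter] = 1 if reply.strip().upper().startswith("YES") else 0
    return max(votes, key=votes.get)


def fake_model(prompt):
    # Toy stand-in for an LLM, used only to exercise the loop.
    return "YES" if "Vitamin C" in prompt else "NO"


best = verification_of_choice(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    fake_model,
)
```

The design point is that verifying each choice in isolation gives the model a simpler binary judgment per call, rather than one joint ranking over all options.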
Published By:
Qiang Li - arXiv.org
2023
Cited By:
1
We investigate uses of large language models (LLMs) in hematology, assessing knowledge through hematology questions from the USMLE. We propose augmenting LLMs with retrieval of medical guidelines to eliminate incorrect information. Extracting information from documents could streamline decision making. We evaluated GPT-3.5-turbo and GPT-4. We tested 127 hematology question-answer pairs from USMLE hematology. GPT-3.5 accuracy was 63%; GPT-4 was 82%. We evaluated an information retrieval framework with 120 multiple-choice questions from the WHO 2017 myeloid neoplasms and acute leukemia guidelines. Questions were assessed with a zero-shot approach (posing the question and options to the model) and with retrieval of three relevant extracts. Zero-shot GPT-3.5 accuracy was 51%; GPT-4 was 71%. With retrieval, GPT-3.5 accuracy rose to 86% and GPT-4 to 97%. LLMs show substantial hematology knowledge. Ensuring consistent, safe responses is critical for medical use. Information retrieval significantly improved reliability and accuracy by enabling more informed, appropriate answers. The concept was validated with the WHO 2017 guidelines and can apply to any hematology documents. Leveraging LLMs could enhance clinical, educational, and research hematology work.
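The retrieval framework described above (fetch three relevant guideline extracts, prepend them to the question) can be sketched roughly as follows. Keyword overlap stands in for whatever retriever the authors actually used, and the prompt format and sample guideline snippets are assumptions for illustration.

```python
import re


def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_extracts(question, documents, k=3):
    """Rank guideline extracts by keyword overlap with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, options, documents):
    """Prepend the top-k retrieved extracts to a multi-choice question."""
    context = "\n".join(f"- {e}" for e in retrieve_extracts(question, documents))
    opts = "\n".join(f"{l}. {o}" for l, o in zip("ABCDE", options))
    return (
        f"Guideline extracts:\n{context}\n\n"
        f"Question: {question}\n{opts}\nAnswer:"
    )


guidelines = [
    "Iron deficiency causes microcytic anemia.",
    "Scurvy results from vitamin C deficiency and impairs collagen synthesis.",
    "Beta blockers reduce heart rate.",
    "Vitamin D deficiency causes rickets in children.",
]
prompt = build_rag_prompt(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    guidelines,
)
```

The resulting prompt would then be sent to the model in place of the bare zero-shot question, which is the step the abstract credits for the accuracy gains.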
Published By:
Maria R Cervera - Blood
2023
Cited By:
0
CEA may be beneficial for some asymptomatic patients with carotid stenosis. Trials reported fewer strokes with CEA than with medical therapy, but the differences were small and the perioperative harms were not trivial. Three fair-to-good RCTs compared CEA plus medical therapy with medical therapy alone for asymptomatic carotid stenosis. Our meta-analyses of these trials found that CEA reduced the risk of perioperative stroke/death plus ipsilateral stroke (4 fewer events/1000; 95% CI, -7 to -2), any stroke (5 fewer/1000; CI, -8 to -3), and non-perioperative ipsilateral stroke (3 fewer/1000; CI, -5 to -1). There was no difference in all-cause mortality. CIs were modestly wider in sensitivity analyses. ACST reported that over half of non-perioperative strokes with CEA were disabling or fatal; the proportional reduction was similar to that for any stroke. An ACAS subgroup analysis found a reduction for men but not women. Duplex ultrasonography can detect carotid artery stenosis, but its measurement properties vary; a systematic review found a pooled sensitivity/specificity of 94%/92% for >60% stenosis. No studies directly evaluated screening. Despite this uncertainty, CEA continues to be performed for some asymptomatic patients.
Published By:
D. Jonas - Annals of Internal Medicine
2014
Cited By:
96
Medical Teacher introduces concept maps linked to journal articles; the maps visualize ideas and generate new understanding. The feature includes focus questions that signal key aspects of each map, augmenting reader comprehension.
Published By:
D. Torre - Medical Teacher
2023
Cited By:
0