Summary
Although existing large language models (LLMs) show promising performance in general domains, their capabilities in specialized medical domains still have significant limitations.

First, current LLMs are pre-trained on massive amounts of general text data and lack the domain-specific knowledge required for medical question answering. To tackle this issue, we can conduct instruction-tuning that incorporates medical knowledge into LLMs, allowing a model to leverage both its general language understanding and medical commonsense knowledge. Additionally, generalizable prompt engineering techniques can be developed to guide LLMs toward better multi-choice question reasoning in the medical domain.

Second, because of the limited scale of available medical data, directly training large models on medical corpora faces a data scarcity challenge. A possible solution is transfer learning, which adapts a pre-trained general LLM to the medical domain by further fine-tuning on medical datasets. This can achieve comparable or even better performance than training from scratch while greatly reducing the data requirement.

Finally, evaluating LLMs in specialized domains requires targeted evaluation benchmarks. Although some medical question answering datasets exist, comprehensive benchmarks for assessing LLMs on a variety of medical reasoning and question answering tasks are still lacking. Constructing such benchmarks with high-quality data and reliable evaluation metrics is crucial to motivate progress in this area.

In summary, developing effective medical LLMs involves knowledge injection through instruction-tuning, transfer learning from general models, and building targeted medical evaluation benchmarks. By incorporating medical knowledge into powerful yet data-efficient models and creating comprehensive evaluation frameworks, we can push the boundaries of LLMs to enhance their capabilities for multi-choice medical question answering.
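As a concrete illustration of the instruction-tuning idea, the sketch below converts a multi-choice medical QA item into the kind of prompt/response pair used to fine-tune a general LLM. The record fields, prompt wording, and the example question are assumptions chosen for illustration, not a published data format.

```python
def format_instruction_example(question, options, answer, rationale=None):
    """Convert a multi-choice medical QA item into an instruction-tuning pair.

    Hypothetical schema: the fine-tuning corpus is assumed to be a list of
    {"prompt": ..., "response": ...} dicts, one per question.
    """
    letters = "ABCDE"
    option_text = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        "You are a medical expert. Answer the multiple-choice question.\n\n"
        f"Question: {question}\n{option_text}\nAnswer:"
    )
    # The target completion is the answer letter, optionally with a rationale.
    response = f" {answer}" + (f"\nRationale: {rationale}" if rationale else "")
    return {"prompt": prompt, "response": response}


example = format_instruction_example(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "C",
    rationale="Vitamin C is required for collagen synthesis.",
)
```

Pairs in this shape can then be fed to any standard supervised fine-tuning loop; the transfer-learning step is simply continuing training of the pre-trained general model on this medical corpus.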
We explored instruction-tuning for adapting LLMs to the medical domain. Larger models incorporated domain knowledge more effectively, and medical institutions can operate such models without relying on external services.
Published By:
Issey Sukeda - arXiv.org
2023
Cited By:
0
We propose CodeApex, a benchmark to evaluate the programming abilities of large language models. GPT-4 shows the best performance but still lags behind humans.
Published By:
Lingyue Fu - arXiv.org
2023
Cited By:
3
ChatGPT, a chatbot, passed nearly half of the medical exam cases, showing the need to change clinical reasoning assessments and to add AI to medical education.
Published By:
E. Strong - medRxiv
2023
Cited By:
18
We evaluated three language models on a gastroenterology exam. GPT-4 performed better than the other models on average.
Published By:
Shuhaib Ali - medRxiv
2023
Cited By:
0
GPT-4 significantly outperformed the older GPT-3.5 chatbot and top medical residents on a medical exam; GPT-4 provided rationale for most responses.
Published By:
R. S. Huang - JMIR Medical Education
2023
Cited By:
6
The performance of three large language models and three professional populations in answering ophthalmology questions was evaluated; GPT-3.5 and PaLM2 were below the master's level, GPT-4 showed a level comparable to attending physicians.
Published By:
J. Holmes - arXiv.org
2023
Cited By:
1
Recently, large language model (LLM) based artificial intelligence (AI) systems have demonstrated remarkable capabilities in natural language understanding and generation. We start from a pre-trained general LLM (AntGLM-10B) and fine-tune it from a medical beginner into a medical expert (called AntGLM-Med-10B) via a 3-stage optimization procedure: general medical knowledge injection, medical domain instruction tuning, and specific medical task adaptation. We collect and construct large-scale medical datasets for each stage of the optimization process. These datasets encompass various data types and tasks, such as question answering, medical reasoning, multi-choice questions, and medical conversations. Specifically for multi-choice questions in the medical domain, we propose a novel Verification-of-Choice prompt engineering approach, which significantly enhances the reasoning ability of LLMs. Remarkably, by combining the above approaches, our AntGLM-Med-10B model can outperform most LLMs on PubMedQA, both general and medical, even when those LLMs have larger model sizes.
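The excerpt does not spell out how Verification-of-Choice works internally. The sketch below is a hypothetical reconstruction of a verification-style prompting loop, in which each candidate choice is checked independently by the model and the most consistently verified option is selected; `ask_model` is a stand-in for a real LLM API, and the prompt wording is an assumption.

```python
def verification_of_choice(question, options, ask_model):
    """Pick the option the model verifies as correct.

    Hypothetical reconstruction of verification-style prompting: each
    candidate is posed to the model as a yes/no verification query, and
    the choice with the most affirmative verdicts wins (ties go to the
    earliest option).
    """
    votes = {}
    for letter, option in zip("ABCDE", options):
        verify_prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {option}\n"
            "Is this candidate correct? Reply YES or NO."
        )
        reply = ask_model(verify_prompt)
        votes[letter] = 1 if reply.strip().upper().startswith("YES") else 0
    return max(votes, key=votes.get)


def fake_model(prompt):
    # Toy stand-in for an LLM, used only to exercise the loop.
    return "YES" if "Vitamin C" in prompt else "NO"


best = verification_of_choice(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    fake_model,
)
```

The design point is that verifying each choice in isolation gives the model a simpler binary judgment per call, rather than one joint ranking over all options.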
Published By:
Qiang Li - arXiv.org
2023
Cited By:
1
We investigate uses of large language models (LLMs) in hematology, assessing knowledge through hematology questions from the USMLE. We propose augmenting LLMs with retrieval of medical guidelines to eliminate incorrect information. Extracting information from documents could streamline decision making. We evaluated GPT-3.5-turbo and GPT-4. We tested 127 hematology question-answer pairs from USMLE hematology. GPT-3.5 accuracy was 63%; GPT-4 was 82%. We evaluated an information retrieval framework with 120 multiple-choice questions from the WHO 2017 myeloid neoplasms and acute leukemia guidelines. Questions were assessed with a zero-shot approach (posing the question and options to the model) and with retrieval of three relevant extracts. Zero-shot GPT-3.5 accuracy was 51%; GPT-4 was 71%. With retrieval, GPT-3.5 accuracy rose to 86% and GPT-4 to 97%. LLMs show substantial hematology knowledge. Ensuring consistent, safe responses is critical for medical use. Information retrieval significantly improved reliability and accuracy by enabling more informed, appropriate answers. The concept was validated with the WHO 2017 guidelines and can apply to any hematology documents. Leveraging LLMs could enhance clinical, educational, and research hematology work.
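The retrieval framework described above (fetch three relevant guideline extracts, prepend them to the question) can be sketched roughly as follows. Keyword overlap stands in for whatever retriever the authors actually used, and the prompt format and sample guideline snippets are assumptions for illustration.

```python
import re


def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_extracts(question, documents, k=3):
    """Rank guideline extracts by keyword overlap with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, options, documents):
    """Prepend the top-k retrieved extracts to a multi-choice question."""
    context = "\n".join(f"- {e}" for e in retrieve_extracts(question, documents))
    opts = "\n".join(f"{l}. {o}" for l, o in zip("ABCDE", options))
    return (
        f"Guideline extracts:\n{context}\n\n"
        f"Question: {question}\n{opts}\nAnswer:"
    )


guidelines = [
    "Iron deficiency causes microcytic anemia.",
    "Scurvy results from vitamin C deficiency and impairs collagen synthesis.",
    "Beta blockers reduce heart rate.",
    "Vitamin D deficiency causes rickets in children.",
]
prompt = build_rag_prompt(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    guidelines,
)
```

The resulting prompt would then be sent to the model in place of the bare zero-shot question, which is the step the abstract credits for the accuracy gains.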
Published By:
Maria R Cervera - Blood
2023
Cited By:
0
CEA may be beneficial for some asymptomatic patients with carotid stenosis. Trials reported fewer strokes with CEA than with medical therapy, but the differences were small and the perioperative harms were not trivial. Three fair-to-good RCTs compared CEA plus medical therapy with medical therapy alone for asymptomatic carotid stenosis. Our meta-analyses of these trials found that CEA reduced the risk of perioperative stroke/death plus ipsilateral stroke (4 fewer events/1000; 95% CI, -7 to -2), any stroke (5 fewer/1000; CI, -8 to -3), and non-perioperative ipsilateral stroke (3 fewer/1000; CI, -5 to -1). There was no difference in all-cause mortality. CIs were modestly wider in sensitivity analyses. ACST reported that over half of non-perioperative strokes with CEA were disabling or fatal; the proportional reduction was similar to that for any stroke. An ACAS subgroup analysis found a reduction for men but not women. Duplex ultrasonography can detect carotid artery stenosis, but its measurement properties vary; a systematic review found a pooled sensitivity/specificity of 94%/92% for >60% stenosis. No studies directly evaluated screening. Despite this uncertainty, CEA continues to be performed for some asymptomatic patients.
Published By:
D. Jonas - Annals of Internal Medicine
2014
Cited By:
96
Medical Teacher introduces concept maps linked to journal articles; the maps visualize ideas and generate new understanding. The feature includes focus questions that signal key aspects of each map, augmenting reader comprehension.
Published By:
D. Torre - Medical Teacher
2023
Cited By:
0