AI DRIVEN CLINICAL DECISION SUPPORT SYSTEMS: A RETRIEVAL AUGMENTED GENERATION APPROACH FOR HEALTHCARE DELIVERY AND EFFICIENCY
DOI: https://doi.org/10.48047/v90mq567
Abstract
Clinical decision support systems have become increasingly important in modern healthcare, yet many language model-based approaches remain limited by unsupported responses, insufficient contextual grounding, and inadequate reliability for routine clinical use. To address these limitations, this study proposes a guideline-grounded retrieval-augmented generation (RAG) framework that combines dense semantic retrieval with large language model-based answer generation for healthcare question answering. The framework was developed using the epfl_llm/guidelines dataset, from which a retrieval corpus of 970,584 text chunks was constructed through systematic preprocessing, recursive text chunking, metadata preservation, embedding generation, and vector indexing in ChromaDB. Three embedding models (all-MiniLM-L6-v2, E5-base-v2, and BGE-base-v1.5) were evaluated alongside three language model configurations (Phi-3 Mini, LLaMA 7B, and GPT-4o-mini) to assess retrieval effectiveness, answer relevance, contextual alignment, and inference efficiency on a manually curated set of 56 clinical questions. The results show that retrieval quality strongly influences final response quality, with BGE-base-v1.5 achieving the highest retrieval performance across all ranking metrics. Furthermore, the RAG-based framework consistently outperformed direct language model generation across all lexical and semantic evaluation metrics, confirming the benefit of grounding generated responses in retrieved clinical evidence. A practical trade-off between response quality and inference latency was also observed across model configurations. These findings suggest that guideline-grounded retrieval-augmented generation is a promising and practically viable approach for developing more trustworthy, context-aware, and evidence-based clinical decision support systems.
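The pipeline summarized in the abstract (recursive chunking, embedding, indexing, retrieval, then grounded generation) can be illustrated with a minimal, dependency-free sketch. This is not the study's implementation, and every name in it is hypothetical: a toy bag-of-words vector stands in for the dense embedding models, an in-memory list stands in for ChromaDB, the sample passages are placeholder text rather than real guideline content, and the final step only assembles a prompt instead of calling a language model.

```python
import math
import re
from collections import Counter


def recursive_chunk(text, max_chars=60, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars, trying coarse separators first."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for pos, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily merge neighbouring pieces while they still fit in one chunk.
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Pieces that are still too long are split again with finer separators.
        out = []
        for chunk in chunks:
            out.extend(recursive_chunk(chunk, max_chars, separators[pos + 1:]))
        return out
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Return the k indexed chunks most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)[:k]


def build_prompt(question, hits):
    """Assemble a grounded prompt; a real system would send this to a language model."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Placeholder passages (not real guideline text); the source id is kept as chunk metadata.
docs = [
    {"source": "guideline_a",
     "text": "This section describes blood pressure measurement. "
             "It lists cuff sizes and patient positioning."},
    {"source": "guideline_b",
     "text": "This section describes asthma inhaler technique. "
             "It lists spacer devices and follow-up intervals."},
]

index = [
    {"source": doc["source"], "text": chunk, "vector": embed(chunk)}
    for doc in docs
    for chunk in recursive_chunk(doc["text"], max_chars=60)
]

question = "How is blood pressure measured?"
hits = retrieve(question, index, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

Keeping the source identifier alongside each chunk mirrors the metadata preservation step mentioned in the abstract, since it lets a generated answer be traced back to the passage that supported it.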
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
