AI DRIVEN CLINICAL DECISION SUPPORT SYSTEMS: A RETRIEVAL AUGMENTED GENERATION APPROACH FOR HEALTHCARE DELIVERY AND EFFICIENCY

Abdur Rahman Lindon; Hafiz Aziz Khan; Nusrat Yasmin Nadia; Habibor Rahman Rabby; Md Habibul Arif

doi:10.48047/v90mq567

Authors

Abdur Rahman Lindon, Department of Information Technology, Washington University of Science and Technology, 2900 Eisenhower Ave, Alexandria, VA 22314, USA
Hafiz Aziz Khan, Department of Information Technology, Washington University of Science and Technology, 2900 Eisenhower Ave, Alexandria, VA 22314, USA
Nusrat Yasmin Nadia, Department of Information Technology, Washington University of Science and Technology, 2900 Eisenhower Ave, Alexandria, VA 22314, USA
Habibor Rahman Rabby, Department of Computer Science, Campbellsville University, 2300 Greene Way #100, Louisville, KY 40220, USA
Md Habibul Arif, Department of Information Technology, Washington University of Science and Technology, 2900 Eisenhower Ave, Alexandria, VA 22314, USA

Abstract

Clinical decision support systems have become increasingly important in modern healthcare, yet many language model-based approaches remain limited by unsupported responses, insufficient contextual grounding, and inadequate reliability for routine clinical use. To address these limitations, this study proposes a guideline-grounded retrieval-augmented generation (RAG) framework that combines dense semantic retrieval with large language model-based answer generation for healthcare question answering. The framework was developed using the epfl_llm/guidelines dataset, from which a retrieval corpus of 970,584 text chunks was constructed through systematic preprocessing, recursive text chunking, metadata preservation, embedding generation, and vector indexing in ChromaDB. Three embedding models (all-MiniLM-L6-v2, E5-base-v2, and BGE-base-v1.5) were evaluated alongside three language model configurations (Phi-3 Mini, LLaMA 7B, and GPT-4o-mini) to assess retrieval effectiveness, answer relevance, contextual alignment, and inference efficiency on a manually curated set of 56 clinical questions. The results show that retrieval quality strongly influences final response quality, with BGE-base-v1.5 achieving the highest retrieval performance across all ranking metrics. Furthermore, the RAG-based framework consistently outperformed direct language model generation across all lexical and semantic evaluation metrics, confirming the benefit of grounding generated responses in retrieved clinical evidence. A practical trade-off between response quality and inference latency was also observed across model configurations. These findings suggest that guideline-grounded retrieval-augmented generation is a promising and practically viable approach for developing more trustworthy, context-aware, and evidence-based clinical decision support systems.
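The pipeline summarized in the abstract (recursive chunking, metadata preservation, embedding, indexing, retrieval, and grounded prompting) can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation. Several pieces are assumptions made so the example runs with the Python standard library alone: a toy bag-of-words vector stands in for the dense embedding models, an in-memory list stands in for ChromaDB, the chunker is a simplified recursive splitter that does not merge small pieces, and the two guideline snippets in `DOCS` are invented placeholder text rather than data from the study or clinical guidance. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Coarse-to-fine split points (an assumption; real splitters are configurable).
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_chunk(text, max_chars=200, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing until pieces fit."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_chunk(piece, max_chars, rest))
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, max_chars=200):
    """Chunk each (source_id, text) pair and keep the source as metadata."""
    index = []
    for source_id, text in docs:
        for n, chunk in enumerate(recursive_chunk(text, max_chars)):
            index.append({"id": f"{source_id}#{n}", "source": source_id,
                          "text": chunk, "vector": embed(chunk)})
    return index


def retrieve(index, query, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:k]


def reciprocal_rank(retrieved, relevant_source):
    """1/rank of the first chunk from the relevant source, else 0."""
    for rank, rec in enumerate(retrieved, start=1):
        if rec["source"] == relevant_source:
            return 1.0 / rank
    return 0.0


def build_prompt(question, retrieved):
    """Assemble a grounded prompt; the wording is illustrative only."""
    context = "\n".join(f"[{r['id']}] {r['text']}" for r in retrieved)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# Invented placeholder text, not taken from the dataset and not clinical advice.
DOCS = [
    ("hypertension",
     "Adults with confirmed hypertension should be offered lifestyle advice.\n\n"
     "First-line drug treatment is an ACE inhibitor for adults under 55."),
    ("asthma",
     "Offer a short-acting beta agonist as reliever therapy for asthma.\n\n"
     "Review inhaler technique at every asthma visit."),
]

if __name__ == "__main__":
    index = build_index(DOCS, max_chars=80)
    question = "Which reliever therapy is offered for asthma?"
    hits = retrieve(index, question, k=3)
    print(build_prompt(question, hits))
    print("reciprocal rank:", reciprocal_rank(hits, "asthma"))
```

In a setup closer to the one the abstract describes, `embed` would be replaced by one of the evaluated embedding models and the list-based index by a ChromaDB collection. Averaging `reciprocal_rank` over a question set gives mean reciprocal rank, one common ranking metric. The abstract does not say which ranking metrics were used, so that choice is also an assumption.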
