Hybrid vision-language models for improved transparency in healthcare processes: The retinal diagnosis use case
La Rosa F., Dell'Acqua P., Fazio M., Villari M.
2026-01-01
Abstract
Despite CNNs' high accuracy in medical image analysis, their opaque nature limits widespread clinical adoption, as practitioners are skeptical of predictions lacking a clear rationale. This trust barrier necessitates new approaches that provide transparent, reliable, and actionable insights from CNNs so they can be effectively integrated into healthcare processes. This paper addresses the issue by proposing a novel hybrid diagnostic pipeline that combines the predictive power of CNNs with the interpretive capabilities of Large Language Models (LLMs). Leveraging the LLM's ability to generate human-like text and draw on clinical reasoning, our solution produces transparent explanations for CNN-based diagnoses. The approach is demonstrated on retinal diseases, where a ConvNeXt V2 model and a Contrastive Language-Image Pretraining (CLIP) feature-extraction approach are integrated for clinical classification and interpretation. This hybrid vision-language strategy aims to deliver both high predictive accuracy and the human-readable accountability needed to foster clinical trust.
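The pipeline described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the model calls are stubbed (the abstract does not specify the actual APIs, prompts, or concept vocabulary), but the control flow reflects the described stages — a ConvNeXt V2 classifier produces a label, CLIP-style similarity surfaces clinical concepts, and an LLM combines both into a human-readable explanation.

```python
# Hypothetical sketch of the hybrid CNN + CLIP + LLM diagnostic pipeline.
# All three model calls are stubs so the flow runs without any weights;
# labels, concepts, and the prompt format are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Diagnosis:
    label: str
    confidence: float
    explanation: str


def cnn_classify(image) -> Tuple[str, float]:
    # Stub for ConvNeXt V2 inference: returns (predicted label, confidence).
    return "diabetic retinopathy", 0.93


def clip_similar_concepts(image, vocabulary: List[str]) -> List[str]:
    # Stub for CLIP: rank clinical-concept text prompts by image-text
    # similarity and keep the top matches (here, simply the first two).
    return vocabulary[:2]


def llm_explain(label: str, confidence: float, concepts: List[str]) -> str:
    # Stub for the LLM call: a prompt template would combine the CNN
    # prediction with CLIP-retrieved findings into a clinical rationale.
    findings = ", ".join(concepts)
    return (f"The model predicts {label} ({confidence:.0%} confidence), "
            f"consistent with observed findings: {findings}.")


def diagnose(image) -> Diagnosis:
    # The three stages: classify, ground in vision-language space, explain.
    label, conf = cnn_classify(image)
    concepts = clip_similar_concepts(
        image, ["microaneurysms", "hard exudates", "drusen", "hemorrhage"])
    return Diagnosis(label, conf, llm_explain(label, conf, concepts))


result = diagnose(image=None)
print(result.explanation)
```

The design point the abstract makes is visible in `diagnose`: the explanation is conditioned on both the CNN's prediction and image-grounded concepts, rather than generated from the label alone.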


