Hybrid vision-language models for improved transparency in healthcare processes: The retinal diagnosis use case

La Rosa F.;Dell'Acqua P.;Fazio M.;Villari M.
2026-01-01

Abstract

Despite CNNs' high accuracy in medical image analysis, their opaque nature limits widespread clinical adoption, as practitioners are skeptical of predictions lacking a clear rationale. This critical trust barrier necessitates new approaches that provide transparent, reliable, and actionable insights from CNNs, enabling their effective integration into healthcare processes. This paper addresses the issue by proposing a novel hybrid diagnostic pipeline that combines the predictive power of CNNs with the interpretive capabilities of Large Language Models (LLMs). Leveraging the LLM's ability to generate human-like text and draw on clinical reasoning, our solution produces transparent explanations for CNN-based diagnoses. The approach is demonstrated on retinal diseases, where a ConvNeXt V2 model and a Contrastive Language-Image Pretraining (CLIP) feature-extraction approach are integrated for clinical classification and interpretation. This hybrid vision-language strategy aims to deliver both high predictive accuracy and the human-readable accountability needed to foster clinical trust.
Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3351229