Comparing CNNs and ViTs for Medical Image Classification Leveraging Transfer Learning

Lonia, G.; Ciraolo, D.; Fazio, M.; Villari, M.; Celesti, A.

doi:10.1109/ISCC61673.2024.10733732

In recent years, medical image analysis has progressed significantly, mainly due to substantial advances in deep learning methods. Over the past decade, Convolutional Neural Networks (CNNs) have been the dominant models for image classification, demonstrating remarkable success in various medical applications. However, the advent of Vision Transformers (ViTs) has challenged the dominance of CNN approaches. This study explores the potential of ViTs in healthcare by comparing their performance with that of CNN models. CNNs have traditionally excelled at image feature extraction through convolutional operations; ViTs, on the other hand, rely on self-attention mechanisms and can model long-range dependencies, which enables them to capture complex patterns within images. After analyzing both architectures, we assessed the behaviour of from-scratch and pre-trained models, highlighting their differences in performance and shedding light on the applicability of the Transfer Learning (TL) approach in the healthcare scenario.
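
The contrast between local convolutional operations and self-attention can be illustrated with a toy sketch. It is not the models evaluated in the study: it assumes a 1D sequence of scalar "patch" values, a fixed 3-tap kernel, and single-head attention with identity projections, and all function names are hypothetical. It perturbs the first token and records which output positions change.

```python
import math

def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]

def self_attention(x):
    """Single-head scaled dot-product self-attention over scalar tokens.

    Assumption: query, key and value projections are the identity and the
    feature dimension is 1, so the usual 1/sqrt(d) scale equals 1.
    """
    out = []
    for q in x:
        scores = [q * key for key in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * v for w, v in zip(weights, x)))
    return out

def changed_positions(f, x, index, delta=1.0, tol=1e-12):
    """Return the output positions of f that change when x[index] is perturbed."""
    base = f(x)
    perturbed = list(x)
    perturbed[index] += delta
    new = f(perturbed)
    return [i for i, (a, b) in enumerate(zip(base, new)) if abs(a - b) > tol]

tokens = [0.1, 0.4, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2]  # illustrative values only
kernel = [0.25, 0.5, 0.25]                            # illustrative smoothing kernel

conv_reach = changed_positions(lambda x: conv1d_same(x, kernel), tokens, 0)
attn_reach = changed_positions(self_attention, tokens, 0)

print("convolution:", conv_reach)
print("self-attention:", attn_reach)
```

In this sketch a single convolution layer propagates the change only to neighbouring positions within the kernel width, whereas one self-attention layer propagates it to every position. Stacked convolutions widen the receptive field gradually, so the comparison applies to a single layer only.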