VGGISH FOR MUSIC/SPEECH CLASSIFICATION IN RADIO BROADCASTING
Serrano S.; Scarpa M. L.; Serghini O.
2024-01-01
Abstract
In the realm of audio signal processing, distinguishing between music and speech poses a significant challenge due to the nuanced similarities and complexities inherent in both domains. This study addresses this challenge by employing deep learning techniques to classify audio segments as either music or speech. Our approach uses the VGGish architecture with Mel-spectrograms as input, providing rich representations of audio signals. These representations enable our classification models to discern intricate patterns characteristic of music and speech. We evaluate the efficacy of our models in this classification task, focusing in particular on their performance across audio segments of varying window lengths. Through rigorous experimentation and evaluation, our models achieve accuracy exceeding 96% in distinguishing between music and speech. These findings underscore the effectiveness of deep learning models for this task, and this work contributes to the understanding of deep learning applications in audio signal processing.
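The front-end the abstract describes (log-Mel-spectrogram patches fed to a VGGish classifier) can be sketched in plain numpy. The 64 mel bands, 25 ms frames with a 10 ms hop, and 96-frame (0.96 s) patches below follow the standard VGGish input format; the helper names, frequency range, and triangular-filterbank details are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mel_filterbank(n_mels=64, n_fft=512, sr=16000, fmin=125.0, fmax=7500.0):
    # Triangular mel filterbank (HTK mel formula), mapping FFT bins to mel bands
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge of triangle
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge of triangle
    return fb

def log_mel_patch(wave, sr=16000, win=400, hop=160, n_fft=512, n_mels=64):
    # Frame the waveform: 25 ms Hann windows, 10 ms hop (VGGish framing)
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)
    frames = np.stack([wave[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # power spectrogram
    mel = spec @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-6)  # stabilized log-mel

# One 0.96 s windowed segment at 16 kHz -> a 96 x 64 log-mel patch,
# the input shape VGGish expects (random noise stands in for real audio).
wave = np.random.randn(16000)
patch = log_mel_patch(wave)[:96]
print(patch.shape)  # (96, 64)
```

Each such patch would then be passed through the VGGish convolutional stack, with a classification head predicting music vs. speech per windowed segment.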