
VGGISH FOR MUSIC/SPEECH CLASSIFICATION IN RADIO BROADCASTING

Serrano S.; Scarpa M. L.; Serghini O.
2024-01-01

Abstract

In audio signal processing, distinguishing between music and speech poses a significant challenge due to the nuanced similarities and complexities inherent in both domains. This study addresses the challenge by employing deep learning techniques to classify audio segments as either music or speech. Our approach uses the VGGish architecture with Mel-spectrogram inputs, which provide rich representations of audio signals. These representations serve as inputs to our classification models, enabling them to discern intricate patterns characteristic of music and speech. We evaluate the models on this classification task, focusing in particular on their performance across audio segments of various window lengths. Through rigorous experimentation and evaluation, the models achieve accuracy exceeding 96% in distinguishing between music and speech. These findings underscore the effectiveness of deep learning models for this task, and the work contributes to the understanding of deep learning applications in audio signal processing.
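The abstract's Mel-spectrogram front end can be illustrated with a minimal sketch. The parameters below (16 kHz audio, 25 ms Hann window, 10 ms hop, 64 mel bands over 125–7500 Hz, log compression) follow the published VGGish input specification; the function name and the pure-NumPy implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, win=400, hop=160,
                        n_mels=64, fmin=125.0, fmax=7500.0):
    """Log-mel features shaped like VGGish inputs (illustrative sketch)."""
    # Frame the signal: 25 ms Hann windows with a 10 ms hop at 16 kHz.
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    frames = np.stack([signal[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=win))      # (n_frames, win//2 + 1)

    # Triangular mel filterbank spanning fmin..fmax.
    n_bins = win // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax),
                                    n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = mel_pts[m], mel_pts[m + 1], mel_pts[m + 2]
        fb[m] = np.maximum(0.0, np.minimum((freqs - lo) / (ctr - lo),
                                           (hi - freqs) / (hi - ctr)))

    mel = spec @ fb.T                              # (n_frames, n_mels)
    return np.log(mel + 1e-6)                      # log offset for stability
```

Feature matrices of this shape would then be cut into fixed-length windows and fed to the VGGish network for the music/speech decision.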
978-3-937436-84-5

Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3300940
