Autoencoder-based architecture for Italian Speech Emotion Recognition

Patane, Luca; Maio, Antonino; Serrano, Salvatore; Sapuppo, Francesca; Xibilia, Maria Gabriella
2025-01-01

Abstract

Speech Emotion Recognition (SER) is an important area of research at the intersection of artificial intelligence and human-computer interaction, with applications such as AI-based assistants and autonomous driving. Traditional SER techniques in the literature are usually based on deep learning models and rely on feature engineering to reduce the amount of data and improve classification accuracy. In this paper, a novel autoencoder-based method for SER is proposed and validated on EMOVO, the first Italian emotional speech dataset. One specialized autoencoder is trained per emotion to reconstruct the global features of that emotion. Classification is then performed by passing an utterance's features through every specialized autoencoder and selecting the emotion whose autoencoder yields the smallest reconstruction error. The proposed approach utilises a lightweight, scalable architecture that enables accurate emotion recognition with lower computational complexity, thus significantly reducing training time compared with traditional approaches. These results underline the potential of this method to improve emotion-based applications in the Italian language context.
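The classification rule described in the abstract (one autoencoder per emotion, prediction by smallest reconstruction error) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not describe their network or features, so a closed-form linear autoencoder (equivalent to PCA) stands in for the specialized autoencoders, and the feature vectors, emotion labels, and every function name below are hypothetical, with synthetic data in place of EMOVO.

```python
import numpy as np


class LinearAutoencoder:
    """Stand-in for one specialized autoencoder (assumption: linear, fitted via SVD)."""

    def __init__(self, n_latent):
        self.n_latent = n_latent

    def fit(self, X):
        # Centre the data and keep the top principal directions as the code space.
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_latent]
        return self

    def reconstruction_error(self, X):
        # Encode, decode, and return the per-sample mean squared error.
        centered = X - self.mean_
        codes = centered @ self.components_.T
        recon = codes @ self.components_
        return np.mean((centered - recon) ** 2, axis=1)


def fit_per_emotion(X, y, n_latent):
    """Train one autoencoder on the samples of each emotion only."""
    return {label: LinearAutoencoder(n_latent).fit(X[y == label])
            for label in np.unique(y)}


def predict(models, X):
    """Assign each sample to the emotion whose autoencoder reconstructs it best."""
    labels = list(models)
    errors = np.stack([models[lab].reconstruction_error(X) for lab in labels], axis=1)
    return np.array(labels)[errors.argmin(axis=1)]


def make_synthetic(rng, labels, n_per_class, n_features=20, n_latent=2):
    """Hypothetical 'global feature' vectors: one noisy low-dimensional cluster per label."""
    X_parts, y_parts = [], []
    for label in labels:
        mean = rng.normal(scale=5.0, size=n_features)
        basis = rng.normal(size=(n_latent, n_features))
        latent = rng.normal(size=(n_per_class, n_latent))
        noise = 0.05 * rng.normal(size=(n_per_class, n_features))
        X_parts.append(mean + latent @ basis + noise)
        y_parts.append(np.full(n_per_class, label))
    return np.concatenate(X_parts), np.concatenate(y_parts)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emotions = ["anger", "joy", "sadness"]  # illustrative labels, not the EMOVO label set
    X, y = make_synthetic(rng, emotions, n_per_class=200)
    idx = rng.permutation(len(y))
    train, test = idx[:450], idx[450:]
    models = fit_per_emotion(X[train], y[train], n_latent=2)
    accuracy = np.mean(predict(models, X[test]) == y[test])
    print(f"accuracy on synthetic data: {accuracy:.2f}")
```

Because each model is trained only on its own emotion, adding a new emotion under this scheme means fitting one more autoencoder rather than retraining a joint classifier, which is one way to read the scalability claimed in the abstract.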
Use this identifier to cite or link to this document: https://hdl.handle.net/11570/3338351