An Approach to Enable Human–3D Object Interaction Through Voice Commands in an Immersive Virtual Environment
Catalfamo, Alessio; Celesti, Antonio; Fazio, Maria; Villari, Massimo
2025-01-01
Abstract
Nowadays, the Metaverse is facing many challenges. In particular, Virtual Reality (VR) applications that allow voice-based human–3D object interaction are still limited, because adopting Automated Speech Recognition (ASR) to interact with 3D objects through the user's voice is constrained by the hardware and software limitations of standalone headset devices. This paper aims to bridge this gap by proposing a methodology that addresses these issues. Specifically, starting from a Mel-Frequency Cepstral Coefficient (MFCC) extraction algorithm able to capture the distinctive characteristics of the user's voice, we pass the extracted features as input to a Convolutional Neural Network (CNN) model. Then, in order to integrate the CNN model with a VR application running on a standalone headset, such as the Oculus Quest, we convert it into the Open Neural Network Exchange (ONNX) format, an open standard for Machine Learning (ML) interoperability. The proposed system demonstrates good performance and provides a foundation for user-centric, effective computing systems that enhance the accessibility of VR environments through voice-based commands. Experiments show that a native CNN model developed with TensorFlow achieves performance comparable to that of the corresponding model converted into the ONNX format, paving the way towards VR applications running on headsets and controlled through the user's voice.
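The abstract outlines an MFCC extraction → CNN → ONNX export pipeline. The following is a minimal sketch of such a pipeline, assuming librosa for MFCC extraction, a small Keras CNN, and tf2onnx for the export; the number of coefficients, frame count, layer sizes, and class count are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of an MFCC -> CNN -> ONNX pipeline (illustrative assumptions,
# not the paper's actual configuration).
import librosa
import numpy as np
import tensorflow as tf
import tf2onnx

N_MFCC = 13        # number of cepstral coefficients (assumed)
N_FRAMES = 44      # fixed number of time frames per clip (assumed)
N_COMMANDS = 8     # number of voice commands to classify (assumed)

def extract_mfcc(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load an audio clip and return a fixed-size MFCC matrix (N_MFCC x N_FRAMES)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    # Pad or truncate along the time axis so every clip has the same shape.
    if mfcc.shape[1] < N_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, N_FRAMES - mfcc.shape[1])))
    return mfcc[:, :N_FRAMES]

# A small CNN over the MFCC "image"; the architecture is illustrative only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_MFCC, N_FRAMES, 1), name="mfcc"),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_COMMANDS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# ... model.fit(...) on labelled voice-command clips would go here ...

# Convert the trained Keras model to ONNX so it can be loaded by an ONNX
# inference runtime inside the VR application on a standalone headset.
spec = (tf.TensorSpec((None, N_MFCC, N_FRAMES, 1), tf.float32, name="mfcc"),)
tf2onnx.convert.from_keras(model, input_signature=spec,
                           output_path="voice_commands.onnx")
```

The exported .onnx file could then be loaded by whatever ONNX-compatible inference engine the VR application uses on the headset; the abstract does not state which engine the authors adopt.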


