2025 : 1 : 1
Samira Mavaddati

Academic rank: Assistant Professor
ORCID:
Education: PhD.
ScopusId:
HIndex: 0/00
Faculty: Faculty of Technology and Engineering
Address: University of Mazandaran
Phone: 011-35305126

Research

Title: A Voice Activity Detection Algorithm Using Deep Learning in Time-Frequency Domain
Type: Journal Paper
Keywords: Voice activity detector, Time-frequency representation, ResNet-32, Convolutional neural network, Recurrent neural network
Year: 2024
Journal: Neural Computing and Applications (NCAA)
DOI:
Researchers: Samira Mavaddati

Abstract

Voice activity detection (VAD) is an important component of signal processing and is critical for applications such as speech recognition, speaker recognition, and speaker identification, for example to eliminate background noise signals. With the increasing use of deep learning techniques in speech-based applications, VAD has become more accurate and efficient. In this paper, a novel VAD method based on deep neural networks is proposed, specifically a residual neural network (ResNet), a convolutional neural network (CNN) with a multilayer perceptron (MLP), and recurrent neural network (RNN) components. The results of the proposed VAD are compared with other state-of-the-art methods, and its effectiveness is demonstrated in basic speech enhancement in the presence of various types of noise. The proposed VAD combines time-frequency features, such as log-Mel spectrogram representations, with ResNet-32 as a flexible classifier to detect speech/no-speech activity. Its advantages include the ability to adapt to different types of background noise, such as stationary, non-stationary, and periodic noise, and flexibility in selecting appropriate deep learning models for different applications. The proposed method shows significant improvements over the other methods, which demonstrates its effectiveness for speech enhancement in noisy environments. The best performance is achieved by combining log-Mel spectrogram time-frequency features with ResNet-32 as the classifier for detecting speech/no-speech activity.
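The abstract names log-Mel spectrograms as the time-frequency input to the classifier. As a rough illustration of what such a feature looks like, the following is a minimal NumPy sketch, not the paper's implementation. The sample rate, FFT size, hop length, number of mel bands, the Hann window, and the function names are all assumptions chosen for the example; the abstract does not state the settings actually used.

```python
import numpy as np


def hz_to_mel(f):
    # Common (HTK-style) mel-scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        low, center, high = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        # Triangle: minimum of the two slopes, clipped at zero.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Return log-mel features of shape (n_frames, n_mels).

    All default parameter values are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # A small constant avoids log(0) on silent frames.
    return np.log(mel_energy + 1e-10)


# Usage: half a second of silence followed by a 1 kHz tone.
sr = 16000
t = np.arange(sr // 2) / sr
example = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 1000 * t)])
features = log_mel_spectrogram(example, sample_rate=sr)
```

With these assumed settings, each row of `features` is one 32 ms frame described by 40 mel-band log energies. A frame-level classifier such as the ResNet-32 mentioned in the abstract would take a stack of such rows as a 2-D input and output a speech/no-speech decision.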