| Voice activity detection(VAD)is a very important preprocessing technology in speech signal processing.Its goal is to determine the voice part and the non-voice part from the speech signal to facilitate the subsequent processing,e.g.automatic speech recognition,speaker recognition.With the increasing popularity of artificial intelligence and human-computer interaction,speech recognition becomes a very important task in speech signal processing and has a very wide range of application prospects.Therefore,the improvement of voice activity detection technology has also been valued by many researchers.As the first step of the speech recognition system,the effect of voice activity detection is crucial.This thesis introduces the most representative traditional voice activity detection algorithms,such as double-threshold method,variance method and spectral entropy method.For these traditional signal processing methods,we can see that in the case of higher SNR speech,we can get a better voice activity detection effect by adjusting the parameters and thresholds.However,experiments have found that in different types of speech or in different noise environments,the robustness of these algorithms is poor.In view of the above problems,the deep neural network(DNN)and convolutional neural network(CNN)are used as models,combined with the multi-channel voice signal collected by the microphone array,After the speech signal features of the single-channel,dual-channel,and five-channel are extracted,they are used as input to the classification model,and comparative experiments are conducted in this thesis.The CHIME3 voice data set was used in this experiment.The noise environments are four scenes of bus,cafe,pedestrian area,and street that are common in daily life.Comparing the experimental results shows that the multi-channel speech signal collected by the microphone arrays is used as an input,the classification effect of the DNN model and the CNN model can be effectively improved,and the classification effect of the neural network model has certain robustness to different types of noise environments. |