
Research On Speech Keyword Spotting Technology In Noisy Environments

Posted on: 2022-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: P W Ye
GTID: 2518306764967179
Subject: Automation Technology
Abstract/Summary:
As lifestyles change, more and more high-tech products are appearing in our lives, including voice-controlled smart products, which free people's hands completely compared with traditional touch control. However, most current research on speech keyword spotting confines the scene to a quiet environment and ignores the various environmental noises found in real-life scenes, so existing algorithms that perform well may not maintain that performance in noisy environments. Meanwhile, speech keyword spotting models are generally deployed on embedded devices, which imposes requirements of low resource usage, small memory, and little computation. Therefore, to ensure the robustness of speech keyword spotting models in noisy environments while keeping power consumption low, the work in this thesis is as follows.

1. To address the lack of noisy datasets, this thesis uses several noises from real-life scenes to synthesize noisy datasets with different signal-to-noise ratios (SNRs), based on the VCTK dataset for speech enhancement models and the Google Speech Commands Dataset for speech keyword spotting. For denoising, the thesis trains a speech enhancement model based on SEWUNet and applies several data augmentation methods, including spectral subband hybrid masking and spectral normalization.

2. The thesis designs a joint training model that consists of a speech enhancement model and a keyword spotting model. The speech enhancement model improves speech quality so that the keyword spotting model works better, using a modified SNR-based loss function and a two-stage training method. Experiments are conducted with different backbones. The results demonstrate that the improved speech enhancement model effectively improves speech quality, and that the joint training method achieves 6.7% higher accuracy in noisy environments than other models without speech enhancement.

3. To address the high resource consumption of speech enhancement models, the thesis proposes a noise-robust, small-footprint speech keyword spotting model. Instead of relying on an independent speech enhancement model, the keyword spotting model applies frequency-based data augmentation before training. It uses a new backbone that combines 2-D frequency-time convolution with 1-D temporal convolution to reduce the number of parameters and the amount of computation while preserving feature extraction capability, together with a Conv Att module that is integrated with the backbone in a relative manner so that the model learns the important information. In tests, the proposed keyword spotting model improves accuracy in noisy environments by 3.1% compared with the joint training model, while its number of parameters is only 1/300 of that of the joint training model.
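
The noisy-dataset synthesis described in item 1 can be illustrated with a minimal sketch. The abstract does not give the exact mixing procedure, so what follows is only the common approach of scaling a noise clip so that the mixture reaches a target SNR. The function names mix_at_snr and measured_snr_db are hypothetical, and looping short noise clips to match the speech length is an assumption.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Assumption: loop the noise clip if it is shorter than the speech,
    # then trim it to the same length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def measured_snr_db(speech, mixture):
    """Compute the SNR of a mixture, given the clean speech it contains."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for real audio: a 1-second 440 Hz tone at 16 kHz and a
    # shorter white-noise clip.
    t = np.arange(16000) / 16000
    clean = np.sin(2 * np.pi * 440 * t)
    background = rng.normal(size=4000)

    for target in (-5, 0, 5, 10):
        noisy = mix_at_snr(clean, background, target)
        print(target, round(measured_snr_db(clean, noisy), 3))
```

Running the same mixing step over a range of target SNRs, as in the loop above, is one way to obtain several noisy versions of each clean utterance.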
Keywords/Search Tags: Noise-robust, Speech Enhancement, Joint Training, Dimension Transform Convolution, Relative Attention