Study On Speech Wake-up Word Detection Methods Based On Deep Learning

Posted on: 2024-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Yu
GTID: 2568307076998019
Subject: Control Science and Engineering
Abstract/Summary:
Speech wake-up word detection is the entry point to speech interaction: it monitors the user's speech stream and identifies specific predefined words. The field has made significant progress, and a variety of software and hardware products have been deployed in scenarios such as human-computer interaction, mobile speech assistants, smart speakers, smart headphones, and smart homes. In particular, the rise of deep learning in recent years has led to wake-up word detection based purely on neural networks, with better performance. However, because device computing resources and memory are limited, detection performance in complex scenarios still needs improvement. This study applies deep learning to improve wake-up word detection in complex scenarios while reducing the system's parameter count. The main work is as follows:

(1) Variations in speech speed cause intra-class differences among instances of the same wake-up word, which can degrade detection performance. To address this, the study proposes a speech speed normalization method, which adjusts wake-up word features of different lengths to the same length. To implement it, two SE-Res2Net networks are used as detection modules in a two-stage wake-up word detection system. First, one SE-Res2Net network detects the local information of the wake-up word and determines the candidate wake-up word segment in the audio stream. The candidate segment is then normalized for speech speed, and the other SE-Res2Net network detects the global information of the wake-up word from the normalized segment. Finally, the detection results of the two networks are fused to decide whether the wake-up word is triggered. Experimental results on the Mobvoi dataset show that, compared with the best baseline, the proposed two-stage system achieves relative reductions in false rejection rate of 45% and 44% on the two wake-up words, with a parameter count of only 88K, which is 41% less than the best baseline.

(2) To improve detection performance while reducing model complexity, the study introduces the Ghost module into the SE-Res2Net network and proposes the Ghost-SE-Res2Net network, which replaces the SE-Res2Net network in the two-stage system. The Ghost module reduces model parameters by generating ghost features. Compared with the SE-Res2Net network, the Ghost-SE-Res2Net network has 27% fewer parameters. Detection performance is further improved by replacing the global average pooling layer with an attention pooling layer, which assigns different weights to input features according to their importance so that the model can focus on critical information. Experimental results on the Mobvoi dataset show that the Ghost-SE-Res2Net-based system achieves at least a 12% reduction in false rejection rate compared with the SE-Res2Net-based system.

In summary, this thesis proposes a two-stage, deep-learning-based wake-up word detection method built on speech speed normalization. The method has a simple framework and achieves good recognition results on the Mobvoi dataset in realistic complex scenarios. Introducing the Ghost module both reduces model complexity and improves performance. The method can also be fine-tuned and applied to various low-resource devices.
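The speech speed normalization step in (1) and the attention pooling layer in (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how features are adjusted to a common length, so linear interpolation along the time axis is assumed here. The function names, the 50-frame target, the 40-dimensional features, and the single query vector used to score frames are all hypothetical.

```python
import numpy as np


def normalize_length(features: np.ndarray, target_frames: int) -> np.ndarray:
    """Resample a (frames, dims) feature matrix to `target_frames` frames.

    Assumption: linear interpolation along time; the source only states that
    features of different lengths are adjusted to the same length.
    """
    num_frames, num_dims = features.shape
    if num_frames == 1:
        # A single frame cannot be interpolated, so it is simply repeated.
        return np.repeat(features, target_frames, axis=0)
    src_pos = np.linspace(0.0, 1.0, num_frames)     # original frame positions
    dst_pos = np.linspace(0.0, 1.0, target_frames)  # resampled frame positions
    columns = [np.interp(dst_pos, src_pos, features[:, d]) for d in range(num_dims)]
    return np.stack(columns, axis=1)


def attention_pool(features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Pool (frames, dims) features into one (dims,) vector.

    Each frame is scored against a hypothetical query vector; a softmax over
    the scores gives the per-frame weights. Global average pooling is the
    special case in which every frame receives the same weight.
    """
    scores = features @ query
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features


# A fast and a slow utterance of the same word give segments of different length.
rng = np.random.default_rng(0)
fast_segment = rng.standard_normal((30, 40))  # 30 frames, 40-dim features
slow_segment = rng.standard_normal((80, 40))  # 80 frames, 40-dim features

fast_norm = normalize_length(fast_segment, 50)  # shape (50, 40)
slow_norm = normalize_length(slow_segment, 50)  # shape (50, 40)

# With an all-zero query every frame has equal weight, so this is mean pooling.
pooled = attention_pool(fast_norm, np.zeros(40))  # shape (40,)
```

After normalization both segments have the same number of frames, so a second-stage network can compare them on a fixed-size input regardless of how quickly the word was spoken. In a trained model the query vector would be learned, so that informative frames receive larger weights than in average pooling.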
Keywords/Search Tags:speech wake-up word detection, speech rate normalization, Res2Net, Ghost module, two-stage method