Research On Speech Keyword Spotting Technology Supporting Custom

Posted on: 2024-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: M Du
Full Text: PDF
GTID: 2568307079454404
Subject: Information and Communication Engineering
Abstract/Summary:
Keyword spotting (KWS) is the task of detecting the presence of specific words in audio. As a simple and direct means of human-computer interaction, it has attracted growing attention. With the spread of smart devices, however, detecting only predefined keywords no longer meets the diverse and personalized needs of consumers, and custom keyword detection has become a priority. Because training samples for custom keywords are scarce, most existing custom keyword detection algorithms rely on large neural networks based on phoneme classification, running either locally or in the cloud. This conflicts with people’s demand for privacy and with the trend toward smaller, lower-power devices. Furthermore, supporting only Mandarin does not match the reality of China's numerous dialects. To address these issues, and building on an extensive study of previous methods, this thesis designs a neural network-based dynamic template matching algorithm that detects custom keywords offline and in real time. The main contributions are as follows.

First, to strengthen the network's ability to extract features from speech signals, a combination of convolutional and recurrent layers is designed as a deep feature extractor that exploits the temporal and spectral characteristics of the Fbank features used as network input. With properly sized convolutional kernels, phoneme-length features can be extracted more effectively. The recurrent layers then relate the features of neighboring phonemes to one another, which improves the feature representation.

Second, to reduce the influence of variation in a speaker's voice at different times and to improve the generalization of the registration templates, an attention mechanism is introduced after the deep feature extractor. It computes the similarity between the registration templates and the test template in real time to obtain attention scores, which are used to dynamically update the registration templates and thereby improve keyword detection. Compared with direct detection using the features from the deep feature extractor, accuracy improves by 5.6%. Overall, the accuracy of 11-class custom keyword detection on the GSCD dataset reaches 91.56%; on a self-made Chinese speech keyword dataset, the accuracy is 91.33% and the FRR at 1 FA/hour is 7.80%.

Third, so that the system can be deployed on resource-constrained, power-limited terminal devices, a voice activity detection module is designed that reduces the system's dynamic power consumption during idle periods to 5% of that in the working state. Various hardware-side optimization techniques are applied to reduce resource usage, power consumption, and system latency. Ultimately, the system is deployed on an FPGA development board based on the Xilinx A7 chip, with a total power consumption of 0.229 W at a 10 MHz clock and a response latency of approximately 31 ms.
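The attention-based template update described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the use of cosine similarity, the softmax with a `temperature` parameter, and the hypothetical names `l2_normalize`, `softmax`, and `attention_match`. The inputs stand in for embeddings that a deep feature extractor would produce.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_match(enrolled, test, temperature=0.1):
    """Hypothetical attention-weighted template matching.

    enrolled: (N, D) embeddings of N registration utterances.
    test:     (D,) embedding of the utterance being checked.
    Returns the detection score and the attention weights.
    """
    enrolled_n = l2_normalize(np.asarray(enrolled, dtype=float))
    test_n = l2_normalize(np.asarray(test, dtype=float))

    # Cosine similarity of each registration template to the test embedding.
    sims = enrolled_n @ test_n

    # Attention scores: templates closer to the test utterance get more weight
    # (assumed softmax; temperature controls how sharp the weighting is).
    weights = softmax(sims / temperature)

    # Dynamically updated template: attention-weighted sum of the templates.
    template = weights @ enrolled_n

    # Final score: cosine similarity between the updated template and the test.
    score = float(l2_normalize(template) @ test_n)
    return score, weights

if __name__ == "__main__":
    enrolled = np.array([[1.0, 0.0], [0.0, 1.0]])
    test = np.array([1.0, 0.1])
    score, weights = attention_match(enrolled, test)
    print("score:", score, "weights:", weights)
```

In this sketch a keyword would be declared present when `score` exceeds a threshold chosen on held-out data. Weighting the templates per test utterance, rather than averaging them once at registration, is what makes the template "dynamic" here.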
Keywords/Search Tags: Speech Keyword Spotting, Custom Keywords, Neural Network