Speech keyword spotting (KWS) is a critical technology for human-computer interaction. It attends to short segments of an audio stream and usually serves as the intelligent wake-up interface of a device: only when the user utters a specific instruction or word are the complex downstream modules triggered, which lets the device run for long periods in a low-power standby state. In recent years, with the rapid development of deep learning, the performance of KWS systems has improved dramatically. Nevertheless, KWS still faces many challenges, such as data imbalance, inefficient sample utilization, and slow training. This thesis studies these problems; its specific contributions are as follows:

1. A number-of-errors guided re-weighted loss function that alleviates the impact of data imbalance. Data imbalance is common in KWS training: a large amount of diverse negative training data, including samples whose pronunciation is similar to the keyword, is needed to reduce false alarms, and such negative data is easy to collect, whereas positive keyword data is expensive. During training, the many easily classified negative samples overwhelm the loss and dominate the backpropagated gradient, yielding a degenerate model. To address this, the thesis proposes a novel re-weighted loss. It measures each sample's importance by its number of detection errors during training and automatically down-weights the contribution of easy examples, most of which are negative, so that training focuses on the samples that deserve more attention (see the first sketch below). The method alleviates the imbalance naturally while efficiently using all available data. On several sets of keywords selected from AISHELL-1 and AISHELL-2, it achieves 16%–38% relative reductions in false rejection rate over the standard loss at 0.5 false alarms per keyword per hour.

2. A sample utilization strategy based on class uncertainty that improves training efficiency. In the conventional deep learning paradigm, every data point contributes equally regardless of the underlying distribution, and every sample participates throughout training. This is inefficient, because the "learning difficulty" of a sample varies across training stages: large numbers of easy examples that the model already classifies correctly keep participating without restriction, wasting computation. Analogous to the "three zones" theory of human cognition ("only by choosing activities in the learning zone can one make progress"), this thesis proposes a learning mechanism that concentrates training on "learning zone" samples. It constructs a sample utilization probability from the model's outputs in a feedback manner, focusing on under-trained samples near the decision boundary, and in the middle and late stages of training it removes the many samples that carry little learning value for the current model, i.e., it trains on a subset (see the second sketch below). Several KWS experiments on the Google Speech Commands dataset show that the method reduces training time by 59.47%–64.86% relative to the original approach while accuracy drops by only about 1%. An image classification experiment on CIFAR-10 further verifies its effectiveness: with a relative accuracy reduction of only 0.85%, training time is reduced by 65.07%.
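
To make the first contribution concrete, the following is a minimal PyTorch sketch of an error-count-based re-weighted loss, assuming a classifier over keyword/filler classes and a dataset that exposes per-sample indices (sample_ids). The counter update, the smoothing term, and the weighting function are illustrative assumptions, not the thesis's exact formulation.

import torch
import torch.nn.functional as F

class ErrorCountReweightedLoss:
    """Cross-entropy re-weighted by each sample's running count of
    detection errors: samples that are rarely misclassified (mostly
    easy negatives) receive small weights. Illustrative sketch only."""

    def __init__(self, num_samples, smooth=1.0):
        self.err = torch.zeros(num_samples)  # one error counter per sample
        self.smooth = smooth                 # keeps never-wrong samples nonzero
        self.epochs_seen = 0

    def __call__(self, logits, targets, sample_ids):
        # sample_ids: CPU LongTensor of dataset indices for this batch.
        pred = logits.argmax(dim=1)
        self.err[sample_ids] += (pred != targets).float().cpu()
        # Weight ~ fraction of epochs in which the sample was misclassified.
        w = (self.err[sample_ids] + self.smooth) / (self.epochs_seen + 1 + self.smooth)
        w = w.to(logits.device)
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        return (w * per_sample).sum() / w.sum()  # weighted mean

    def end_epoch(self):
        self.epochs_seen += 1

Normalizing by w.sum() keeps the loss on the same scale as an unweighted mean, so the learning rate need not be retuned as the weights evolve over training.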
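
Similarly, a minimal sketch of the second contribution's uncertainty-driven sample utilization, assuming softmax outputs and a top-2-margin notion of "closeness to the decision boundary"; the margin rule, the confidence threshold, and the floor probability are hypothetical choices for illustration rather than the thesis's exact construction.

import torch

def utilization_probability(probs, targets, floor=0.05, conf_threshold=0.9):
    """Map softmax outputs to a per-sample probability of being used.
    Samples near the decision boundary (small top-2 margin) are kept;
    confidently correct ones fall back to a small floor so they are
    still revisited occasionally. Illustrative sketch only."""
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    top2 = probs.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]            # small margin = uncertain
    p_keep = (1.0 - margin).clamp(floor, 1.0)
    easy = p_true > conf_threshold               # confidently correct
    return torch.where(easy, torch.full_like(p_keep, floor), p_keep)

def select_subset(logits, targets):
    """Draw a Bernoulli mask from the utilization probabilities;
    only the kept subset of the batch enters backpropagation."""
    probs = torch.softmax(logits.detach(), dim=1)
    mask = torch.bernoulli(utilization_probability(probs, targets)).bool()
    return mask

Because the probabilities are recomputed from the model's current outputs at every step, the kept subset shrinks in the middle and late stages of training as more samples become confidently correct, which is the feedback behavior the abstract describes.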