Command word recognition is an important means of human-computer interaction. With the rapid development of embedded devices and the intelligent upgrading of traditional electronic products, it has broad development prospects in embedded devices such as smart home devices, wearable devices, and smart in-vehicle devices. Current speech recognition algorithms have surpassed human recognition ability given sufficient computing resources and quiet environments. However, embedded command word recognition is limited by the low computing resources and small storage space of the devices, so high-precision large models and complex algorithms cannot be used. Moreover, noise interference and far-field attenuation are common in the usage scenarios of embedded command word recognition, and directly applying recognition models trained for regular conditions performs poorly. In this context, this thesis investigates command word recognition methods to improve the performance of embedded command word recognition in target scenarios.

For the problem of poor command word recognition performance under noise interference and far-field attenuation, this thesis proposes an acoustic model adaptation method based on data augmentation. On the basis of the TDNN model, a lattice-free discriminative training criterion is introduced to convert training on speech feature frames into training on phoneme sequences; experiments show that the WER is reduced by 11% on the airport broadcast dataset compared with the traditional training method based on the cross-entropy loss function. Exploiting the redundancy of contextual information in the TDNN structure, a 1-state HMM topology and a frame-subsampling mechanism are introduced to reduce training and decoding computation, which cuts decoding time by 64% on the airport broadcast dataset compared with the traditional 3-state HMM setup without frame subsampling. Given the lack of real-scene speech data, this thesis implements data-augmentation-based scene adaptation on top of the scene-mismatched model, for instance by adding real-scene noise at multiple SNRs, combined with front-end processing methods such as gain control and noise suppression; experiments show that the WER can be reduced by up to 87% on the airport broadcast dataset.

For the problem of limited model storage space on embedded devices, this thesis proposes a command word recognition model compression method based on multiple network components. Starting from a fully connected neural network, a TDNN-F structure with semi-orthogonal factorized layers is introduced; experiments show that this reduces model storage by 62% with little loss of recognition accuracy. On this basis, TDNN-F networks with different model parameters are investigated, and the effects of network depth, layer dimension, number of PDFs, and output-layer dimension on recognition performance are analyzed. Based on the compressed TDNN-F model, this thesis further reduces the model size by 50% through Int16 fixed-point quantization of the Float32 weights, and experiments show that the performance loss on the airport broadcast dataset is limited.
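To make the Int16 step concrete, the following is a minimal sketch of symmetric fixed-point quantization of Float32 weights under the assumption of a single per-tensor scale; the helper names quantize_int16 and dequantize_int16 and the matrix size are illustrative, not the thesis implementation.

import numpy as np

def quantize_int16(weights):
    # Symmetric per-tensor quantization: Float32 weights -> Int16 values plus one Float32 scale.
    scale = max(float(np.max(np.abs(weights))) / 32767.0, 1e-12)
    q = np.clip(np.round(weights / scale), -32768, 32767).astype(np.int16)
    return q, scale

def dequantize_int16(q, scale):
    # Recover approximate Float32 weights for inference.
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(512, 256).astype(np.float32)        # placeholder weight matrix
    q, scale = quantize_int16(w)
    w_hat = dequantize_int16(q, scale)
    print(f"storage: {w.nbytes} B -> {q.nbytes} B")         # Int16 halves the per-weight storage
    print(f"max abs error: {np.max(np.abs(w - w_hat)):.2e}")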
Finally, based on the compressed model, the effect of different filler weights on recognition performance is studied, yielding two models limited to 1 MB and 3 MB in size: the former achieves 79.6% recall with 5.4% FAR, and the latter 79.0% recall with 2.1% FAR.
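Similarly, the multi-SNR noise mixing used for scene adaptation can be sketched as follows; the function name mix_at_snr, the placeholder signals, and the SNR grid are assumptions for illustration rather than the exact pipeline used in the thesis.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale real-scene noise so the mixture reaches the requested SNR, then add it to the speech.
    noise = np.resize(noise, speech.shape)                   # loop or trim the noise to the utterance length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000).astype(np.float32)   # 1 s placeholder utterance at 16 kHz
    scene = rng.standard_normal(48000).astype(np.float32)   # placeholder real-scene noise recording
    augmented = [mix_at_snr(clean, scene, snr) for snr in (0, 5, 10, 15)]  # example SNR grid (assumed)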