
Research On Speech Recognition Method Based On Feature Fusion Under Attention Mechanism

Posted on: 2022-06-12    Degree: Master    Type: Thesis
Country: China    Candidate: L X Zhou    Full Text: PDF
GTID: 2518306779491584    Subject: Computer Hardware Technology
Abstract/Summary:
Speech recognition is a key technology that enables machines to understand human language. In a speech recognition system, the acoustic features of the speech and the recognition model are the two main factors that determine performance. The most commonly extracted acoustic features are Mel-Frequency Cepstral Coefficients (MFCC). These are bottom-level features computed from the raw speech; they usually contain redundant information, which interferes with recognition and degrades performance. Research in recent years has shown that top-level features obtained by sparse representation methods outperform traditional features in speech recognition, because such features are highly discriminative. However, speech recognition based on sparse representation mostly relies on GMM-HMM and DNN-HMM classification models, which are complex to train and cannot use the context of the speech data to support the current prediction. Moreover, speech data is complex and variable: a single feature parameter extracted from the signal cannot fully express the information hidden in the speech, and a stronger recognition model can better exploit the information carried by the features and thus improve recognition performance. This thesis carries out further research on these problems, which can be summarized as follows:

(1) To address the problem that a single feature cannot fully express complex speech information, this thesis proposes three feature fusion structures that capture speech information from different perspectives; three fusion features are built from bottom-level features and top-level sparse representations. Dictionary learning is a key step in extracting sparse representations. A dictionary learning algorithm usually takes the whole training set as the initial samples for training the dictionary, which is very time-consuming when the training data is large. To address this, the thesis improves the dictionary training process: a fixed number of samples is selected from each speech category to form a small training set that serves as the initial samples for dictionary learning, and the learned dictionary is then used to obtain the sparse representation of all speech data. Among the three fusion features, the bottom-and-top-level fusion structure fully combines the useful information in the bottom-level and top-level features and captures the semantic correlation between the two feature types. Experiments show that the proposed fusion features outperform single features, and the fusion feature (MSR) extracted by the bottom-and-top-level fusion structure performs best in the classification model. Training the dictionary on the constructed small data set also greatly speeds up dictionary learning, and the sparse representation extracted with the learned dictionary still performs well in recognition.
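The dictionary-learning and fusion steps described in (1) can be illustrated with a short sketch. The Python code below is only a minimal illustration, assuming per-utterance MFCC vectors as the bottom-level features, scikit-learn's DictionaryLearning for the sparse coding, and simple concatenation as the bottom-and-top fusion; the function names, subset size, and dictionary size are hypothetical and not taken from the thesis.

import numpy as np
import librosa
from sklearn.decomposition import DictionaryLearning

def mfcc_features(wave, sr, n_mfcc=13):
    # Bottom-level feature: mean MFCC vector over all frames of an utterance.
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def build_subset(features, labels, n_per_class=50):
    # Select a fixed number of samples from each class, forming the small
    # training set used as the initial samples for dictionary learning.
    subset = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0][:n_per_class]
        subset.append(features[idx])
    return np.vstack(subset)

def extract_fused_features(features, labels, n_atoms=256, alpha=1.0):
    # Learn the dictionary on the reduced subset only, then sparse-code
    # every sample with it to obtain the top-level representation.
    dico = DictionaryLearning(n_components=n_atoms, alpha=alpha,
                              transform_algorithm='lasso_lars')
    dico.fit(build_subset(features, labels))
    sparse_codes = dico.transform(features)
    # Bottom-and-top fusion: concatenate each MFCC vector with its sparse code.
    return np.hstack([features, sparse_codes])

Because the dictionary is fitted on the per-class subset rather than the full training set, the learning time drops, while every utterance still receives a sparse code from the same dictionary.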
(2) Traditional speech recognition based on sparse representation relies on single features and on classification models that are complex to train. To address this, the thesis combines MSR with a bidirectional long short-term memory network (BiLSTM) under an attention mechanism, exploiting the learning ability of the neural network and the weight-assignment mechanism of the attention model. Using the computational structure of attention and the ability of the BiLSTM to capture bidirectional semantics, the model can assign large weights to the features most relevant to the current prediction and thereby improve recognition performance. Because the extracted fusion features contain more, and more discriminative, speech information, increasing their weights further strengthens their influence on the predicted value and thus improves speech recognition performance. Experiments show that combining MSR, which unites the advantages of bottom-level and top-level features, with the attention-based model gives the best recognition results: its phoneme recognition error rate is much lower than that of traditional single features and of the other fusion features, and it is highly robust. This further shows that, for input features containing more useful information, combining them with an attention mechanism strengthens the input's influence on the target output and thereby improves the performance and robustness of the speech recognition system.
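The attention-weighted BiLSTM described in (2) can be sketched in the same spirit. The following PyTorch code is a minimal illustration rather than the thesis's actual architecture: the additive scoring layer, the hidden size, the 39-class output (a common phoneme-set size), and the fused-feature dimension are all assumptions made for the example.

import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, feat_dim, hidden=128, n_classes=39):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # scores each time step
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h, _ = self.bilstm(x)                    # (batch, time, 2*hidden)
        # Attention: softmax over time assigns larger weights to the frames
        # most relevant to the current prediction.
        w = torch.softmax(self.attn(h), dim=1)   # (batch, time, 1)
        context = (w * h).sum(dim=1)             # weighted sum over time
        return self.fc(context)                  # class logits

# Example: a batch of 8 utterances, 100 frames each, 269-dimensional
# fused features (e.g. 13 MFCCs + 256 sparse-code atoms).
model = AttnBiLSTM(feat_dim=269)
logits = model(torch.randn(8, 100, 269))

The weighted sum over the BiLSTM outputs is what lets features carrying more useful information receive larger weights and exert more influence on the prediction, which is the effect the thesis attributes to combining MSR with the attention mechanism.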
Keywords/Search Tags: speech recognition, feature fusion, sparse representation, dictionary learning, attention mechanism