
Research On End-to-End Speech Recognition

Posted on: 2021-01-04    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C X Qin    Full Text: PDF
GTID: 1368330623982171    Subject: Information and Communication Engineering
Abstract/Summary:
Speech recognition technology has been broadly used in a variety of civil and military fields, and traditional speech recognition technology is well developed. In recent years, with the proposal and development of end-to-end speech recognition, speech recognition systems have gradually overcome the disadvantages of module-wise design and independence assumptions. This enables the model to be optimized jointly and makes it more suitable for hardware deployment. End-to-end models have achieved state-of-the-art results on many tasks; they have therefore become a promising speech recognition technology and a very popular research topic in recent years.

End-to-end speech recognition is based on deep learning and is modeled by a single integrated network, which raises three main issues. First, training an end-to-end model demands a large amount of labeled data. Second, the mechanism of the end-to-end model leaves much to optimize, since it is trained in a purely data-driven manner. Third, the end-to-end model lacks interpretability, since it uses deep networks for modeling. This dissertation focuses on these three key issues; the work covers four aspects: low-resource speech recognition, speech recognition with active learning and semi-supervised learning, optimization of the model structure and the training algorithm, and research on interpretability. The main contents of this dissertation are as follows:

1. End-to-end speech recognition with transfer learning. The end-to-end model lacks knowledge guidance because of its data-driven training, so in tasks with limited training data there is much room to improve model performance. This dissertation proposes a transfer learning-based end-to-end speech recognition method. First, a novel high-level feature extraction method using transfer learning is proposed. Inspired by the idea of data augmentation, multilingual training and language-adaptive training are first carried out to transfer knowledge from Spanish, Italian, German, and French to English. High-level features are then extracted by applying convex non-negative matrix factorization (CNMF) to the trained neural network; the resulting features are more robust and carry better high-level semantic expressions. Second, two types of joint end-to-end models are built upon the high-level features: a joint CTC-attention model with non-shared encoders, and a joint multi-CTC multi-resolution hierarchical attention model. By transferring the monotonic constraints of the CTC model and sharing complementary information between the sub-models, the joint model is enhanced under low-resource conditions. Experiments show the superiority of the transfer learning-based method over other methods, and the best-performing model achieves a state-of-the-art result on the TIMIT corpus.

2. End-to-end speech recognition with active learning and semi-supervised learning. To use the training data more efficiently and ultimately reduce the dependence of the end-to-end model on labeled data, this dissertation proposes a novel policy algorithm for evaluating unlabeled data for data augmentation. In the context of the attention-based speech recognition model, the algorithm is applied to both active learning and semi-supervised learning tasks. Specifically, a new way of representing the information of utterances is proposed: the average distance from each unlabeled utterance to the remaining utterances is calculated, and the N-best decoding probabilities are introduced into both the entropy of uncertainties and the exponential form of the expected average distances to obtain the final policy score. A variety of active learning and semi-supervised learning experimental results show the
superiority of the proposed algorithm over other methods. The results also demonstrate that the average distance term plays a greater role in the policy score in tasks with more utterances to augment.

3. Optimization of the model structure and the training algorithm for end-to-end speech recognition. In end-to-end speech recognition, the model structure and the training algorithm lack sufficient constraints, which leaves the model without guidance during training. This dissertation proposes optimization methods for both. First, to bring more sequential and long-term constraints to the attention-based model, a multi-level attention mechanism is proposed to make the attention structure deeper. This method uses the dot-product of two adjacent layer outputs in place of the encoder outputs when calculating attention scores, and uses residual connections between adjacent layers to calculate the attention context vectors. Furthermore, multi-level attention is combined with the multi-head structure to expand the attention in breadth, so that each attention head includes multi-level information. Second, to resolve the inconsistency between the training objective and the evaluation metric, and at the same time to alleviate overfitting and over-confident predictions, the evaluation metric is introduced into the training objective, yielding an evaluation-metric-regularized training criterion. In this method, the constant smoothing term of the label smoothing algorithm is replaced by the speech recognition evaluation metric, creating an adaptive regularization for smoothing. Experiments are categorized into structure optimization, training optimization, and comprehensive optimization. The results on TIMIT, WSJ, and LibriSpeech show that the multi-level attention mechanism performs significantly better than the conventional attention mechanism, and improves further when combined with the multi-head structure. Moreover, both the attention-based model and the Transformer perform significantly better with the proposed training objective than with the objective either with or without conventional label smoothing. The final model, using both the proposed attention and the proposed training objective, achieves state-of-the-art end-to-end results on TIMIT and WSJ and the best result among attention-based models on LibriSpeech.

4. Interpretability of the attention-based model. As an important branch of end-to-end speech recognition, the attention-based model relies on deep learning and behaves as a "black box": neither its intermediate outputs nor its training process offers enough transparency or interpretability. This dissertation conducts two types of explainability research in the context of the attention-based model. First, a visualization method for the encoder outputs is proposed, using the t-distributed stochastic neighbor embedding (t-SNE) algorithm together with a novel frame-level forced alignment obtained from the attention weights and prior knowledge. Then, the training dynamics of the model are analyzed at the phone level through canonical correlation analysis (CCA) over the encoder-output t-SNE embeddings segmented by phones. Experimental results show that each utterance is shaped into manifolds of symbols that appear sequentially in the ground-truth label. Based on these visualizations, a variety of comparisons are made between different models. The experiments further reveal the convergence properties of different types of phones and relate them to the corresponding recognition results.
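The policy score of contribution 2 can be sketched as follows. This is a minimal illustration only: it assumes the score is an additive combination of an entropy term over the N-best hypothesis probabilities (uncertainty) and an exponential term in the utterance's average distance to the remaining unlabeled utterances (representativeness). The function name, the weighting factor `alpha`, and the sign inside the exponential are assumptions for illustration, not the dissertation's exact formulation.

```python
import math

def policy_score(nbest_logprobs, avg_distance, alpha=1.0):
    """Score one unlabeled utterance for selection.

    nbest_logprobs: log-probabilities of the N-best decoding hypotheses.
    avg_distance: average distance from this utterance to the remaining
        unlabeled utterances (under some embedding of the utterances).
    alpha: assumed weighting between the two terms.
    """
    # Normalize the N-best scores into a probability distribution.
    probs = [math.exp(lp) for lp in nbest_logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Uncertainty: entropy over the N-best hypotheses; higher entropy
    # means the model is less sure about this utterance.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Representativeness: exponential form of the expected average
    # distance (sign chosen here so nearer utterances score higher).
    representativeness = math.exp(-avg_distance)
    return entropy + alpha * representativeness
```

Utterances with the highest scores would then be selected for manual labeling (active learning) or for pseudo-labeling and augmentation (semi-supervised learning).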
Keywords/Search Tags:speech recognition, end-to-end, transfer learning, active learning, semi-supervised learning, structure optimization, training optimization, interpretability