
Research On Attention-Based End-to-End Speech Recognition

Posted on: 2019-09-26    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Long    Full Text: PDF
GTID: 2428330566970943    Subject: Information and Communication Engineering
Abstract/Summary:
Driven by big data and artificial-intelligence technology, end-to-end continuous speech recognition systems have emerged. In such a system, a sequence-to-sequence model based on recurrent neural networks establishes a direct mapping from the input sequence of speech features to the output sequence of phonemes (or characters). Compared with a traditional speech recognition system, an end-to-end system has a simpler structure, stronger generality, and no dependence on linguistic knowledge. Experiments show, however, that training an end-to-end system requires more corpus data, computing resources, and time to reach the performance of a traditional system. Improving end-to-end speech recognition by adjusting the model structure and designing new algorithms is therefore a current research hotspot in the field of speech recognition. This thesis introduces end-to-end speech recognition systems based on connectionist temporal classification (CTC) and on the attention-based encoder-decoder model (hereafter the "attention model"), builds baseline systems, and focuses on improving and innovating on the main problems of the existing attention model. The main work and innovations are as follows:

1. To address the large parameter scale of the attention model and the slow convergence of its parameters during training, the Gated Recurrent Unit (GRU) used by the recurrent neural network in the original model is replaced with the Minimal Gated Unit (MGU). The MGU is a simplified form of the GRU: its temporal modeling capability is close to that of the GRU while it contains fewer parameters, so the replacement effectively reduces the parameter size of the attention model. Experimental results show that the MGU-based attention model reduces training time compared with the original model, with only a small loss in performance.

2. To address the inaccurate alignment between phonemes and features in the attention-based speech recognition system, an adaptive-width window function is proposed to limit the attention range, and a pooling layer is added to the convolutional neural network that computes the attention coefficients. The adaptive-width window narrows the attention distribution according to the actual pronunciation length of a phoneme, preventing attention from falling on feature regions unrelated to the current phoneme (a minimal sketch of such a window appears after this list); the pooling layer reduces the noise that the convolutional network introduces into the computed coefficients. Experimental results show that the alignment between phonemes and features in the improved model's recognition results is significantly more accurate, and the recognition accuracy of the system also improves.

3. To address the low recognition accuracy and the large number of training iterations caused by the lack of effective parameter initialization in the attention model, a speech recognition system based on a bottleneck feature-extraction network is proposed. In this system, a bottleneck feature-extraction network based on a deep belief network serves as the front end and provides more discriminative and robust speech features for the back-end attention model, so that the number of recurrent-neural-network layers in the attention model can be reduced, which in turn reduces the number of training iterations and parameters while improving the robustness and accuracy of recognition. Furthermore, the connectionist temporal classification algorithm is used as the objective function to train the bottleneck feature-extraction network and to combine it with the attention model, achieving an integration of the two end-to-end models. Experimental results show that after the attention model is combined with the bottleneck feature-extraction network, both recognition accuracy and training speed improve significantly.
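The sketch referenced in item 2 follows. It is an illustrative NumPy example of restricting attention weights to an adaptive-width window around the previously attended frame; the function name and the `prev_peak`/`window_radius` parameters are assumptions for illustration, and the way the thesis actually adapts the width to phoneme duration is not reproduced here.

```python
import numpy as np

def windowed_attention(scores, prev_peak, window_radius):
    """Apply an adaptive-width window to raw attention energies.

    scores        : 1-D array of attention energies over encoder frames
    prev_peak     : frame index that received the most attention at the
                    previous decoding step
    window_radius : half-width of the allowed region (the thesis adapts the
                    width to the expected pronunciation length of a phoneme)
    """
    mask = np.full_like(scores, -np.inf)
    lo = max(0, prev_peak - window_radius)
    hi = min(len(scores), prev_peak + window_radius + 1)
    mask[lo:hi] = 0.0                        # frames inside the window are kept
    masked = scores + mask                   # frames outside get -inf energy
    weights = np.exp(masked - masked.max())  # numerically stable softmax
    return weights / weights.sum()

# Example: attention is confined to frames 8..16 of a 40-frame utterance.
energies = np.random.randn(40)
alpha = windowed_attention(energies, prev_peak=12, window_radius=4)
```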
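Similarly, the third contribution (a CTC-trained bottleneck feature extractor feeding the attention model) can be pictured with a minimal PyTorch sketch. The thesis does not specify a framework, layer sizes, or label inventory; the class name `BottleneckFrontend`, the dimensions, and the dummy data below are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckFrontend(nn.Module):
    """Bottleneck feature extractor pre-trained with a CTC objective (illustrative)."""
    def __init__(self, feat_dim=40, hidden=512, bottleneck=64, num_phones=50):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.bottleneck = nn.Linear(hidden, bottleneck)    # narrow "bottleneck" layer
        self.post = nn.Linear(bottleneck, num_phones + 1)  # +1 output for the CTC blank

    def forward(self, x):                   # x: (batch, frames, feat_dim)
        bn = self.bottleneck(self.pre(x))   # bottleneck features for the attention model
        return bn, self.post(bn)

frontend = BottleneckFrontend()
ctc_loss = nn.CTCLoss(blank=50)              # blank index matches the extra output unit

feats = torch.randn(8, 120, 40)              # dummy batch: 8 utterances, 120 frames each
bn_feats, logits = frontend(feats)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # CTCLoss expects (T, N, C)
targets = torch.randint(0, 50, (8, 20))              # dummy phoneme label sequences
loss = ctc_loss(log_probs, targets,
                torch.full((8,), 120, dtype=torch.long),
                torch.full((8,), 20, dtype=torch.long))
loss.backward()  # pre-train the frontend; bn_feats would then replace the raw
                 # features at the input of the attention-based encoder-decoder
```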
Keywords/Search Tags: Speech Recognition, Attention-based Model, Minimal Gated Unit, Alignment, Bottleneck Feature