
Research on Sequence Mapping Problems Based on Encoder-Decoder Models

Posted on: 2021-02-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J F Hou
Full Text: PDF
GTID: 1368330602494255
Subject: Information and Communication Engineering
Abstract/Summary:
In sequence modeling and processing, it is important to learn how to generate one sequence given another. We call this type of problem a sequence-to-sequence mapping problem (or sequence mapping problem, for short). Both machine translation (MT) and automatic speech recognition (ASR) belong to this class. Traditional methods divide the whole sequence mapping problem into several subproblems involving hand-crafted features, alignments between sequences, external linguistic knowledge, and so on. These subtasks are modeled individually and then combined to produce target sequences. With the rise of deep learning, the subtask models were replaced with neural networks, so that components such as the translation model, acoustic model, and language model could benefit from the powerful modeling capabilities of deep learning. However, systems built with the traditional approach are complicated to develop and deploy because they require a large amount of expert knowledge. Moreover, the errors of each submodule accumulate in the final combined system, since the submodules are difficult to optimize jointly. To address these issues, end-to-end methods have recently been proposed and have become very popular. As an end-to-end approach, the encoder-decoder model converts an input sequence into an output sequence directly, without any intermediate intervention. Encoder-decoder models, also known as sequence-to-sequence models, have therefore been widely applied to sequence mapping tasks such as machine translation and speech recognition, where they can yield performance comparable to or even better than traditional methods. Although encoder-decoder models are attractive, issues such as training efficiency and real-time recognition still need to be explored, and sequence-to-sequence architectures better suited to specific tasks are worth investigating. In this thesis, we therefore propose several new encoder-decoder models for the sequence mapping problems of machine translation and speech recognition.

First, recurrent neural networks (RNNs) are commonly adopted as the basic components of encoder-decoder models, which introduces a temporal dependency restriction: training is time-consuming because the items in a sequence cannot be processed in parallel. In response, we present a sequence-to-sequence model that replaces the RNNs with feedforward sequential memory networks (FSMNs) in both the encoder and the decoder, enabling the new architecture to encode the entire source sentence simultaneously. We also modify the attention module so that the decoder generates all outputs simultaneously during training. Thanks to the temporal independence of the FSMN-based encoder and decoder, we achieve results comparable to RNN-based models on a machine translation task while training about twice as fast.

Second, the attention mechanisms of conventional encoder-decoder models iterate over the entire input sequence to compute the attention weight vector for each output step, which makes them unsuitable for streaming tasks such as online speech recognition, where output symbols must be produced while the input sequence is only partially observed. In response, we propose two attention mechanisms, Gaussian prediction based attention and segment boundary detection directed attention, to enable online recognition with encoder-decoder models. In the first mechanism, the alignment between the output and input sequences is modeled as a Gaussian window of variable size that moves forward in time; the location and size of the window are determined by its mean and variance. At each attention step, the model predicts the window's forward increment along time from the previous window center, together with the variance, so that online speech recognition is realized. A minimal sketch of this windowed attention step follows.
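As a rough illustration only (not the exact parameterization used in the thesis), one attention step over the frames observed so far might look as follows. The names `gaussian_window_attention`, `delta`, and `log_var` are ours, and the small networks that would predict the increment and variance from the decoder state are omitted.

```python
import numpy as np

def gaussian_window_attention(enc_states, prev_mean, delta, log_var):
    """One step of Gaussian-window attention over the frames seen so far.

    enc_states: (T, d) encoder outputs observed so far.
    prev_mean:  previous window center (float).
    delta:      predicted non-negative forward increment (float).
    log_var:    predicted log-variance controlling the window size.
    """
    mean = prev_mean + delta                 # window center only moves forward
    var = np.exp(log_var)                    # variance sets the window size
    t = np.arange(enc_states.shape[0])       # frame indices observed so far
    scores = np.exp(-0.5 * (t - mean) ** 2 / var)
    weights = scores / scores.sum()          # normalized Gaussian weights
    context = weights @ enc_states           # context vector for the decoder
    return context, mean
```

Because the window center can only advance, the decoder never needs frames beyond the current window, which is what makes streaming decoding possible.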
In the second mechanism, to exploit the segmental structure of speech, segment boundary detection directed attention splits the input speech into successive segments at detected boundaries, so that different output symbols adaptively receive different chunk sizes and aggregate information within each segment by soft attention. Experimental results show that this mechanism achieves online performance comparable to state-of-the-art models. Segment boundary detection is formulated as a sequential decision-making problem and solved with a reinforcement learning (RL) algorithm, which validates the effectiveness of reinforcement learning for speech recognition.

Finally, commonly used encoder-decoder models fully exploit neither the monotonic input-output relation in ASR nor the short-term stationarity of speech. Moreover, the diffuse weighted sum over the input in soft attention, together with the inability to access all possible alignment paths, makes the classification of aligned inputs statistically less interpretable. In response, we propose a sequence-to-sequence ASR model equipped with sequential state modeling, which resembles the concept of state transitions in HMMs. Our method explicitly models the output state transition probability and the output emission probability, so that a more flexible and interpretable monotonic alignment distribution can be derived. Experiments on the TIMIT dataset show promising recognition performance, and the generated alignments and emission probabilities demonstrate the stepwise input-output mapping property of the proposed model. A sketch of the underlying alignment recursion is given below.
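To make the transition/emission idea concrete, here is a minimal sketch of a forward recursion over monotonic alignments under an HMM-like factorization. This is our illustrative simplification (a single scalar advance probability per frame, rather than the thesis's exact parameterization), and the names `monotonic_forward`, `emit`, and `trans` are hypothetical.

```python
import numpy as np

def monotonic_forward(emit, trans):
    """Total probability of an output sequence under a monotonic,
    HMM-like alignment model (illustrative sketch).

    emit:  (T, U) array; emit[t, u] is the emission probability of
           output state u at input frame t.
    trans: (T,) array; trans[t] is the probability of advancing to
           the next output state at frame t (1 - trans[t] = stay).
    """
    T, U = emit.shape
    alpha = np.zeros((T, U))                 # alpha[t, u]: prob. of being
    alpha[0, 0] = emit[0, 0]                 # in state u after frame t
    for t in range(1, T):
        for u in range(U):
            stay = alpha[t - 1, u] * (1.0 - trans[t])
            move = alpha[t - 1, u - 1] * trans[t] if u > 0 else 0.0
            alpha[t, u] = (stay + move) * emit[t, u]
    return alpha[T - 1, U - 1]               # all frames consumed, last state
```

Because the recursion sums over every monotonic alignment path rather than a single diffuse weighting, the resulting alignment distribution is explicit and can be inspected directly, which is the interpretability property the abstract refers to.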
Keywords/Search Tags:Encoder-Decoder Model, Neural Machine Translation, Non-Recurrent Structure, Online Speech Recognition, Reinforcement Learning, Sequential State Modeling