
Sequence Modeling And Decoding In Speech Recognition

Posted on: 2020-05-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z H Chen
Full Text: PDF
GTID: 1488306218989479
Subject: Computer Science and Technology

Abstract:
A unique phenomenon in human speech is that the acoustic wave and the linguistic word sequence have different lengths. Automatic speech recognition (ASR) therefore requires both pattern classification and alignment modeling between input and output sequences, which together form a sequence prediction problem. In the training stage, the model takes acoustic features as input and their labeling as output; here sequence modeling and pattern classification are the two keys that determine the upper bound of a speech recognizer. In the inference stage, the recognizer searches for the label sequence whose acoustic and language models best match the input features, a process called decoding, which determines recognition speed and precision in real applications.

The most recent milestone in ASR is the application of deep neural networks (DNNs) to acoustic and language modeling. However, these successful applications are still based on the traditional formulation of speech recognition and aim only at improving the pattern classification above. In this thesis, the remaining sequence modeling and decoding problems are systematically investigated in modern DNN-based ASR.

We propose, for the first time, sequence modeling solutions for two unconventional ASR tasks: keyword spotting (KWS) and overlapped speech recognition. Traditional sequence modeling research is usually conducted on the acoustic models of conventional speech recognition systems. Although keyword spotting and overlapped speech recognition are both inherently sequence prediction problems, they have not benefited from sequence modeling, owing to the lack of proper criteria and the difficulty of obtaining alternative sequence hypotheses for discriminative training. We propose to solve these problems in the lattice-free discriminative training framework (a reference form of the underlying sequence criterion is sketched below), in which the competing hypotheses are efficiently modeled by phoneme- or sub-word-level language models. Moreover, this framework takes speaker tracing and speech separation errors in overlapped speech into account: we propose a discriminative training formulation for overlapped speech recognition that additionally penalizes competing outputs from the overlapped speech. Our sequence modeling solutions achieve significant improvements on both unconventional ASR tasks and show the potential to be combined with transfer learning and joint training.

For decoding, we propose algorithmic speedups along two lines: parallelizing the Viterbi search algorithm, and decreasing the algorithm's complexity through label synchronous decoding. Firstly, we propose a parallel Viterbi search algorithm, implement it on GPU, and open-source the implementation, achieving large speedups. Since most arcs of the weighted finite state transducers (WFSTs) used in ASR are independent of each other, the search algorithm has the potential to be parallelized; however, the classical algorithm and its implementations are inherently serial, and the few previous attempts have many limitations. We propose a series of solutions and redesign the algorithm: token recombination as an atomic GPU operation to reduce synchronization overheads; a dynamic load balancing strategy for more efficient scheduling of token passing among GPU threads; and a redesign of the exact lattice generation and lattice pruning algorithms for better GPU utilization. Experiments on the Switchboard corpus show that the proposed method achieves identical ASR precision while running 3 to 15 times faster. Additionally, we obtain a 46-fold speedup with sequence parallelism and Multi-Process Service (MPS) on GPU.
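As a point of reference for the lattice-free discriminative training described above: such training typically optimizes an MMI-style sequence criterion. The form below is the textbook maximum mutual information objective with generic notation (X_u is the feature sequence of utterance u, W_u its reference transcript, kappa an acoustic scale), not necessarily the exact criterion used in this thesis; in the lattice-free setting, the denominator sum is computed over a graph built from a phoneme- or sub-word-level language model rather than from word lattices.

    \mathcal{F}_{\mathrm{MMI}}
      = \sum_{u=1}^{U} \log
        \frac{p(\mathbf{X}_u \mid W_u)^{\kappa}\, P(W_u)}
             {\sum_{W'} p(\mathbf{X}_u \mid W')^{\kappa}\, P(W')}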
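To make the token recombination step above concrete, the following is a minimal sequential Python sketch of one frame of token passing; all names (Token, recombine, expand_frame) are illustrative and not taken from the released code. On the GPU, the comparison inside recombine() is performed as a single atomic compare-and-exchange on a packed (cost, token index) word, so that thousands of threads can expand arcs concurrently without locks.

    from dataclasses import dataclass

    @dataclass
    class Token:
        cost: float           # accumulated Viterbi cost (lower is better)
        prev: "Token | None"  # back-pointer for lattice generation

    def recombine(frontier: dict, state: int, new_tok: Token) -> None:
        """Keep only the best token per WFST state.

        On the GPU this comparison is one atomic operation, which is
        what removes the need for locks; here it runs sequentially.
        """
        best = frontier.get(state)
        if best is None or new_tok.cost < best.cost:
            frontier[state] = new_tok

    def expand_frame(frontier: dict, arcs, frame_costs) -> dict:
        """One frame of token passing over emitting WFST arcs.

        arcs: iterable of (src_state, dst_state, ilabel, weight);
        frame_costs[ilabel]: acoustic cost of ilabel at this frame.
        Each arc can be processed independently, which is what makes
        the loop body a natural unit of GPU parallelism.
        """
        next_frontier: dict = {}
        for src, dst, ilabel, weight in arcs:
            tok = frontier.get(src)
            if tok is not None:
                cost = tok.cost + weight + frame_costs[ilabel]
                recombine(next_frontier, dst, Token(cost, tok))
        return next_frontier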
Secondly, based on blank symbol modeling, we systematically propose label synchronous decoding (LSD), which transforms the search process from the frame level to the label level and obtains significant speedups. The dominant decoding method today is frame synchronous Viterbi beam search, whose complexity is linear in the length of the acoustic sequence. We propose to transform this search from the frame level to the label level, so that complexity becomes linear in the length of the linguistic word sequence. Namely, we exploit the blank structure of the model outputs and apply efficient post-processing of blank symbols during inference, before running the Viterbi search (a minimal sketch of this blank-skipping idea is given after the applications below). The proposed framework can be applied to both generative and discriminative sequence models. Experiments on the Switchboard corpus show a 2-4 times speedup in search without performance deterioration. Moreover, significantly better phone- and word-level lattices, called LSD lattices, can be obtained from the LSD methods above.

We further improve a series of ASR systems based on LSD lattices, including keyword spotting, a unified confidence measure framework, and acoustic-to-word (A2W) end-to-end modeling. In KWS, we introduce phoneme-level confusion at the inference stage through an efficient minimum edit distance post-processing on CTC lattices, improving precision and robustness. We propose two distinct confidence measure algorithms based on CTC lattices, both of which show consistently high quality, and we propose an auxiliary normalization graph built on CTC lattices that serves as a unified search space for confidence measures in various ASR applications. We utilize modular training to improve A2W modeling and to better exploit external knowledge sources, and we propose an LSD-based joint training strategy that yields better modeling precision and faster inference.
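The minimum edit distance post-processing used for phoneme-level confusion in KWS can be pictured as a Levenshtein computation between the keyword's phoneme string and hypotheses read off the CTC lattice. The sketch below is the textbook dynamic program with illustrative names, not the thesis's optimized implementation.

    def edit_distance(keyword: list, hypothesis: list) -> int:
        """Levenshtein distance between two phoneme sequences.

        A lattice hypothesis whose distance to the keyword is small
        (e.g. <= 1) can be accepted as a confusable match, trading a
        little precision for recall and robustness.
        """
        m, n = len(keyword), len(hypothesis)
        # dp[i][j]: distance between keyword[:i] and hypothesis[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = dp[i - 1][j - 1] + (keyword[i - 1] != hypothesis[j - 1])
                dp[i][j] = min(sub,                 # substitution / match
                               dp[i - 1][j] + 1,    # deletion
                               dp[i][j - 1] + 1)    # insertion
        return dp[m][n]

    # e.g. edit_distance("K IY W ER D".split(), "K IY W AO D".split()) == 1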
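Returning to the blank-skipping idea behind LSD: the hypothetical Python snippet below collapses frame-level CTC posteriors into a short sequence of label-level candidate sets, so a subsequent search only visits label positions and its cost scales with the number of labels rather than frames. The threshold, merging rule, and names are illustrative assumptions, not the thesis's actual procedure.

    import numpy as np

    def collapse_blanks(posteriors: np.ndarray,
                        blank_id: int = 0,
                        blank_thresh: float = 0.95,
                        top_k: int = 5):
        """Reduce T frame posteriors to a label-level search space.

        posteriors: (T, V) per-frame label distributions from a
        CTC-style model.  Frames dominated by the blank symbol are
        skipped, and consecutive frames with the same best label are
        merged, mirroring the usual CTC collapsing rule.
        Returns a list of (frame_index, top-k label ids) pairs.
        """
        positions = []
        prev_best = blank_id
        for t, frame in enumerate(posteriors):
            if frame[blank_id] >= blank_thresh:
                prev_best = blank_id          # blank separates labels
                continue
            best = int(frame.argmax())
            if best == prev_best:
                continue                      # merge repeated frames
            topk = np.argsort(frame)[::-1][:top_k]
            positions.append((t, topk.tolist()))
            prev_best = best
        return positions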
In conclusion, this thesis improves the sequence modeling and decoding frameworks of modern DNN-based ASR. Substantial speedups over current speech recognizers can be obtained by combining the proposed parallel Viterbi search algorithm with label synchronous decoding, and a low-power, high-quality unified ASR system can be built on this work on sequence training and inference. The system is verified on various datasets, and several of its algorithms are open-sourced.

Keywords: Speech Recognition, Sequence Modeling, Decoding, Parallel Computing, Label Synchronous Decoding, Confidence, Sequence Discriminative Training