Research On Automatic Speech-Text Alignment For Mongolian Long Audio

Posted on: 2021-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: M J Niu
GTID: 2428330620976427
Subject: Computer Science and Technology
Abstract/Summary:
Automatic Speech Recognition (ASR) systems based on deep learning have been widely applied in various fields, and their acoustic models are trained on large-scale speech databases. At present, however, Mongolian speech databases are relatively small and cannot meet the requirements of a Mongolian large-vocabulary continuous speech recognition system, so expanding the Mongolian speech corpus is urgent. Manually recorded speech databases not only cost a great deal of manpower and material resources, but also differ from actual application scenarios. In the era of big data, long Mongolian speech recordings and their corresponding transcriptions can be obtained from the internet and from relevant institutions, and these resources help to expand the Mongolian speech database. Targeting Mongolian TV drama audio, the thesis studies Mongolian speech-text alignment methods based on ASR technology. The main contents and innovations of the thesis are as follows.

First, for speech-text alignment of Mongolian TV drama audio, the thesis implements automatic segmentation of the audio and improves the dialogue segmentation algorithm. Voice Activity Detection based on double thresholding is used to delete the silent parts of the audio. Hidden Markov models are built to detect and delete the social signals that appear frequently in Mongolian spoken dialogue. The dialogue is then segmented using a Bayesian distance matrix. Experiments show that the False Detection Rate of dialogue segmentation based on the Bayesian distance matrix is 4.22% lower than that of traditional dialogue segmentation based on Bayesian information.

Second, the thesis proposes a speech-text alignment algorithm based on intermediate-code RNN language model adaptation. The algorithm converts all Mongolian words into intermediate code and trains a general RNN language model, which is then fine-tuned on the drama texts. Meanwhile, an LDA feature is connected to the RNN network to generate a topic-related adaptive RNN language model. After speech recognition with the new RNN language model, every word in the ASR results and in the drama texts is divided into a stem and a suffix. The algorithm discards the suffixes and keeps the stems, and the stem is the unit of the subsequent alignment. Compared with the baseline system, the proposed alignment algorithm based on intermediate-code RNN language model adaptation improves Recall by 7.95% and F-score by 4.88%.

Finally, the thesis proposes a speech-text alignment algorithm based on a phone confusion matrix. The speech is decoded by the acoustic model to generate a phone sequence, and the phone sequence of the drama texts is generated by a G2P model. At the same time, the thesis extracts a portion of the speech to calculate a Mongolian phone confusion matrix. Both the Levenshtein alignment algorithm and the Needleman-Wunsch alignment algorithm are improved according to this confusion matrix. Compared with the baseline system, the alignment algorithm based on the phone confusion matrix improves Recall by 10.42% and F-score by 2.97%.
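
The abstract does not say how the confusion matrix enters the Levenshtein computation. As a rough illustration only, the sketch below shows one plausible scheme: substitutions between phones that the recognizer often confuses are made cheaper than substitutions between unrelated phones. The phone symbols, the probabilities in CONFUSION, and the names sub_cost and weighted_levenshtein are all assumptions made for this example and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical confusion probabilities P(recognized phone | reference phone).
# The phones and values are invented for illustration, not measured data.
CONFUSION: Dict[Tuple[str, str], float] = {
    ("a", "e"): 0.30,
    ("e", "a"): 0.30,
    ("t", "d"): 0.40,
    ("d", "t"): 0.40,
}


def sub_cost(ref: str, hyp: str) -> float:
    """Substitution cost: 0 for a match, cheaper for easily confused phones."""
    if ref == hyp:
        return 0.0
    # Assumed cost rule: the more often two phones are confused, the lower the cost.
    return 1.0 - CONFUSION.get((ref, hyp), 0.0)


def weighted_levenshtein(ref: List[str], hyp: List[str], indel: float = 1.0) -> float:
    """Edit distance between two phone sequences with confusion-aware substitutions."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel  # delete every reference phone
    for j in range(1, cols):
        dist[0][j] = j * indel  # insert every hypothesis phone
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel,                               # deletion
                dist[i][j - 1] + indel,                               # insertion
                dist[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),  # match or substitution
            )
    return dist[rows - 1][cols - 1]
```

With these assumed values, a reference sequence ["t", "a"] lies closer to a recognized ["d", "a"] than to ["k", "a"], because t/d is listed as a frequent confusion while t/k is not. The same cost function could in principle serve as the scoring term in a Needleman-Wunsch alignment.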
Keywords/Search Tags: Speech-text alignment, Audio segmentation, Language model, Phone alignment, Speech recognition