Research And Application Of Speech Recognition Based On Syllable Modeling

Posted on: 2022-12-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Zhuo
Full Text: PDF
GTID: 2518306749972059
Subject: Computer Software and Application of Computer
Abstract/Summary:
With the development of the Internet economy, applications such as audiobooks and podcasts have entered people's daily lives, and the demand for efficient recognition and understanding of online speech content continues to grow. To reduce the dependence on training data, Hanyu Pinyin is adopted as an intermediate representation between the speech input and the text output, splitting the recognition process into two stages: first converting acoustic features into Hanyu Pinyin (the acoustic model stage), and then converting Hanyu Pinyin into the expected text (the language model stage). The main work of this thesis is as follows:

(1) In the acoustic model, the connectionist temporal classification (CTC) loss function produces a peaky distribution when predicting non-blank labels, and the predicted locations often deviate significantly from the true positions, which tends to impair performance. Since Mandarin Chinese is a syllabic language in which each syllable is pronounced with approximately equal duration, an equal-interval prior is introduced into the CTC loss to address the peaky distribution. This prior largely limits the CTC path search range, reduces the computational cost, and improves the performance of the acoustic model. Acoustic models based on DFSMN networks were compared on the speech-to-pinyin conversion task. The results verify that, compared with CTC, the equal-interval-prior-based Es CTC algorithm has a positive effect on the acoustic model. The character error rate (CER) on the AISHELL-1 dataset is reduced by 3.76% compared with the DFCNN baseline model.

(2) In the language model, the intermediate Hanyu Pinyin results need to be converted into the expected text. However, the acoustic model outputs various types of errors, which can reduce the accuracy of the conversion. Since the semantic information of Hanyu Pinyin lies in the combinatorial relationships of adjacent tokens, enhancing the modeling of local context can improve both error correction for Hanyu Pinyin sequences and the quality of the text results. A local semantic enhancement method based on a Gaussian distribution is therefore introduced into the self-attention network (SAN), and two submodels, one for pinyin error correction and one for pinyin-to-Chinese conversion, are designed and combined in cascade. The CER is reduced by 3.0% compared with the Transformer baseline model.

(3) Based on the above results, a Uyghur speech recognition system and a subsequent Uyghur-to-Chinese machine translation system were designed and built. On the THUYG-20 and CWMT 2017 test sets, respectively, the WER for Uyghur speech recognition is reduced by 11.61% compared with the THUYG-20 baseline, and the BLEU score for Uyghur-to-Chinese translation is increased by 5.84 compared with the standard Transformer baseline, which shows that the above methods can be extended to different languages.
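To illustrate the Gaussian-based local enhancement mentioned in (2), the following is a minimal sketch and not the thesis's actual implementation. It assumes one common formulation, in which a fixed-width Gaussian bias centered on the query position is added to the attention logits before the softmax. The function name `gaussian_local_attention` and the parameter `sigma` are hypothetical and chosen for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_local_attention(scores, sigma=1.0):
    """Add a Gaussian locality bias to raw attention logits.

    scores[i][j] is the raw logit of query position i attending to key
    position j. The bias -(j - i)^2 / (2 * sigma^2) is an assumed form:
    it is zero at j == i and grows more negative with distance, so
    nearby tokens receive more weight after the softmax.
    """
    weights = []
    for i, row in enumerate(scores):
        biased = [
            s - ((j - i) ** 2) / (2.0 * sigma ** 2)
            for j, s in enumerate(row)
        ]
        weights.append(softmax(biased))
    return weights


# With uniform logits, the resulting weights depend only on the bias,
# so each row peaks at its own position and decays symmetrically.
uniform_scores = [[0.0] * 5 for _ in range(5)]
weights = gaussian_local_attention(uniform_scores, sigma=1.0)
```

A smaller `sigma` concentrates attention on the immediately adjacent pinyin tokens, while a larger one approaches ordinary self-attention; how the thesis sets or learns this width is not stated in the abstract.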
Keywords/Search Tags: Mandarin speech recognition, Hanyu Pinyin, Equal spacing, Local semantic enhancement