
Research On End To End Uyghur Speech Recognition Technology

Posted on: 2022-04-12  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zhang  Full Text: PDF
GTID: 2518306539498364  Subject: Engineering
Abstract/Summary:
Speech is the most direct carrier for conveying information and expressing emotion. Automatic speech recognition (ASR) enables spoken communication between humans and machines by converting an audio signal sequence into a text sequence. ASR is now widely used in fields such as in-vehicle voice systems and intelligent voice customer service, bringing great convenience to society. There are two main approaches to speech recognition modeling: one based on hidden Markov models and one based on end-to-end technology. End-to-end speech recognition is the current research hotspot: it requires neither an additional pronunciation dictionary nor forced alignment, and its structure is more direct. Uyghur, as a low-resource language, not only has scarce annotated audio but also suffers from low corpus quality, which makes it difficult for an end-to-end speech recognition system to achieve good results. This thesis focuses on end-to-end Uyghur speech recognition; the main work and innovations are as follows.

First, this thesis explores the influence of different recognition units on the end-to-end Uyghur speech recognition system. A CTC-based end-to-end ASR system with a BLSTM encoder is deployed on the Uyghur corpus THUYG-20. Experiments show that the best recognition results are obtained when the 33 characters of the Uyghur phoneme set are used as recognition units.

Second, in end-to-end speech recognition, acoustic features are extracted frame by frame from the audio signal as the input of the acoustic encoding network, which loses contextual information to a certain extent. To address this, the thesis applies a hybrid dilated CNN (HDC) to further sample the acoustic features. Compared with a classical convolutional network (CNN), HDC widens the perception of contextual information; compared with a plain dilated CNN, HDC reduces the loss of temporal information. A CTC-based end-to-end system with an HDC-Conformer encoding network is built on THUYG-20. Without a language model, the word error rate (WER) decreases by 1.1% relative to CNN-Conformer (K=3) and by 0.8% relative to CNN-Conformer (K=5). With a language model, WER decreases by 13.6% relative to BLSTM.

Third, when small-granularity units (characters, phonemes) are used as modeling units, it is hard for the end-to-end system to capture contextual dependencies well, which leads to relatively high WER in the recognition results. This thesis proposes a method that combines a large-granularity (word-level) language model with the minimum edit distance to rescore and correct the recognition results. Applied to the CTC-based systems with BLSTM, CNN-Conformer, and HDC-Conformer encoders, this method reduces WER by 5.7%, 6.3%, and 9.1%, respectively.
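As a rough illustration of why mixed dilation rates widen the context window, the receptive field of a stack of stride-1 convolutions can be computed directly. This is a minimal sketch; the kernel sizes and dilation rates below are illustrative assumptions, not the configuration used in the thesis.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in input frames) of stacked stride-1 1-D convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer adds (k-1)*d frames of context
    return rf

# Classical CNN: three 3-tap layers, dilation 1 everywhere -> 7 frames.
plain = receptive_field([3, 3, 3], [1, 1, 1])
# Hybrid dilated CNN: mixed rates such as [1, 2, 5] widen the window to
# 17 frames while avoiding the "gridding" holes of one fixed large rate.
hdc = receptive_field([3, 3, 3], [1, 2, 5])
print(plain, hdc)  # -> 7 17
```

With the same depth and kernel size, the hybrid stack sees more than twice the acoustic context, which is the effect the HDC front end exploits.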
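The rescoring idea can be sketched with a plain Levenshtein distance that snaps each recognized word to its nearest entry in a word list. This is only a minimal sketch under simplifying assumptions: the toy vocabulary is hypothetical, and the thesis's actual method additionally weighs candidates with the word-level language model rather than using edit distance alone.

```python
def edit_distance(a, b):
    """Minimum edit (Levenshtein) distance between two strings."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[-1]

def correct(word, vocab):
    """Snap a recognized word to the closest vocabulary entry."""
    return min(vocab, key=lambda w: edit_distance(word, w))

vocab = {"salam", "yaxshi", "rehmet"}   # toy word list (hypothetical)
print(correct("salaam", vocab))          # -> salam
```

In the full system, candidates within a small edit-distance radius would be rescored by the word-level language model before the corrected hypothesis is emitted.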
Keywords/Search Tags: End-to-End, Uyghur, Speech Recognition, Acoustic Model, Language Model