
The Design Of Automatic Video Subtitle Generation System Based On Speech Recognition

Posted on: 2022-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wan
GTID: 2518306572497624
Subject: Computer technology
Abstract/Summary:
In the Internet era, video is one of the most important data carriers. For non-native speakers and people with hearing impairments, subtitles make video content much easier to understand. With the rapid development of artificial intelligence, automatic generation of video subtitles has become possible, and the key enabling technology is automatic speech recognition. Traditional speech recognition systems have a complex structure comprising an acoustic model, a pronunciation model, and a language model. Each module must be trained and optimized separately, which makes global training and optimization difficult. This thesis therefore focuses on the more promising end-to-end approach.

Among end-to-end speech recognition models, the attention-based encoder-decoder model performs best. It relies on the memory capacity of the neural network to learn the mapping between input and output. However, this model struggles to learn the alignment between sequences when training data is insufficient. The speech recognition model in this thesis is therefore based on both connectionist temporal classification (CTC) and the attention mechanism, and uses the alignment ability of CTC to assist the decoding of the attention decoder. In addition, the encoder and decoder are built from basic units based on the self-attention mechanism, which offer stronger temporal feature extraction and faster computation than recurrent neural networks. To suit the characteristics of speech recognition, two-dimensional and one-dimensional convolution units are used for dimension adjustment and position encoding at the inputs of the encoder and decoder, respectively. Furthermore, to strengthen the language modeling ability of the model, a data augmentation method based on word masking is proposed, inspired by other data augmentation methods. During training, some spectral features are masked at the word level to improve the language modeling ability of the model. With this method, after preliminary training on the LibriSpeech clean-100 data set, the word error rates on the Dev-other and Test-other sets are reduced by 2.4% and 2.5%, respectively. Based on the above scheme, the final model achieves a word error rate of only 10.3% on a comprehensive data set that includes a self-made data set.

Building on these results and targeting the video subtitle generation scenario, this thesis designs and implements an automatic video subtitle generation system based on speech recognition, and uses a variety of distributed middleware to build a computing cluster that completes subtitle generation tasks efficiently. The system further verifies the practicality of the proposed model and improvement strategies.
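The word-masking augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how words are located in the spectrogram or what value replaces the masked frames, so the following assumes that per-word frame spans are already available (for example from a forced alignment) and that masked frames are overwritten with a constant. The names `mask_words`, `word_spans`, `mask_prob`, and `mask_value` are hypothetical and are not taken from the thesis.

```python
import random
from typing import List, Optional, Sequence, Tuple

Frame = List[float]  # one spectral feature vector (e.g. one filterbank frame)


def mask_words(
    features: Sequence[Frame],
    word_spans: Sequence[Tuple[int, int]],
    mask_prob: float = 0.15,
    mask_value: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[Frame]:
    """Return a copy of `features` with the frames of randomly chosen words masked.

    Assumptions (not from the thesis): `word_spans` holds half-open
    (start_frame, end_frame) ranges, one per word, and a masked frame is
    filled entirely with `mask_value`.
    """
    rng = rng or random.Random()
    # Copy every frame so the caller's features are left untouched.
    masked = [list(frame) for frame in features]
    for start, end in word_spans:
        # Each word is masked independently with probability `mask_prob`.
        if rng.random() < mask_prob:
            # Clamp the span to the valid frame range before overwriting.
            for t in range(max(start, 0), min(end, len(masked))):
                masked[t] = [mask_value] * len(masked[t])
    return masked
```

Masking whole word spans, rather than arbitrary time blocks, removes the acoustic evidence for a complete word, so the model has to rely on the surrounding context to predict it. This is one plausible reading of how such masking could strengthen language modeling; the exact procedure in the thesis may differ.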
Keywords/Search Tags: Automatic Video Subtitle Generation, Automatic speech recognition, Based on CTC and attention mechanism, Word mask