
Research On End-to-end Speech Translation Of Low-resource Languages Based On Transfer Learning

Posted on: 2024-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: N Li
GTID: 2568306926984759
Subject: Computer Science and Technology
Abstract/Summary:
This paper provides a detailed guide to conducting end-to-end speech translation research from scratch. It covers building datasets, applying transfer learning, and implementing hybrid decoding to improve the effectiveness of speech translation. Automatic speech translation is a technology that converts speech in one language into text in another language. There are two main approaches: cascading and end-to-end. In recent years, cascading speech translation has run into development bottlenecks caused by error propagation between its components, and end-to-end speech translation has therefore become the mainstream research direction. End-to-end speech translation translates speech in the source language directly into text in the target language. However, it faces several challenges, chief among them data scarcity, particularly for Chinese and low-resource languages. The lack of speech translation data makes related research difficult, and existing models cannot exploit scarce data well enough to converge, yielding poor results when data is limited. This paper takes Uyghur speech-to-Chinese text (referred to as "Uyghur-Chinese") and Mongolian speech-to-Chinese text (referred to as "Mongolian-Chinese") translation as examples to study end-to-end speech translation for low-resource languages. The study focuses on three aspects:

1. To address the shortage of domestic speech translation datasets, this paper introduces two methods for constructing them. The first combines existing machine translation technology with manual correction to build datasets quickly. The second uses web-crawler technology to collect audio, segments it, and annotates it manually. During construction, experts were invited to validate the data, improving dataset quality while maximizing automation. As a result, a 20-hour Uyghur-Chinese speech translation dataset and a 2-hour Mongolian-Chinese speech translation dataset were constructed.

2. To overcome data scarcity and high training difficulty, this paper applies transfer learning from speech recognition in the target language. An adapter structure is added to form a mapping mechanism between the source and target languages, narrowing the gap between them and improving translation quality, so that the limited data can be used in full. Using the Chinese speech recognition dataset AISHELL, transfer learning on the target language is studied with Transformer and Conformer models: the decoder first learns Chinese-language information, and an adapter then builds the mapping from the source language to Chinese. Compared with other end-to-end speech translation methods, this approach has the advantages of a simple model and full convergence under data scarcity, making it an effective method for end-to-end speech translation of low-resource languages.

3. To address the misalignment between speech and text in the Transformer attention mechanism, this paper introduces connectionist temporal classification (CTC) to build a hybrid CTC/Attention model, which is decoded jointly to resolve the misalignment. Introducing CTC into the Transformer and Conformer models also helps force alignment and mitigates phoneme-boundary and frame-looping problems in speech translation.
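The adapter idea in point 2 can be sketched as a small bottleneck module inserted between a pretrained encoder and decoder. This is a minimal PyTorch sketch, not the thesis's actual architecture: the residual bottleneck design and the dimensions (256 hidden, 64 bottleneck) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter that maps source-language encoder states toward the
    target-language (Chinese) representation space learned during ASR
    pretraining. Dimensions are illustrative, not the thesis's configuration."""

    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)    # project back up
        self.norm = nn.LayerNorm(dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the pretrained representation while
        # the bottleneck learns the source-to-target mapping.
        return self.norm(h + self.up(self.act(self.down(h))))
```

In use, the adapter would sit between the frozen (or partly frozen) speech encoder and the Chinese-pretrained decoder, so only a small number of new parameters must converge on the scarce translation data.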
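The hybrid CTC/Attention training objective in point 3 is commonly realized as a weighted sum of the CTC loss and the attention decoder's cross-entropy loss. The sketch below assumes PyTorch; the weight `alpha = 0.3` and the tensor shapes are conventional assumptions, not values taken from the thesis.

```python
import torch
import torch.nn.functional as F


def hybrid_ctc_attention_loss(ctc_log_probs: torch.Tensor,
                              att_logits: torch.Tensor,
                              targets: torch.Tensor,
                              input_lengths: torch.Tensor,
                              target_lengths: torch.Tensor,
                              alpha: float = 0.3) -> torch.Tensor:
    """Joint loss: alpha * CTC + (1 - alpha) * attention cross-entropy.

    ctc_log_probs: (T, N, C) log-softmax outputs of the CTC branch
    att_logits:    (N, S, C) raw logits of the attention decoder
    targets:       (N, S) target token ids (blank/pad assumed to be 0 / -1)
    """
    # CTC term enforces monotonic alignment between speech frames and tokens.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Attention term: cross_entropy expects (N, C, S), hence the transpose.
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=-1)
    return alpha * ctc + (1 - alpha) * att
```

At inference time, joint decoding analogously combines the two branches' scores during beam search, which suppresses the attention decoder's tendency to skip or repeat frames.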
Keywords/Search Tags: Speech Translation, Dataset Construction, Transfer Learning, Joint Decoding, End-to-End