A Study Of Punctuation Prediction Model By Considering Domain Information

Posted on: 2021-04-10 | Degree: Master | Type: Thesis
Country: China | Candidate: W Q Wei | Full Text: PDF
GTID: 2518306548985859 | Subject: Computer technology
Abstract/Summary:
As the performance of automatic speech recognition (ASR) has improved, the technology has become increasingly common in industry and daily life. It is widely used in smart home systems, conversation transcription, speech dictation, and simultaneous interpretation, bringing great convenience to people's lives. In most cases, speech signals are transcribed into text and then analyzed, so the quality of the transcribed text directly affects the performance of subsequent tasks. However, most ASR systems generate only a sequence of words without any punctuation. Punctuation marks pauses in a sentence and often emphasizes certain words or phrases, which helps convey the meaning of the sentence. Its absence confuses both human readers and off-the-shelf natural language processing algorithms, so adding punctuation to text automatically is an important problem.

There have been many studies on automatic punctuation prediction. Before deep learning became the trend, the main methods were hand-crafted rules. As data size increased, statistical methods became mainstream, such as training an N-gram language model on punctuated text, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields (CRFs). With the development of deep learning, many researchers began applying it to punctuation prediction. Human communication usually spans many domains, and each domain has its own lexicon and writing style, so considering domain information should help punctuation prediction. Previous studies mainly use acoustic and textual features, such as part-of-speech tags, word vectors, pause durations between words, and pitch. However, few studies consider the characteristics of different domains.

For these reasons, our experiments use the THUCNews corpus, which was collected from the Sina News RSS subscription channels between 2005 and 2011 and contains text on science, technology, current politics, games, household, and other domains. In this thesis, we first verify that domain information is useful for punctuation prediction and then propose two approaches that take domain information into account.

To verify the usefulness of domain information, we use a Bidirectional Long Short-Term Memory (BiLSTM) model as the baseline, consisting of one BiLSTM layer and one time-distributed dense layer. We trained the model on two training sets of similar size, one containing text from a single domain and the other containing text from multiple domains, and tested both on the same test set. The model trained on the multi-domain set performed better, which indicates that domain information is useful for punctuation prediction.

We use domain information in two ways. The first converts domain tags into one-hot encodings and combines them with the word vectors as the input to the punctuation prediction model. The second uses multitask learning (MTL) to incorporate domain information implicitly: because the other task provides additional evidence about which features are useful, the model can focus on the features that are truly relevant. We propose an MTL-based punctuation prediction model with two tasks, punctuation prediction and domain classification. The two tasks train the parameters of a shared layer jointly, so that the punctuation prediction task can obtain domain information from the domain classification task. Although these experiments support our conjecture, they did not make full use of domain information, and much work remains to be done.
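
The sequence labeling view of punctuation prediction and the first approach (a one-hot domain tag combined with each word vector) can be illustrated with a small sketch. This is not the thesis code: the label names, the `DOMAINS` list, the function names, and the toy English tokens and two-dimensional word vectors are all assumptions chosen for illustration, and the thesis does not state how the two vectors are combined, so plain concatenation is assumed here.

```python
# Illustrative sketch only; all names and toy data are assumptions,
# not taken from the thesis.

PUNCTUATION = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
DOMAINS = ["science", "politics", "game", "household"]  # hypothetical subset


def to_sequence_labels(tokens):
    """Turn a punctuated token list into (word, label) pairs.

    Each word is labeled with the punctuation mark that follows it,
    or "O" when no mark follows.
    """
    pairs = []
    for token in tokens:
        if token in PUNCTUATION:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCTUATION[token])
        else:
            pairs.append((token, "O"))
    return pairs


def one_hot_domain(domain):
    """Encode a domain tag as a one-hot vector over DOMAINS."""
    return [1.0 if d == domain else 0.0 for d in DOMAINS]


def add_domain_feature(word_vectors, domain):
    """Concatenate the same one-hot domain vector to every word vector."""
    tag = one_hot_domain(domain)
    return [vec + tag for vec in word_vectors]


# Usage: build training labels, then augment toy word vectors.
pairs = to_sequence_labels(["the", "model", "works", ",", "mostly", "."])
features = add_domain_feature([[0.1, 0.2], [0.3, 0.4]], "politics")
```

In this sketch every word in a document receives the same domain vector, so the model input width grows by the number of domains. The multitask approach avoids this explicit input by adding a domain classification output on top of the shared layer.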
Keywords/Search Tags:Punctuation Prediction, Multitask, Domain Information