Research On Text Semantic Enhancement And Short Text Classification Method Based On Topic Model

Posted on: 2022-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: H M Wei
GTID: 2518306608490184
Subject: Automation Technology
Abstract/Summary:
In recent years, with the development of deep learning and pre-training technology, research in natural language processing has achieved excellent results. Text representation and short text classification have an important impact on natural language processing tasks such as automatic translation, text summarization, and sentiment analysis. Because natural language is complex and diverse, current research on text semantic representation still faces difficulties such as the curse of dimensionality, highly sparse vectors, and shallow semantics, so text vectors cannot fully express semantic information. At the same time, short texts contain little data and have sparse features, which makes them difficult to classify. Text semantic representation and short text classification therefore remain both a focus and a difficulty of current research. This thesis combines topic models, word embeddings, and text classification methods to carry out the following research:

(1) A new text semantic representation method, the Sem2vec (Semantic to vector) model, is proposed, combining the LDA topic model with the Word2vec model. The Sem2vec model adds a topic embedding layer in front of the input layer of the Word2vec model. First, topic similarity is calculated from the word-topic distribution obtained by the LDA model. Then, topic semantic word vectors are fed into the Sem2vec model in place of one-hot vectors. The parameters of the Sem2vec model are optimized by maximizing a log-likelihood objective function. Finally, the Sem2vec model learns semantic word vectors, from which the semantic representation of the text is obtained. To verify the effectiveness of the Sem2vec model, experiments are carried out on the Sogou, THUCNews, and 20 Newsgroups datasets and compared with classic models. The results on two tasks, semantic similarity and text classification, show that, compared with the classic models, the Sem2vec model calculates semantic similarity more accurately, improves the classification results of TextCNN, BiLSTM, and Transformer by 0.58%-3.5%, and also improves time performance.

(2) A supervised biterm topic model, SBTM (Supervised Biterm Topic Model), is proposed for short text classification on the basis of the BTM topic model. SBTM extends BTM with a topic-category distribution parameter that identifies the semantic relationship between topics and categories; topics are then mapped accurately to categories to complete the topic classification of documents. Through topic classification, the word-topic probability can be calculated more accurately, which in turn makes short text classification more accurate. To verify the effectiveness of the SBTM model, experiments are carried out on Sogou news headlines, THUCNews headlines, and Amazon short-review datasets and compared with classic models. The experimental results show that, compared with the classic models, the SBTM model establishes an accurate mapping between topics and categories and improves short text classification results by 1.3%-10.2%.
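The abstract does not give the SBTM equations, so the following is only a minimal illustrative sketch of two ideas it mentions. `extract_biterms` shows the standard BTM notion of a biterm (an unordered pair of words co-occurring in one short text). `classify_by_topics` shows one plausible way a topic-category distribution could map a document's topics to a category, by scoring each category c as the sum over topics z of P(c|z) * P(z|d). The function names, the input layout, and the scoring rule are assumptions made for illustration, not the thesis's actual formulation.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one short text."""
    # Sort each pair so ("b", "a") and ("a", "b") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def classify_by_topics(doc_topic, topic_category):
    """Pick the category with the highest sum_z P(c|z) * P(z|d).

    doc_topic: list where doc_topic[z] = P(z|d) (hypothetical input).
    topic_category: list of dicts where topic_category[z][c] = P(c|z)
        (hypothetical stand-in for a topic-category distribution).
    """
    scores = Counter()
    for p_topic, category_dist in zip(doc_topic, topic_category):
        for category, p_category in category_dist.items():
            scores[category] += p_topic * p_category
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Toy example with made-up probabilities.
biterms = extract_biterms(["apple", "phone", "price"])
label, scores = classify_by_topics(
    [0.7, 0.3],
    [{"tech": 0.9, "finance": 0.1}, {"tech": 0.2, "finance": 0.8}],
)
print(biterms)
print(label, scores)
```

In a real pipeline the two probability tables would come from a trained model rather than hand-written values; the toy numbers here only demonstrate the data flow.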
Keywords/Search Tags: Text representation, Word2vec model, Topic model, Semantic enhancement, Short text classification