
Research On Short Text Classification Based On Semantic Extension

Posted on: 2020-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Li
Full Text: PDF
GTID: 2428330602952145
Subject: Information Science
Abstract/Summary:
The rapid development of the Internet has accelerated the progress of the information age. Short texts are a simple and efficient form of expression on social networking sites, appearing as Weibo posts, news headlines, product reviews, BBS posts, WeChat Moments, and so on, and it is becoming increasingly difficult to extract useful information from these massive text resources. Because short texts are sparse, immediate, massive in volume, and irregular, traditional classification methods still extract too little semantic information from the text and suffer from serious data sparseness. Introducing an external knowledge base to extend the semantic information of short texts is currently a hot research direction. How to obtain multi-layer semantic representations of a text and eliminate the influence of irrelevant terms has become an important research question in short text classification.

In view of these problems, and drawing on existing research results, this thesis introduces the idea of semantic feature expansion. It uses the Probase semantic network as an external knowledge base to expand short texts through word conceptualization and the addition of semantic co-occurrence words, so that the information implied in a short text is better expressed and ambiguity is reduced. A Word2vec model is then used to train word vectors over this semantic information, which addresses data sparsity and the lack of semantic relations between words. On the basis of the traditional classification model, a short text classification method based on semantic extension is proposed.

First, this thesis analyzes the characteristics of short texts and traditional short text classification techniques, points out the shortcomings of traditional short text classification models, and identifies the advantages of the Probase knowledge base over other knowledge bases for extending the semantic information of short texts. Second, for each word in a short text, the concept words and co-occurrence words that fit the context are derived and added to the text as the semantic information of that word; at the same time, the most representative match is selected according to the concepts of the context, and ambiguous terms are removed. The Probase semantic network and Word2vec word vectors are combined to represent the feature vectors of the text, which not only enriches the semantic information of the short text but also accurately represents the relationships between words and the contextual structure. Third, the traditional classification model is optimized in steps such as text preprocessing and text representation. A conceptualized short text classification method based on the Word2vec model is adopted to address the high dimensionality and sparsity of text feature vectors in the traditional classification model, yielding high-quality vector representations of semantic feature words. Finally, after comparing existing classification methods, the LIBSVM algorithm is selected for short text classification, and the proposed semantic-extension method is compared with the traditional classification method. The experimental results show that the proposed method achieves better classification results.
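
The conceptualization step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: `TOY_CONCEPTS` is a hypothetical, hand-made stand-in for Probase with invented scores, and the scoring rule (a word's own concept score weighted by how strongly the other words in the text support the same concept) is an assumption chosen only to show how context can select one concept for an ambiguous word. The Word2vec training and LIBSVM classification stages are not covered.

```python
from collections import Counter

# Hypothetical toy stand-in for Probase: term -> {concept: score}.
# The entries and scores are invented for illustration only.
TOY_CONCEPTS = {
    "apple": {"fruit": 0.6, "company": 0.4},
    "banana": {"fruit": 0.9, "crop": 0.1},
    "microsoft": {"company": 0.95, "brand": 0.05},
}


def conceptualize(tokens, concepts=TOY_CONCEPTS):
    """Choose one concept per known token, favoring concepts the context supports."""
    chosen = {}
    for i, token in enumerate(tokens):
        candidates = concepts.get(token)
        if not candidates:
            continue  # token is not in the knowledge base; nothing to add
        # Context support: total score the other tokens give to each concept.
        support = Counter()
        for j, other in enumerate(tokens):
            if j != i:
                for concept, score in concepts.get(other, {}).items():
                    support[concept] += score
        # Assumed scoring rule: own score boosted by context support.
        chosen[token] = max(
            candidates, key=lambda c: candidates[c] * (1.0 + support[c])
        )
    return chosen


def expand(text, concepts=TOY_CONCEPTS):
    """Return the original tokens followed by the concept chosen for each known token."""
    tokens = text.lower().split()
    chosen = conceptualize(tokens, concepts)
    return tokens + [chosen[t] for t in tokens if t in chosen]


print(expand("apple banana"))     # "apple" resolves to "fruit"
print(expand("apple microsoft"))  # "apple" resolves to "company"
```

In this toy setting the same word receives different concepts depending on its neighbors, which is the disambiguation effect the abstract attributes to context-aware conceptualization. The expanded token lists would then be the input to word-vector training and classification.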
Keywords/Search Tags:short text, Probase, Word2vec, Feature extension