Research On Text Classification Based On Word2vec Word Vector

Posted on: 2018-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhu
GTID: 2348330536973557
Subject: Computer system architecture
Abstract/Summary:
Automatic text classification is one of the key technologies in text mining and greatly facilitates information retrieval and data management. In recent years, with the rapid development of Internet technology, the volume of text data, such as microblog posts, news content, e-mail messages, and forum and blog posts, has grown rapidly every day. Automatic text classification is an effective tool for handling and organizing this data and has been applied to tasks such as microblog sentiment classification, spam filtering, and automatic distribution of news content. As text data continues to grow, automatic text classification will play an increasingly important role in these areas.

Automatic text classification involves several techniques: text preprocessing, text representation, feature selection, feature extraction, and classification algorithms. Among them, text representation and the classification algorithm are decisive, because they directly affect the classification results. Current research on text classification focuses mainly on feature selection, feature extraction, text representation, and the optimization of classification algorithms. Among the many text representation models, the vector space model (VSM) weighted with term frequency-inverse document frequency (TF-IDF), referred to here as the VSM_TFIDF model, is a mainstream choice that performs well in both industry and academia. However, this model cannot represent the semantic information of a text well, and it does not reflect the contextual semantics or syntactic information of a feature word. Moreover, the similarity measures commonly used in text classification, such as Euclidean distance and cosine distance, do not measure the similarity between texts represented by such models well.

To address these problems, we use Word2vec word vectors to introduce semantic information into the text representation model or the similarity measure, so as to improve text classification. We first study how Word2vec word vectors are generated, including the two models (CBOW and Skip-gram) and the two optimization schemes that improve training efficiency (Hierarchical Softmax and Negative Sampling). We then build on Word2vec word vectors to address the problems above. Specifically, the main research work of the paper covers the following two aspects:

(1) We propose a multi-granularity, multi-model text representation based on Word2vec word vectors and the VSM_TFIDF model, called CombineTextVector. Because Word2vec word vectors express the semantic information of feature words well, we combine them with the VSM_TFIDF model to improve text representation. We first study the TF-IDF weighting formula, identify a shortcoming in its ability to distinguish between classes, and improve it. We then combine the improved weighting with Word2vec word vectors to construct a multi-granularity text representation model, called Word2vec_wTFIDF. Finally, we combine Word2vec_wTFIDF with the traditional VSM_TFIDF model to construct a new text representation model, CombineTextVector. To verify the new model, we ran experiments on the Fudan Chinese text categorization corpus; compared with mainstream text representation models, it obtained higher F1 scores, which supports the validity of the model.

(2) We propose a new distance measure for topic models based on Word2vec word vectors and the Earth Mover's Distance (EMD), called TopMD. We first analyze the distance measures commonly used with the traditional VSM_TFIDF model and with topic models. Because these measures do not capture the semantic similarity between texts well, we combine EMD with Word2vec word vectors to construct a new distance measure for topic models, TopMD. Compared with the commonly used measures, the new method takes into account more of the semantic similarity between feature words. To verify its effectiveness, we conducted experiments on Chinese and English corpora, comparing it against a variety of distance measures. The experimental results show that the method improves text similarity measurement in topic models compared with traditional methods.
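
The idea behind contribution (1), weighting word vectors by TF-IDF and combining the result with the ordinary VSM_TFIDF vector, can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses the standard TF-IDF formula rather than the improved class-aware weighting, the word vectors in `TOY_VECTORS` are made-up 2-dimensional values standing in for a trained Word2vec model, and combining the two representations by concatenation is an assumption. All function names are hypothetical.

```python
import math
from collections import Counter

# Made-up 2-dimensional word vectors for illustration only; real vectors
# would come from a trained Word2vec model.
TOY_VECTORS = {
    "stock": [0.9, 0.1],
    "market": [0.8, 0.2],
    "match": [0.1, 0.9],
    "goal": [0.2, 0.8],
}


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Uses the standard tf * log(N / df) formula, not the thesis's improved one.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_doc_vector(weights, vectors, dim):
    """TF-IDF-weighted average of the word vectors of one document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, w in weights.items():
        if term in vectors:
            for i, x in enumerate(vectors[term]):
                total[i] += w * x
            weight_sum += w
    if weight_sum == 0.0:
        return total  # no weighted terms with a known vector
    return [x / weight_sum for x in total]


def combined_vector(weights, vectors, dim, vocabulary):
    """Concatenate the TF-IDF vector with the weighted word-vector average.

    Concatenation is an assumed way of combining the two representations.
    """
    vsm_part = [weights.get(term, 0.0) for term in vocabulary]
    return vsm_part + weighted_doc_vector(weights, vectors, dim)


docs = [["stock", "market", "stock"], ["match", "goal"]]
vocabulary = sorted({term for doc in docs for term in doc})
all_weights = tfidf_weights(docs)
combined = [combined_vector(w, TOY_VECTORS, 2, vocabulary) for w in all_weights]
```

Each resulting vector has one TF-IDF entry per vocabulary term followed by the dense semantic part, so a downstream classifier sees both exact term evidence and word-vector similarity.
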
Keywords/Search Tags: Word2vec model, text representation, text classification, distance measurement