Research On Text Classification Based On Word Vector

Posted on: 2020-01-07; Degree: Master; Type: Thesis
Country: China; Candidate: S S Xu; Full Text: PDF
GTID: 2428330575969934; Subject: Software engineering
Abstract/Summary:
With the continuous development of Internet technology, the Internet has become an important channel through which people obtain information. People both retrieve information from the Internet and constantly upload new information to it, so the volume of text data online keeps growing, and text classification is critical for obtaining accurate and useful information quickly. Social platforms and news channels publish a wide variety of content, and only effective classification of that content can meet the needs of most Internet users. How to classify texts reasonably and effectively is therefore an important problem.

Text representation, which turns text data into a form a computer can process, is one of the key steps in text classification. Distributed representation (word vectors) is among the most widely used and classic text representation methods: it maps each word to a short, low-dimensional, dense, real-valued vector and avoids the curse of dimensionality of traditional text representation models. A word vector contains many non-zero real values, each dimension has a specific meaning, and the information carried by a word is distributed across all components, so the vector carries more information. As a result, the syntactic and semantic similarity between words can be measured by the cosine similarity between their vectors, which accurately represents the semantic relatedness of related or similar words.

The focus of this paper is the word vector. First, the author proposes a max-CBOW model for training word vectors (word embeddings). Then an optimized TF-IDF algorithm that takes category information into account is used to weight the word vector of each word in a text, producing the text vector representation. Finally, a better text classification framework based on word vectors is obtained. The main research contents are as follows.

First, this paper studies and improves the word vector generation model. The classic word2vec word vector generation tool is studied. In the original CBOW model, the mapping from the input layer to the projection layer is a simple accumulation of the context word vectors. The author instead uses the most prominent information to represent the semantics: the maximum value in each corresponding dimension of the context word vectors is taken as the value of the projection-layer vector. Based on the CBOW model, the max Continuous Bag-of-Words (max-CBOW) model is proposed for word vector training.

Second, this paper studies and improves the existing text vector representation model. It focuses on text representation models built by combining word vectors and proposes the text representation model CT_CBOW, which uses the TF-IDF_C algorithm, a TF-IDF variant that considers category information, to weight the word vector of each word in the text. A text classification framework based on word vectors is proposed at the same time.

Third, this paper uses information on 100,000 companies crawled from the tianyancha website as an experimental data set. This data set and the THUNews data set are used in comparative experiments to verify the validity of the proposed word-vector-based text classification method.
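The projection and weighting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy vectors, and the choice to normalise the text vector by the sum of weights are all assumptions made for illustration. The abstract does not define the TF-IDF_C formula, so the per-word weights are simply passed in as a dictionary.

```python
import math


def sum_projection(context_vectors):
    """Original CBOW-style projection: element-wise sum of the context vectors."""
    return [sum(dims) for dims in zip(*context_vectors)]


def max_projection(context_vectors):
    """Max-style projection: element-wise maximum over the context vectors."""
    return [max(dims) for dims in zip(*context_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_text_vector(tokens, embeddings, weights):
    """Weighted average of word vectors.

    `weights` stands in for TF-IDF_C scores (hypothetical values; the
    category-aware formula is not given in the abstract). Dividing by the
    weight total is an assumption, not something the abstract states.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary words
        w = weights.get(token, 0.0)
        weight_sum += w
        for i, x in enumerate(embeddings[token]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy context window of three 3-dimensional word vectors.
context = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0], [-0.5, 3.0, 1.0]]
summed = sum_projection(context)
maxed = max_projection(context)

# Toy 2-dimensional embeddings and illustrative weights.
toy_embeddings = {"bank": [1.0, 0.0], "loan": [0.0, 1.0]}
toy_weights = {"bank": 3.0, "loan": 1.0}
text_vec = weighted_text_vector(["bank", "loan", "unknown"], toy_embeddings, toy_weights)
```

In this sketch the only difference between the two projections is the reduction applied per dimension: summation keeps contributions from every context word, while the maximum keeps only the largest value in each dimension.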
Keywords/Search Tags:word2vec, word vector, text representation, text classification, TF-IDF