Research On Text Classification Based On Word Vector

Posted on:2020-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:S S Xu

GTID:2428330575969934

Subject:Software engineering

Abstract/Summary:

With the continuous development of Internet technology, the Internet has become an important channel through which people obtain information. People both retrieve information from the Internet and constantly upload to it, so the volume of text data online keeps growing, and text classification is critical for quickly obtaining accurate and useful information. Social platforms and news channels publish a wide variety of content, and only effective classification of that content can meet the needs of most Internet users. How to classify texts reasonably and effectively is therefore an important problem.

Text representation, which turns text data into a form a computer can process, is one of the key steps in text classification. Distributed representation (word vector) is the most widely used and most classic text representation method. It maps each word to a short, low-dimensional, dense, real-valued vector and avoids the curse of dimensionality of traditional text representation models. A word vector contains a large number of non-zero real values, each dimension carries a specific meaning, and the information about a word is distributed across all components, so the vector carries more information. As a result, the syntactic and semantic similarity between words can be measured by the cosine similarity between their vectors, which accurately represents the semantic relevance of related or similar words.

The focus of this paper is the word vector. First, the author proposes a max-CBOW model for training word vectors (word embeddings). Then an optimized TF-IDF algorithm that takes category information into account is used to weight the word vector of each word in a text, producing the text vector representation. The result is an improved text classification framework based on word vectors. The main research contents are as follows:

First, this paper studies and improves the word vector generation model. The classic word2vec word vector generation tool is studied. In the original CBOW model, the mapping from the input layer to the projection layer is a simple accumulation of the context word vectors. The author instead uses the most prominent information to represent the semantics, taking the maximum value in each dimension of the context word vectors as the value of the projection layer vector. Based on the CBOW model, the max Continuous Bag-of-Words (max-CBOW) model is proposed for word vector training.

Second, this paper studies and improves the existing text vector representation model. It focuses on text representation models built by combining word vectors and proposes the text representation model CT_CBOW, which uses the TF-IDF_C algorithm, a TF-IDF variant that considers category information, to weight the word vector of each word in the text. A text classification framework based on word vectors is proposed at the same time.

Third, this paper uses information on 100,000 companies crawled from the tianyancha website as an experimental data set. Comparative experiments on this data set and the THUNews data set verify the validity of the word-vector-based text classification method proposed in this paper.
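The two ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names are hypothetical, the embeddings are toy values, and plain TF-IDF stands in for TF-IDF_C, whose category-aware formula is not given in the abstract. The sketch contrasts the summed CBOW projection with the element-wise maximum described for max-CBOW, then builds a text vector as a TF-IDF-weighted sum of word vectors and compares vectors by cosine similarity.

```python
import math

def cbow_projection(context_vectors):
    """Original CBOW projection: element-wise sum of the context word vectors."""
    return [sum(dims) for dims in zip(*context_vectors)]

def max_cbow_projection(context_vectors):
    """max-CBOW projection as described in the abstract: element-wise maximum."""
    return [max(dims) for dims in zip(*context_vectors)]

def tf_idf_weights(doc_tokens, corpus):
    """Plain TF-IDF weights for one document.

    Assumption: this is a stand-in for TF-IDF_C; no category information is used.
    """
    n_docs = len(corpus)
    weights = {}
    for word in set(doc_tokens):
        tf = doc_tokens.count(word) / len(doc_tokens)
        # Guard against df == 0 when the document is not part of the corpus.
        df = max(1, sum(1 for doc in corpus if word in doc))
        weights[word] = tf * math.log(n_docs / df)
    return weights

def text_vector(embeddings, weights):
    """Text vector: weighted sum of the vectors of the words in the text."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word, weight in weights.items():
        if word in embeddings:  # skip out-of-vocabulary words
            for i, value in enumerate(embeddings[word]):
                vec[i] += weight * value
    return vec

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy data (hypothetical values, for illustration only).
context = [[1.0, -4.0, 3.0], [5.0, 2.0, -1.0]]
summed = cbow_projection(context)
maxed = max_cbow_projection(context)

embeddings = {"bank": [1.0, 0.0], "loan": [0.0, 2.0], "game": [3.0, 0.0]}
corpus = [["bank", "loan"], ["bank", "game"]]
weights = tf_idf_weights(corpus[0], corpus)
doc_vec = text_vector(embeddings, weights)
```

In this toy corpus, "bank" occurs in every document and so receives zero weight, which leaves the text vector pointing in the direction of "loan". A category-aware weighting such as the TF-IDF_C described above would replace `tf_idf_weights` without changing the rest of the pipeline.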

Keywords/Search Tags:

word2vec, word vector, text representation, text classification, TF-IDF

Related items

1	Research On Text Classification Based On Word Vector
2	Research On Chinese Short Text Classification Based On Word Embedding
3	Research Of Text Classification Based On Word2vec And Self-attention
4	Research On Text Classification Algorithms Based On Word Vector
5	A Research On Text Vector Representation Based On Semantics
6	Research On Short Text Classification Based On Word Distributed Representation
7	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
8	Exploring Dialogue Text Classification Based On Word Mixture Vectors
9	Research And Implementation Of Text Representation In Continuous Space
10	Comparison And Combination Of Text Classification Based On Word2vec With SVC And AT-LSTM