
Research And Improvement On Text Classification Based On Word Embedding

Posted on: 2017-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Wang
Full Text: PDF
GTID: 2308330485970927
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the amount of data is growing rapidly. How to retrieve relevant information from huge volumes of resources accurately, rapidly and comprehensively has become a focus of research in the field of information technology. Text classification is one of the important technologies in text mining: it facilitates information retrieval and the efficient management of massive data, and therefore has significant research value.

In this paper, we study several important techniques of text classification, including text preprocessing, text representation models, feature selection algorithms and classification algorithms. On the basis of a detailed study of this process, we focus on the working principle of word2vec, a word-vector training tool based on deep learning and open-sourced by Google, and apply it to the improvement of traditional feature selection algorithms.

Feature selection is a very important step in text classification: without feature selection to reduce dimensionality, processing high-dimensional text leads to the "curse of dimensionality". Feature selection affects not only the classifier's results but also its training time. We study the most commonly used feature selection algorithms, including information gain, chi-square and mutual information, and analyze their advantages and disadvantages. Because the chi-square feature selection algorithm suffers from the "incomplete feature words" defect, we propose an improved text feature selection algorithm based on word vectors. We put forward the assumption that feature items similar to ones with strong category-distinguishing ability will themselves distinguish categories well. We therefore apply the word vectors trained by word2vec to the traditional feature selection process: using the similarity between word vectors, we supplement the feature set to make up for the "incomplete feature words" deficiency.
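The abstract does not reproduce the formulas. As a minimal sketch of the two ingredients it names, the code below implements the standard document-count chi-square statistic and a similarity-based supplementation step; the toy 2-dimensional "embeddings" and the similarity threshold are assumptions standing in for real word2vec vectors and the thesis's actual parameters:

```python
import numpy as np

def chi_square(n, a, b, c, d):
    """Chi-square statistic for a term t and a class from document counts:
    a: docs in the class containing t, b: docs outside it containing t,
    c: docs in the class without t, d: docs outside it without t, n = a+b+c+d."""
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def supplement_features(selected, vectors, threshold=0.8):
    """Add any vocabulary word whose embedding is close enough to an
    already-selected feature word (the 'supplement' idea from the text;
    the threshold value is illustrative, not taken from the thesis)."""
    extra = {w for w in vectors
             if w not in selected
             and any(cosine(vectors[w], vectors[s]) >= threshold
                     for s in selected)}
    return selected | extra

# Toy illustration with made-up 2-d embeddings.
vectors = {
    "good": np.array([1.0, 0.0]),
    "great": np.array([0.9, 0.1]),   # close to "good"
    "table": np.array([0.0, 1.0]),   # unrelated
}
features = supplement_features({"good"}, vectors)  # picks up "great" only
```

In practice the `vectors` dict would come from a word2vec model trained on the corpus, and the supplemented set would feed the classifier instead of the raw chi-square top-k list.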
Because the chi-square-based feature selection algorithm also suffers from the "low-frequency words" defect, we propose an improved algorithm that combines concentration and dispersion.

Taking the chi-square test as the feature extraction algorithm and SVM as the classification algorithm, we develop an automatic text classification system. Based on this system, we examine the effectiveness and feasibility of the proposed improved algorithms through a large number of experiments. We use the Chinese text classification corpus published by Sogou Laboratory as experimental data, and take precision, recall and F value as the measures. The experimental results show that the proposed word-vector-based feature selection algorithm is clearly better than traditional methods, and that the improved feature selection algorithm combining concentration and dispersion also achieves better results.
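The abstract does not define concentration and dispersion. One common formulation, used here purely as an assumption, weights a term's chi-square score by how concentrated its document frequency is in one class and how widely it is spread within that class, which down-weights low-frequency words:

```python
def concentration(df_in_class, df_total):
    """Fraction of all documents containing the term that belong to the class
    (higher = the term is concentrated in this class)."""
    return df_in_class / df_total

def dispersion(df_in_class, class_size):
    """Fraction of the class's documents that contain the term
    (higher = the term is spread widely within the class; low-frequency
    words score low here)."""
    return df_in_class / class_size

def adjusted_score(chi2, df_in_class, df_total, class_size):
    """Hypothetical combination: scale the chi-square value by both factors."""
    return chi2 * concentration(df_in_class, df_total) * dispersion(df_in_class, class_size)

# A rare word seen in only 2 of 50 class documents is down-weighted
# relative to a word seen in 40 of them, even at equal chi-square values.
rare = adjusted_score(36.0, 2, 2, 50)      # perfectly concentrated but rare
common = adjusted_score(36.0, 40, 50, 50)  # concentrated and widespread
```

This captures the stated motivation (correcting the chi-square test's bias toward low-frequency words); the thesis's exact weighting scheme may differ.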
Keywords/Search Tags:text categorization, feature selection, word embedding, word2vec, similarity