
Research And Improvement On Text Classification Based On Word Embedding

Posted on: 2017-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Wang
Full Text: PDF
GTID: 2308330485970927
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the amount of data is growing rapidly. How to retrieve relevant information from huge volumes of resources accurately, rapidly and comprehensively has become a focus of research in the field of information technology. Text classification is one of the important technologies in text mining: it facilitates information retrieval and the efficient management of massive data, and therefore has significant research value.

In this paper, we study several important techniques of text classification, including text preprocessing, text representation models, feature selection algorithms and classification algorithms. On the basis of a detailed study of this process, we focus on the working principle of word2vec, a word-vector training tool based on deep learning and open-sourced by Google, and apply it to the improvement of traditional feature selection algorithms.

Feature selection is a very important step in text classification: without feature selection to reduce dimensionality, processing high-dimensional text leads to the "curse of dimensionality". Feature selection affects not only the classifier's results but also its training time. We study the most commonly used feature selection algorithms, including information gain, chi-square and mutual information, and analyze their advantages and disadvantages. Because the chi-square feature selection algorithm suffers from the "incomplete feature words" defect, we propose an improved text feature selection algorithm based on word vectors. We put forward the assumption that feature items similar to ones with strong category-distinguishing ability will themselves distinguish categories well. We therefore apply the word vectors trained by word2vec to the traditional feature selection process: using the similarity between word vectors, we supplement the feature set to make up for the "incomplete feature words" deficiency.
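The abstract does not reproduce the formulas. As a minimal sketch of the two ingredients it names, the code below implements the standard document-count chi-square statistic and a similarity-based supplementation step; the toy 2-dimensional "embeddings" and the similarity threshold are assumptions standing in for real word2vec vectors and the thesis's actual parameters:

```python
import numpy as np

def chi_square(n, a, b, c, d):
    """Chi-square statistic for a term t and a class from document counts:
    a: docs in the class containing t, b: docs outside it containing t,
    c: docs in the class without t, d: docs outside it without t, n = a+b+c+d."""
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def supplement_features(selected, vectors, threshold=0.8):
    """Add any vocabulary word whose embedding is close enough to an
    already-selected feature word (the 'supplement' idea from the text;
    the threshold value is illustrative, not taken from the thesis)."""
    extra = {w for w in vectors
             if w not in selected
             and any(cosine(vectors[w], vectors[s]) >= threshold
                     for s in selected)}
    return selected | extra

# Toy illustration with made-up 2-d embeddings.
vectors = {
    "good": np.array([1.0, 0.0]),
    "great": np.array([0.9, 0.1]),   # close to "good"
    "table": np.array([0.0, 1.0]),   # unrelated
}
features = supplement_features({"good"}, vectors)  # picks up "great" only
```

In practice the `vectors` dict would come from a word2vec model trained on the corpus, and the supplemented set would feed the classifier instead of the raw chi-square top-k list.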
Because the chi-square-based feature selection algorithm also suffers from the "low-frequency words" defect, we propose an improved algorithm that combines concentration and dispersion.

Taking the chi-square test as the feature extraction algorithm and SVM as the classification algorithm, we develop an automatic text classification system. Based on this system, we examine the effectiveness and feasibility of the proposed improved algorithms through a large number of experiments. We use the Chinese text classification corpus published by Sogou Laboratory as experimental data, and take precision, recall and F value as the measures. The experimental results show that the proposed word-vector-based feature selection algorithm is clearly better than traditional methods, and that the improved feature selection algorithm combining concentration and dispersion also achieves better results.
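The abstract does not define concentration and dispersion. One common formulation, used here purely as an assumption, weights a term's chi-square score by how concentrated its document frequency is in one class and how widely it is spread within that class, which down-weights low-frequency words:

```python
def concentration(df_in_class, df_total):
    """Fraction of all documents containing the term that belong to the class
    (higher = the term is concentrated in this class)."""
    return df_in_class / df_total

def dispersion(df_in_class, class_size):
    """Fraction of the class's documents that contain the term
    (higher = the term is spread widely within the class; low-frequency
    words score low here)."""
    return df_in_class / class_size

def adjusted_score(chi2, df_in_class, df_total, class_size):
    """Hypothetical combination: scale the chi-square value by both factors."""
    return chi2 * concentration(df_in_class, df_total) * dispersion(df_in_class, class_size)

# A rare word seen in only 2 of 50 class documents is down-weighted
# relative to a word seen in 40 of them, even at equal chi-square values.
rare = adjusted_score(36.0, 2, 2, 50)      # perfectly concentrated but rare
common = adjusted_score(36.0, 40, 50, 50)  # concentrated and widespread
```

This captures the stated motivation (correcting the chi-square test's bias toward low-frequency words); the thesis's exact weighting scheme may differ.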
Keywords/Search Tags:text categorization, feature selection, word embedding, word2vec, similarity