The Research Of Text Classification Based On Word2Vec Language Model And Graph Kernel

Posted on: 2017-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Yuan
GTID: 2348330503983642
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of information technology, the era of “big data” has arrived. Text remains the main form in which information exists in real life. Processing such disordered, unsystematic text data manually is time-consuming and laborious, and it also produces unsatisfactory results. Methods that use machine learning to represent and categorize text automatically have therefore attracted considerable attention from researchers.

The most important part of text classification is extracting features from documents. Commonly used methods include TF-IDF, Bag-of-Words, and LDA. These methods have shortcomings: they lack semantic information, suffer from the curse of dimensionality, and ignore context structure, all of which directly affect the accuracy of text classification. To address these problems and improve accuracy, this paper makes the following contributions:

(1) A new feature extraction method. Building on the word embeddings produced by Word2Vec and inspired by Bag-of-Words, this paper proposes a new feature representation called Bag-of-Clusters. By analyzing the properties of word embeddings, the method constructs a Bag-of-Clusters model to represent text. On the standard data set, the proposed method improves the accuracy of text classification.

(2) A document semantic graph. Although the proposed feature extraction method improves classification accuracy, it inevitably ignores some structural information about words. To make full use of the context structure, this paper represents the raw data as graphs, and then encodes the graph nodes by analyzing the properties of word embeddings to construct the document semantic graph.

(3) A new graph kernel that can be applied to the document semantic graph. In recent years, graph kernels have become an effective way to compute the similarity between graphs. However, existing graph kernels have limitations such as high time complexity, poor scalability, and support for few node types. To match document semantic graphs effectively, this paper designs a new graph kernel. First, suitable bit operations are defined on the document semantic graph to enrich its structural information. Next, the label representation of the graph is obtained by iterative operations. Finally, a kernel function is designed to compute the similarity of graphs. Text classification results on the standard data set show that the designed graph kernel both improves accuracy and reduces computational time complexity.

(4) Enriched edge semantics and an updated graph kernel. The graph kernel described above still ignores the semantic information of edges in the document semantic graph. To enrich the semantic information of the graph and extend the kernel to match graphs with encoded edges, this paper adds word offsets to the semantic labels of edges. The graph kernel is then updated and applied to computing the similarity of document semantic graphs. The experimental results show that this method improves accuracy at low time complexity, which supports the effectiveness of the proposed graph kernel.

In summary, this paper proposes new methods that improve the accuracy of text classification, and the new graph kernel reduces computational time complexity. The work therefore has research significance.
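
The abstract does not say how the Bag-of-Clusters model in contribution (1) is built, so the following is only a minimal sketch of one plausible reading: cluster the word embeddings, then represent each document by how many of its words fall into each cluster. The use of k-means, the toy 2-D vectors in `EMBEDDINGS`, and the function names `kmeans` and `bag_of_clusters` are all assumptions made for illustration, not the author's implementation.

```python
from collections import Counter
from math import dist

# Hypothetical toy "embeddings"; a real pipeline would load Word2Vec vectors.
EMBEDDINGS = {
    "cat": (0.9, 0.1), "stock": (0.1, 0.9),
    "dog": (1.0, 0.2), "bank": (0.0, 1.0),
    "horse": (0.8, 0.0), "loan": (0.2, 0.8),
}


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_centroid(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def bag_of_clusters(tokens, embeddings, centroids):
    """Count how many known tokens fall into each cluster; unknown tokens are skipped."""
    counts = Counter(
        nearest_centroid(embeddings[t], centroids) for t in tokens if t in embeddings
    )
    return [counts[c] for c in range(len(centroids))]


centroids = kmeans(list(EMBEDDINGS.values()), k=2)
animal_doc = bag_of_clusters(["cat", "dog", "loan"], EMBEDDINGS, centroids)
finance_doc = bag_of_clusters(["bank", "unknown"], EMBEDDINGS, centroids)
```

Under these assumptions, the resulting vector has one dimension per cluster rather than one per vocabulary word, which is how a representation of this kind could be much lower-dimensional than Bag-of-Words while still grouping semantically similar words together.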
Keywords/Search Tags: Text Categorization, Word Embedding, Language Model, Graph Kernel