Text Classification Based On Word Vector And Topic Vector

Posted on: 2017-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Guo
GTID: 2348330509960272
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet, more and more text information must be processed in daily life. Extracting the target information from this vast amount of text, so as to give people better services and experiences, is a major challenge. Text classification is an important technology for addressing this problem. As a basic technology, it has been applied in intelligent library management, news recommendation, text sentiment analysis, text information filtering, and so on, and these applications have made people's lives much more convenient. Based on a study of existing text classification techniques, this thesis identifies shortcomings of existing algorithms and proposes a new text feature extraction framework. The main contents are as follows:

(1) Word vector algorithms can represent the similarity between words and therefore yield better features. Word2Vec is one of the best word vector algorithms, with strong performance and speed. However, a word vector cannot resolve polysemy, and it represents only local context information and lacks global information. We propose a method that combines topics with word vectors to obtain topic vectors, which are similar in form to word vectors. Because the same word may have different topic vectors, and because the topic vector carries global-level information, combining the word vector with the topic vector better introduces full-text information.

(2) Some feature extraction methods directly accumulate word vectors or use similar low-dimensional features. Such low-dimensional features are poorly suited to representing text drawn from a high-dimensional dictionary, and they abandon the advantages of the high-dimensional vector space model for text classification. This thesis therefore retains the vector space model by using the Adaptive-means clustering algorithm. The adaptive clustering algorithm is applied to the combined word vectors and topic vectors, so that words with similar meanings fall into the same cluster and similar words make the same contribution to the features. In addition, this thesis uses n-grams to add context information and to expand the features of short texts, yielding the final text features.

(3) This thesis uses two news data sets to verify the algorithm and compares the results with other existing algorithms. The experiments demonstrate the advantage of combining the word vector with the topic vector and of using the high-dimensional vector space model. The effect of the experimental parameters is then analyzed, and a general method of parameter selection is proposed. Finally, a text classification scheme is determined, which provides classification results for subsequent work on news recommendation.
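The cluster-based feature idea in point (2) can be illustrated with a small sketch. This is not the thesis's implementation. It makes several assumptions that the abstract does not confirm: the word vector and topic vector are combined by simple concatenation, a plain k-means with a deterministic farthest-first initialization stands in for the Adaptive-means clustering algorithm, and all vectors, topic ids, and function names (`kmeans`, `build_cluster_map`, `document_features`) are hypothetical toy values chosen for illustration. The n-gram expansion is omitted.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means (a stand-in, NOT the thesis's adaptive algorithm).

    Centers are initialized deterministically: the first point, then
    repeatedly the point farthest from the centers chosen so far.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Move every non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_cluster_map(word_vec, topic_vec, pairs, k):
    """Cluster (word, topic) pairs on the concatenated vectors (assumption)."""
    points = [list(word_vec[w]) + list(topic_vec[t]) for w, t in pairs]
    return dict(zip(pairs, kmeans(points, k)))


def document_features(tokens, cluster_of, k):
    """Count how many tokens fall in each cluster: one dimension per cluster."""
    counts = Counter(cluster_of[tok] for tok in tokens if tok in cluster_of)
    return [counts[c] for c in range(k)]


# Hypothetical toy vectors: "apple" has one word vector but two topics.
WORD_VEC = {"apple": [0.5, 0.5], "banana": [0.9, 0.1], "iphone": [0.1, 0.9]}
TOPIC_VEC = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # 0 = food, 1 = technology
PAIRS = [("apple", 0), ("banana", 0), ("apple", 1), ("iphone", 1)]
K = 2

cluster_of = build_cluster_map(WORD_VEC, TOPIC_VEC, PAIRS, K)
doc = [("apple", 0), ("banana", 0), ("banana", 0), ("apple", 1)]
features = document_features(doc, cluster_of, K)
print(features)
```

In this toy setup the two senses of "apple" share a word vector but land in different clusters because their topic vectors differ, which mirrors the abstract's point about polysemy. Words in the same cluster increment the same feature dimension, so the document is still represented in a vector space model, with one dimension per cluster rather than one per dictionary word.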
Keywords/Search Tags:text classification, word vector, topic vector, vector space model, similarity measure, adaptive clustering