Text Classification Based On Word Vector And Topic Vector

Posted on:2017-07-06

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Guo

Full Text:PDF

GTID:2348330509960272

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of the Internet, more and more text must be processed in daily life. Extracting the target information from vast amounts of text, so as to give people better services and experiences, is a major challenge. Text classification is an important technology for solving this problem. As a basic technology, it has been applied in intelligent library management, news recommendation, text sentiment analysis, text information filtering, and so on. These applications have made people’s lives much more convenient. Based on a study of existing text classification techniques, this thesis identifies the shortcomings of existing algorithms and proposes a new text feature extraction framework. The main contents are as follows:

(1) Word vector algorithms can represent the similarity between words and can extract better features. Word2Vec is one of the best word vector algorithms, with excellent performance and speed. However, word vectors cannot solve the problem of polysemy, and they represent only local context information, lacking global information. We propose a method that combines topics with word vectors to obtain a topic vector, which is similar to the word vector. Because the same word may have different topic vectors, and the topic vector carries global-level information, combining the word vector with the topic vector better introduces full-text information.

(2) Some feature extraction methods directly accumulate word vectors or use similar low-dimensional features. Such low-dimensional features are not conducive to representing text over a high-dimensional dictionary, and they abandon the advantages of the high-dimensional vector space model for text classification. This thesis therefore retains the vector space model by using the Adaptive-means clustering algorithm. The adaptive clustering algorithm is combined with the word vector and the topic vector, so that words with similar meanings fall into the same cluster and similar words make the same contribution to the features. In addition, this thesis uses n-grams to add context information and to expand the features of short texts, yielding the final text features.

(3) This thesis uses two news data sets to verify the algorithm and compares the results with other existing algorithms. The experiments demonstrate the advantage of combining the word vector with the topic vector and of using a high-dimensional vector space model. The effect of the parameters in the experiments is analyzed, and a general method of parameter selection is proposed. Finally, a text classification scheme is determined, which provides classification results for the subsequent work on news recommendation.
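The idea in points (1) and (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy vectors, the topic names, the concatenation in `combine`, and the function names are all assumptions made for illustration, and plain k-means with fixed initial centroids stands in for the adaptive clustering algorithm, whose details the abstract does not give. The sketch shows how the same word ("bank") can land in different clusters once its topic vector is attached, and how a document can then be represented by per-cluster counts in a vector space model.

```python
from collections import Counter

# Hypothetical toy vectors; real ones would come from Word2Vec and a topic model.
WORD_VECS = {
    "bank": [0.9, 0.1],
    "loan": [0.8, 0.2],
    "river": [0.1, 0.9],
    "shore": [0.2, 0.8],
}
TOPIC_VECS = {
    "finance": [1.0, 0.0],
    "nature": [0.0, 1.0],
}


def combine(word, topic):
    """Concatenate a word vector with the vector of the topic assigned to it."""
    return WORD_VECS[word] + TOPIC_VECS[topic]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def kmeans(points, centroids, iters=10):
    """Plain k-means from given initial centroids (stand-in for adaptive clustering)."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def doc_features(tokens, centroids):
    """Vector-space feature: number of (word, topic) tokens falling in each cluster."""
    counts = Counter(nearest(combine(w, t), centroids) for w, t in tokens)
    return [counts.get(i, 0) for i in range(len(centroids))]


# Vocabulary of (word, topic) pairs; "bank" appears under two topics.
vocab = [
    ("bank", "finance"),
    ("loan", "finance"),
    ("bank", "nature"),
    ("river", "nature"),
    ("shore", "nature"),
]
points = [combine(w, t) for w, t in vocab]
centroids = kmeans(points, [points[0], points[3]])

doc = [("bank", "finance"), ("loan", "finance"), ("bank", "nature")]
print(doc_features(doc, centroids))  # prints [2, 1]
```

In this toy setting the two senses of "bank" are counted in different clusters, while "bank" and "loan" under the finance topic share one cluster and so contribute to the same feature dimension. An n-gram extension would add tokens built from adjacent words before counting; it is left out here because the abstract does not specify how the n-grams are formed.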

Keywords/Search Tags:

text classification, word vector, topic vector, vector space model, similarity measure, adaptive clustering

Related items

1	Research On The Construction Method Of Technology Domain Thematic Library Based On Multilevel Topic Vector
2	Research On Semantic Representation Of Text Based On Topic Model
3	Improved Vector Space Model And Its Application To Document Classification System
4	Study On Similarity-based Text Clustering Algorithm And Its Application
5	Research On English Text Clustering Method Based On Vector Space
6	Research And Implementation Of Chinese Text Clustering Algorithms
7	Text Classification Algorithm Based On Chinese And English Topic Space
8	Design And Implementation Of The Character Classification System Used In Search Engine
9	Research On Key Techniques Of Cross-Language Text Similarity Detection Based On Word Vector
10	Research On Text Similarity Algorithm Based On VSM Combined With Word Semantics