Research On The Method Of Short Text Categorization Based On Topical Similarity

Posted on:2018-06-28

Degree:Master

Type:Thesis

Country:China

Candidate:B Li

Full Text:PDF

GTID:2348330518982353

Subject:Computer software and theory

Abstract/Summary:

With the widespread use of the Internet, new media such as WeChat, microblogs, and question answering systems generate huge numbers of short text messages every day. These texts are short, carry little content, use nonstandard wording, arrive in large volumes, and are semi-structured data. Applying long-text processing methods directly to short text rarely yields satisfactory mining results. How to mine the hidden information in short text accurately, efficiently, and in real time is therefore a focus of research in Chinese information processing and text mining.

Because short texts are brief, sparse in content, numerous, and semantically ambiguous, short text classification suffers from sparse features, noise, and context dependence. Search-engine-based short text categorization methods depend heavily on the search engine used, and methods based on large-scale corpora depend heavily on external corpora. Based on an analysis of the characteristics of short text and the shortcomings of current short text classification methods, this thesis addresses the sparse feature matrices produced by short text modeling and the context dependence of short text, and classifies short texts by judging their similarity at the topic level.

First, the literature is reviewed and the theory and methods of Chinese text categorization are analyzed, with a focus on short text classification. Analysis of the traditional short text classification method based on the vector space model (VSM) shows that the feature matrix produced by short text modeling is sparse and high-dimensional, which hinders accurate classification. A classification algorithm based on topic similarity is therefore designed. Drawing on the theory and methods of topic mining, the latent Dirichlet allocation (LDA) probabilistic model is used to estimate the topic probability distribution vector of each short text.

Second, the traditional KNN algorithm requires a particularly large amount of computation during classification, and the cost grows further on large short text collections. This thesis constructs an improved classifier based on locality-sensitive hashing (LSH) to solve the approximate nearest neighbor (ANN) problem and to classify short texts quickly at the topic level. A KNN classifier is used to solve the problem.

Finally, the thesis describes theoretically the construction of the improved KNN-based LSH classifier, which can improve classification quality and reduce classification time to a certain extent. Following the proposed method, modeling is carried out in a Linux environment and classification in MATLAB. Comparison experiments against the VSM-based classification method show that the topic-similarity-based method achieves good overall classification performance.
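
The pipeline described above (topic vectors, LSH candidate lookup, KNN vote) can be illustrated with a minimal Python sketch. It is not the thesis's improved LSH classifier, whose details the abstract does not give. It substitutes a generic random-hyperplane LSH and assumes the topic distribution vectors have already been estimated by LDA. The class name `LshKnn`, its parameters, and the sample vectors and labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def signature(vec, planes):
    """Random-hyperplane hash: one bit per plane, 1 when the dot product is >= 0."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)


def cosine(a, b):
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


class LshKnn:
    """Hypothetical KNN classifier that only compares vectors sharing an LSH bucket."""

    def __init__(self, dim, n_bits=4, n_tables=8, k=3, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.items = []
        # Each table has its own random hyperplanes and its own bucket map.
        self.tables = []
        for _ in range(n_tables):
            planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
            self.tables.append((planes, defaultdict(list)))

    def fit(self, vectors, labels):
        self.items = list(zip(vectors, labels))
        for idx, (vec, _) in enumerate(self.items):
            for planes, buckets in self.tables:
                buckets[signature(vec, planes)].append(idx)
        return self

    def predict(self, vec):
        # Candidates are the training items that collide with the query
        # in at least one hash table.
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), []))
        if not candidates:
            # No collision at all: fall back to an exact scan.
            candidates = range(len(self.items))
        ranked = sorted(candidates, key=lambda i: cosine(vec, self.items[i][0]), reverse=True)
        votes = Counter(self.items[i][1] for i in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Made-up topic distributions over 3 topics, standing in for LDA output.
train_vectors = [
    [0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80], [0.10, 0.20, 0.70], [0.05, 0.05, 0.90],
]
train_labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

classifier = LshKnn(dim=3).fit(train_vectors, train_labels)
predicted = classifier.predict([0.75, 0.15, 0.10])
```

Because a query is compared only with the training vectors that share a bucket with it, the cost of each prediction depends on bucket sizes instead of the full training set. The price is that the neighbors found are approximate, which is the trade-off the ANN formulation accepts.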

Keywords/Search Tags:

Short text, Topic Similarity, LDA model, Text Categorization, KNN

Related items

1	Topic Categorization Of Short Text Sequences
2	Key Technology Research On Short Text Similarity
3	A Biterm Pseudo Document Topic Model For Short Text
4	Forum Topic Model Based On A Combination Of Selective Long Text And Short Text
5	Research On Topic Evolution Of Short Text Based On Self-Aggregation Strategy
6	Interest Modeling Based On Short Text Analysis
7	A Short Text Similarity Calculation Method Based On Feature Extension Using BTM Topic Model
8	The Text Categorization And Structure Of Theme Words Network Based On Topic Models
9	Research On Short Text Classification Based On Topic Model
10	Research On Short Text Topic Discovery Based On BTM Topic Model