
Research On The Method Of Short Text Categorization Based On Topical Similarity

Posted on: 2018-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: B Li
Full Text: PDF
GTID: 2348330518982353
Subject: Computer software and theory
Abstract/Summary:
With the widespread use of the Internet, new media such as WeChat, microblogs, and question-answering systems generate huge volumes of short text every day. These texts are short, carry little content, use nonstandard wording, arrive in large quantities, and are semi-structured data. Applying long-text processing methods directly to short texts rarely yields satisfactory mining results. How to mine the information hidden in short texts accurately, efficiently, and in real time has therefore become a focus of research in Chinese information processing and text mining.
Because short texts are brief, low in content, numerous, and semantically ambiguous, short text classification suffers from sparse features, noise, and context dependence. Search-engine-based methods depend heavily on the search engine, and methods based on large-scale corpora depend heavily on external corpora. Based on an analysis of the characteristics of short texts and the shortcomings of current classification methods, and addressing the sparse feature matrices and context dependence that arise in short text modeling, this thesis classifies short texts by judging their similarity at the topic level.
First, the literature is reviewed and the theory and methods of Chinese text categorization are analyzed, with a focus on short text classification. Analysis of the traditional VSM-based short text classification method shows that the feature matrix produced by short text modeling is sparse and high-dimensional, which hinders accurate classification. A classification algorithm based on topic similarity is therefore designed.
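The sparsity claim about VSM features can be made concrete with a small sketch. It builds a bag-of-words count matrix for four made-up English short texts and reports the fraction of zero entries. The helper names `bag_of_words_matrix` and `sparsity`, the toy texts, and the whitespace tokenizer are all assumptions for illustration; Chinese text would need a word segmenter, and the thesis may weight terms differently (for example with TF-IDF).

```python
def bag_of_words_matrix(texts):
    """Build a dense term-count matrix (rows: texts, columns: vocabulary terms)."""
    # Assumption: whitespace tokenization, adequate only for this English toy data.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in texts]
    for i, tokens in enumerate(tokenized):
        for token in tokens:
            matrix[i][column[token]] += 1
    return vocabulary, matrix


def sparsity(matrix):
    """Return the fraction of zero entries in the matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical short texts that share no words with each other.
short_texts = [
    "cheap flights to beijing",
    "best noodle shop nearby",
    "stock market falls again",
    "new phone release date",
]
vocabulary, matrix = bag_of_words_matrix(short_texts)
print(len(vocabulary), sparsity(matrix))  # prints: 16 0.75
```

Each text fills only 4 of the 16 columns, so three quarters of the matrix is zero even on this tiny sample. The vocabulary grows with every new text while each row stays short, which is the high-dimensional, sparse behavior the abstract attributes to VSM modeling of short texts.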
Applying topic mining theory and methods, the LDA probabilistic model is used to estimate the topic probability distribution vector of each short text.
Second, the traditional KNN algorithm requires a particularly large amount of computation during classification, and the cost grows further on large short text collections. This thesis therefore constructs an improved classifier based on locality-sensitive hashing (LSH), which addresses the approximate nearest neighbor (ANN) problem and enables fast classification of short texts at the topic level. A KNN classifier is used to solve the problem.
Finally, this thesis describes theoretically the construction of the improved LSH classifier based on KNN, which can improve classification quality and reduce classification time to a certain extent. Following the proposed classification method, models are built in a Linux environment and classification is performed in MATLAB. Comparative experiments against the VSM-based classification method show that the topic-similarity-based method achieves good overall classification performance.
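The idea of combining LSH with KNN over topic vectors can be sketched as follows. This is not the thesis's improved classifier, whose details the abstract does not give. It is a minimal illustration under stated assumptions: topic vectors are already available (for example from an LDA model), hashing uses random hyperplanes, similarity is cosine, and the final label comes from a majority vote. The class name `LshKnnClassifier`, its parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class LshKnnClassifier:
    """KNN over topic vectors, with candidates pre-filtered by LSH buckets."""

    def __init__(self, num_topics, num_bits=4, num_tables=4, k=3, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.num_topics = num_topics
        # One independent set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(num_topics)] for _ in range(num_bits)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.items = []  # list of (topic_vector, label)

    def _signature(self, vec, planes):
        # Assumption: centre the distribution so hyperplanes through the
        # origin can separate vectors that all lie in the positive orthant.
        centred = [x - 1.0 / self.num_topics for x in vec]
        return tuple(
            sum(p * x for p, x in zip(plane, centred)) >= 0 for plane in planes
        )

    def fit(self, vectors, labels):
        for vec, label in zip(vectors, labels):
            idx = len(self.items)
            self.items.append((vec, label))
            for planes, table in zip(self.planes, self.tables):
                table[self._signature(vec, planes)].append(idx)
        return self

    def predict(self, vec):
        # Collect training items that share a bucket with the query.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._signature(vec, planes), []))
        if not candidates:  # no bucket hit: fall back to an exact scan
            candidates = range(len(self.items))
        ranked = sorted(
            candidates, key=lambda i: cosine(vec, self.items[i][0]), reverse=True
        )
        votes = Counter(self.items[i][1] for i in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical 3-topic distributions for six labelled short texts.
train_vectors = [
    [0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.75, 0.15, 0.10],
    [0.10, 0.80, 0.10], [0.15, 0.70, 0.15], [0.10, 0.75, 0.15],
]
train_labels = ["sports"] * 3 + ["finance"] * 3
clf = LshKnnClassifier(num_topics=3).fit(train_vectors, train_labels)
pred = clf.predict([0.72, 0.18, 0.10])
print(pred)
```

Only the training items that land in the same bucket as the query in at least one table are compared exactly, which is where the saving over a full KNN scan comes from. More bits per table shrink the buckets and speed up queries, while more tables reduce the chance of missing a true neighbor.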
Keywords/Search Tags: Short text, Topic Similarity, LDA model, Text Categorization, KNN