Research on Chinese Text Classification Based on the Topic Model and Related Technologies

Posted on: 2015-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Wang
GTID: 2298330452953404
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of Web information, quickly finding valuable information in a large volume of data has become increasingly difficult. In response to this demand, intelligent information processing by computer has been widely studied. Automatic text classification and similarity calculation have become research focuses in areas such as information extraction, information retrieval, and natural language processing, and have seen rapid development and wide application. In recent years, machine learning methods have been widely used for automatic text classification; compared with traditional text classification techniques, they have produced good research results and have greater practical value. However, an excessively high-dimensional feature space leads to a very large amount of computation during classification and consumes a great deal of storage space. In addition, because of data skew and noisy data, the samples cannot correctly reflect the real data distribution, so traditional classification techniques alone cannot achieve the desired classification performance. Furthermore, feature selection is an important factor affecting text classification performance, and poor dimensionality reduction degrades classification results.

To better address these problems, we introduce LDA, a new probabilistic topic model, into automatic text classification. The text collection is modeled with LDA, which focuses on the latent semantic relations within the collection and maps the high-dimensional data space onto a lower-dimensional topic space. A classifier is then trained with the SVM classification algorithm. The final results show that this method significantly improves text classification results.

The paper covers the following three research topics:

1. The LDA probabilistic topic model, which has become popular in recent years, is applied to large text data sets. Each category of training data is modeled with LDA to discover the hidden topic information in the text collection. The Gibbs sampling algorithm is used for inference, from which the model parameters are computed indirectly, so that topics are extracted effectively from large text collections and the topic mixture probability distribution of the text collection is finally obtained. This greatly reduces the dimensionality of the feature space and shortens classifier training time.

2. Introducing the LDA model into text similarity calculation is one of the key contributions of this paper. The method fits the data set with the LDA model described above to obtain the latent document-topic matrix, and then calculates the similarity between texts using the JS (Jensen-Shannon) distance. The experimental results show that this method outperforms the calculation method based on the vector space model.

3. Introducing the LDA model into the classification method, in combination with the support vector machine (SVM) classification algorithm, is another important aspect of this research. This method takes full advantage of the strong text representation and dimensionality reduction ability of the LDA model and the strong classification capability of SVM. An LDA model is constructed for each category of texts, and the SVM algorithm is then used to train classifiers on all of the sub-LDA models. Experimental results show that this method outperforms traditional text classification techniques.
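
The similarity step in the abstract compares two texts through their LDA topic distributions using the JS distance. The following is a minimal sketch of that comparison, not the thesis's implementation. It assumes each text is already represented as a topic probability vector (one row of the document-topic matrix). The function names, the base-2 logarithm, the `1 - divergence` similarity score, and the example vectors are all illustrative assumptions. The abstract does not say whether "JS distance" means the Jensen-Shannon divergence or its square root, so both are shown.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits. Terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions.

    With base-2 logarithms the value lies in [0, 1] and is symmetric in p and q.
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same number of topics")
    # Midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence, which is a true metric."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


def topic_similarity(p, q):
    """Hypothetical similarity score: 1 for identical distributions, 0 for disjoint ones."""
    return 1.0 - js_divergence(p, q)


# Made-up topic mixtures over three topics, for illustration only.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.15, 0.80]

print(topic_similarity(doc_a, doc_b))  # close mixtures, higher score
print(topic_similarity(doc_a, doc_c))  # different dominant topic, lower score
```

Because the vectors have one entry per topic rather than one per vocabulary term, the comparison runs in the reduced topic space that the abstract credits for the lower computational cost.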
Keywords/Search Tags: Text Classification, LDA Model, Similarity Calculation, SVM Algorithm