Chinese Text Classification Based On The Theme Model And Related Technology Research

Posted on:2015-08-30

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Wang

Full Text:PDF

GTID:2298330452953404

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid growth of Web information, quickly finding valuable information in a large volume of data has become increasingly difficult, and in response to this demand, intelligent information processing by computer has been widely studied. Automatic text classification and similarity calculation have become research focuses in areas such as information extraction, information retrieval, and natural language processing, and have seen rapid development and wide application. In recent years, machine learning methods have been widely used for automatic text classification and, compared with traditional text classification techniques, have produced good research results and greater practical value. However, an excessively high-dimensional feature space leads to a very large amount of computation during classification and consumes huge storage space. In addition, because of data skew and noisy data, the samples cannot correctly reflect the real data distribution, so traditional classification techniques alone cannot achieve the desired classification performance. Furthermore, feature selection is an important factor affecting the performance of text classification, and poor dimensionality reduction degrades classification results.

To better address these problems, we introduce the probabilistic topic model LDA into automatic text classification. The text collection is modeled with LDA, which focuses on the latent semantic relations in the collection and maps the data onto a lower-dimensional topic space. A classifier is then trained with the SVM classification algorithm. The final results show that this method significantly improves text classification results.

The paper includes the following three pieces of research:

1. LDA, a probabilistic topic model that has become popular in recent years, is applied to large text data sets. Each class of training data is modeled with LDA to find the hidden topic information in the text collection. The Gibbs sampling algorithm is used for inference, computing the model parameters indirectly. This effectively extracts topics from large text collections and finally yields the topic mixture probability distribution of the text collection, which greatly reduces the dimension of the feature space and shortens classifier training time.

2. Introducing the LDA model into text similarity calculation is one of the key contributions of this paper. The method fits the data set with the LDA model described above to obtain the hidden document-topic matrix, then calculates the similarity between texts using the JS distance. The experimental results show that this method is better than the calculation method based on the vector space model.

3. Introducing the LDA model into the classification method, combined with the support vector machine (SVM) classification algorithm, is another important aspect of this research. This method takes full advantage of the powerful text representation and dimensionality reduction ability of the LDA model and the strong classification capability of SVM. An LDA model is built for each class of text, and the SVM algorithm is then used to train classifiers on all of the sub-LDA models. Experimental results show that this method is better than traditional text classification techniques.
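
The first two points above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a toy corpus of word ids, symmetric priors alpha and beta chosen arbitrarily, and the standard collapsed Gibbs sampler for LDA. The names lda_gibbs and js_divergence are hypothetical. The abstract does not say whether "JS distance" means the Jensen-Shannon divergence or its square root; the sketch assumes the divergence with base-2 logarithms.

```python
import math
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                              # tokens assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed estimate of the topic mixture for each document.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy corpus (an assumption): word ids 0-2 and 3-5 form two loose topics.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 2, 0], [3, 4, 5, 3, 4], [5, 4, 3, 5, 3]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of theta is a low-dimensional representation of one document, and js_divergence(theta[i], theta[j]) gives a symmetric dissimilarity between documents i and j. The same rows could serve as feature vectors for an SVM classifier, as in the third point.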

Keywords/Search Tags:

Text Classification, LDA model, Similarity Calculation, SVM Algorithm

Related items

1	Research And Implementation Of College Enrollment Question And Answer Service System Based On Deep Learning
2	Study On Chinese Text Classification Technology Based On Improved Text Similarity Algorithm
3	Forum Data Extraction Based On Similarity Calculation
4	Parallel Implementation On Document Classification And Similarity Analysis
5	Study Of Chinese Text Classification
6	Research And Application Of Text Similarity Calculation Method Based On Structured Representation Learning
7	Research On Short Text Similarity Calculation Method Based On Siamese Structural Model
8	Study On Similarity-based Text Clustering Algorithm And Its Application
9	Course Similarity Calculation Using Efficient Manifold Ranking
10	Research On Text Representation Model And Similarity Calculation Algorithm