Chinese Text Classification Method Based On Improved Topic Model

Posted on:2019-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:X Li

Full Text:PDF

GTID:2428330563992457

Subject:Electronic Science and Technology

Abstract/Summary:

The rapid development of Internet technology has greatly increased the amount of data online, and more and more attention is being paid to extracting useful information from it. Text classification is currently a key technique for managing such data efficiently. Most of this data, however, is unstructured, high-dimensional and sparse, which makes text classification difficult. How to reduce dimensionality and improve the performance of text classifiers is therefore a significant research topic in natural language processing. The main work is as follows:

First, a PSC-LDA model based on parts of speech and their combinations is proposed. The model takes into account that different parts of speech contribute differently to semantic expression in Chinese. The whole text set is divided into four parts, including a noun set, a noun-verb combination set and an other-words set that combines adjectives and adverbs. A model is built on each of the four data sets, with Gibbs sampling used to estimate the parameters indirectly, which yields the text-topic mixed probability distribution of each data set. Using the text classification corpus provided by Li Ronglu of Fudan University, the optimal word set and optimal number of topics for the PSC-LDA model are determined experimentally. The results show that, compared with the standard LDA model, PSC-LDA reduces modeling time by 39.44 percent and the dimension of the training data required for modeling by 37.74 percent.

Second, a PSC-LDA_SVM method is proposed, a multi-class text classification method that combines the PSC-LDA model with the SVM algorithm. The method can effectively extract latent topic information from large-scale text data, representing features while reducing dimensionality. It can also handle linear inseparability and avoid local optima. PSC-LDA_SVM is compared with the PSC-LDA_KNN, LDA_SVM and VSM_SVM methods on text classification performance. Its macro precision is higher than that of the other three methods by 4.6 percent, 4.3 percent and 5.3 percent respectively; its macro recall is higher by 4.9 percent, 5.5 percent and 7.1 percent respectively; and its macro F1 is higher by 4.9 percent, 5.1 percent and 6.5 percent respectively.
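As a rough illustration of the macro-averaged precision, recall and F1 measures used in the comparison above, the following minimal Python sketch computes them from true and predicted labels. The function name `macro_scores` and the toy labels are hypothetical and do not come from the thesis. The sketch also assumes that macro F1 is the unweighted mean of the per-class F1 values; the thesis may instead derive it from the macro precision and macro recall.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) for multi-class labels.

    Assumption: macro F1 is the unweighted mean of the per-class F1 scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted wrongly
            fn[t] += 1      # class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Guard against empty denominators for classes never predicted or never seen.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the thesis.
    true_labels = ["sports", "sports", "economy", "economy"]
    predicted = ["sports", "economy", "economy", "economy"]
    print(macro_scores(true_labels, predicted))
```

Because every class is weighted equally, a macro average is sensitive to performance on small classes, which is one reason it is commonly reported for multi-class text classification.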

Keywords/Search Tags:

Data mining, Multi-class classification, Feature extraction, Latent dirichlet allocation, Support vector machine

Related items

1	Research And Application Of Text Classification Model Based On Topic Model
2	Research And Implementation Of Spark-based Text Classification
3	Aurora Image Classification Based On Multi-Feature Latent Dirichlet Allocation
4	Design And Implementation Of Content-based Webpage Collection And Classification System
5	Design And Implementation Of Finance News Classification System Based On Labeled-LDA
6	Text Classification Research Based On Support Vector Machine
7	Research On Image Classification Based On Support Vector Machine
8	Research On Support Vector Machine Classification Algorithm For Multi-class Texts
9	Research On Feature Selection And Multi-Class Classification Methods Based On Twin Support Vector Machine
10	Public Opinion Events Active Detection Research On Microblogging