
Chinese Text Classification Method Based On Improved Topic Model

Posted on: 2019-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2428330563992457
Subject: Electronic Science and Technology
Abstract/Summary:
The rapid development of Internet technology has greatly increased the amount of data online, and extracting useful information from this data has drawn growing attention. Text classification is currently a key technique for managing such data efficiently. However, most of this data is unstructured, high-dimensional, and sparse, which makes text classification difficult. How to reduce dimensionality and improve classifier performance is therefore a significant research topic in natural language processing. The main work is as follows:

A PSC-LDA model based on parts of speech and their combinations is proposed. The model takes into account that different parts of speech contribute differently to semantic expression in Chinese. The whole text set is divided into four parts, including a noun set, a noun-verb combination set, and an other-words set that combines adjectives and adverbs. The PSC-LDA model is created by building a model on each of the four data sets, with parameters estimated indirectly by the Gibbs sampling algorithm, which yields the text-topic mixed probability distribution of each data set. Based on the text classification corpus provided by Li Ronglu of Fudan University, the optimal word set and the optimal number of topics for the PSC-LDA model are determined experimentally. The results show that, compared with the standard LDA model, the PSC-LDA model reduces modeling time by 39.44 percent and the dimension of the training data required for modeling by 37.74 percent.

A PSC-LDASVM method, a multi-class text classification method based on the PSC-LDA model and the SVM algorithm, is proposed. The method can effectively extract latent topic information from large-scale text data, represent features, and reduce dimensionality. Additionally, it can address the problems of linear inseparability and local optima. The PSC-LDASVM method is compared with the PSC-LDAKNN, LDASVM, and VSMSVM methods on text classification performance. Its macro precision is higher than that of the other three methods by 4.6 percent, 4.3 percent, and 5.3 percent, respectively. Its macro recall is higher by 4.9 percent, 5.5 percent, and 7.1 percent, respectively, and its macro-F1 is higher by 4.9 percent, 5.1 percent, and 6.5 percent, respectively.
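The macro-averaged scores reported above weight every class equally, regardless of how many documents each class contains. The following is a minimal sketch of how such scores can be computed, not the thesis's evaluation code. The function name `macro_scores` is hypothetical, and the sketch assumes macro-F1 is the mean of the per-class F1 values; the thesis may instead compute F1 from the macro precision and macro recall.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1).

    Assumption: macro F1 is the unweighted mean of per-class F1 scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # A class with no predictions (or no true documents) scores 0.0.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Unweighted mean over classes: each class counts equally.
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy labels for illustration only; these are not from the Fudan corpus.
    truth = ["sports", "sports", "economy", "economy"]
    predicted = ["sports", "economy", "economy", "economy"]
    print(macro_scores(truth, predicted))
```

Because each class contributes equally to the average, a classifier that performs poorly on small categories is penalized more under macro averaging than under micro averaging.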
Keywords/Search Tags: Data mining, Multi-class classification, Feature extraction, Latent Dirichlet allocation, Support vector machine