
Research On Classification Of Network News Text Based On T-distribution Mixture Model

Posted on: 2024-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: J B Niu
Full Text: PDF
GTID: 2568307115453664
Subject: Applied Statistics
Abstract/Summary:
With the development of science and technology, Chinese text classification has become an active research topic. Many text classification algorithms have been proposed, but they still suffer from low classification accuracy and from accuracy that is unbalanced across categories. As in other areas of machine learning, a text classifier learns the characteristics of the data from a training set and then classifies the texts in a test set, so the quality of the training data is crucial to classification performance. Real-world data are often noisy, and text data are no exception; noise in the data set is one reason for the low accuracy of existing text classification algorithms. In natural language processing, text representation is the step that converts documents into numerical form. Most network news contains problems such as disordered words, misspelled characters, and repeated or missing characters, so the collected text introduces noise when documents are digitized. To improve classification accuracy, this thesis proposes a text classification method that combines the T-distribution mixture model (TMM) with the LDA topic model. The TMM is adopted as the classifier because, compared with the Gaussian mixture model and machine learning text classification methods, its heavy tails make it more robust to noisy data; the LDA topic model is used for dimension reduction. The specific research work is as follows:
(1) The pre-processed text data set is digitized with the TF-IDF method. The resulting text feature matrix is high-dimensional and sparse, so its dimension is reduced to improve the efficiency of the classifier.
(2) Three different dimension reduction methods are combined with the TMM, and the classification accuracy and algorithm efficiency of the combinations are compared. Dimension reduction based on the LDA topic model is found to be the most suitable match for the TMM.
(3) The parameters of the TMM are solved with the EM algorithm, with the K-Means algorithm used to improve the efficiency of parameter solving; on this basis, a new text classification method based on the mixture model is constructed.
(4) The effectiveness of the proposed method is verified by comparing its classification results with those of machine learning classification methods and the Gaussian mixture model on data sets of different scales.
The thesis presents new ideas and methods for the study of network news text classification.
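The parameter-solving step in item (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`kmeans`, `fit_tmm`, `predict`), the fixed degrees of freedom `nu`, the diagonal scale matrices, and the use of K-Means centroids as EM starting points are all assumptions made for illustration. The input `X` stands in for the reduced feature vectors (for example, LDA topic proportions), and only the Python standard library is used.

```python
import math

def sq_dist(x, mu, scale=None):
    """Squared distance, optionally weighted by per-dimension scales."""
    if scale is None:
        scale = [1.0] * len(x)
    return sum((a - b) ** 2 / s for a, b, s in zip(x, mu, scale))

def kmeans(X, k, iters=10):
    """Plain Lloyd iterations; the centroids are used only to start EM."""
    # Simplification: start from evenly spaced rows instead of random ones.
    centers = [list(X[i * len(X) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            nearest = min(range(k), key=lambda j: sq_dist(x, centers[j]))
            groups[nearest].append(x)
        for j, g in enumerate(groups):
            if g:  # keep the previous centre if a cluster is empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def log_t_pdf(x, mu, scale, nu):
    """Log density of a multivariate t with a diagonal scale matrix."""
    d = len(x)
    delta = sq_dist(x, mu, scale)
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * d * math.log(nu * math.pi)
            - 0.5 * sum(math.log(s) for s in scale)
            - 0.5 * (nu + d) * math.log1p(delta / nu))

def fit_tmm(X, k, nu=3.0, iters=50, eps=1e-6):
    """EM for a k-component t mixture (assumed: fixed nu, diagonal scales)."""
    n, d = len(X), len(X[0])
    mus = kmeans(X, k)                      # K-Means starting means
    scales = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities R and per-point weights U.
        # U shrinks towards 0 for points far from a component (outliers).
        R, U = [], []
        for x in X:
            logp = [math.log(weights[j]) + log_t_pdf(x, mus[j], scales[j], nu)
                    for j in range(k)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            R.append([v / total for v in p])
            U.append([(nu + d) / (nu + sq_dist(x, mus[j], scales[j]))
                      for j in range(k)])
        # M-step: eps guards against division by zero and log(0).
        for j in range(k):
            rj = sum(R[i][j] for i in range(n)) + eps
            ru = [R[i][j] * U[i][j] for i in range(n)]
            ru_sum = sum(ru) + eps
            weights[j] = rj / (n + k * eps)
            mus[j] = [sum(w * x[c] for w, x in zip(ru, X)) / ru_sum
                      for c in range(d)]
            scales[j] = [sum(w * (x[c] - mus[j][c]) ** 2
                             for w, x in zip(ru, X)) / rj + eps
                         for c in range(d)]
    return weights, mus, scales

def predict(X, weights, mus, scales, nu=3.0):
    """Assign each point to its most probable mixture component."""
    return [max(range(len(weights)),
                key=lambda j: math.log(weights[j])
                + log_t_pdf(x, mus[j], scales[j], nu))
            for x in X]
```

The per-point weight `U` is what gives the model its resistance to noise: a point far from a component contributes little to that component's mean and scale, whereas in a Gaussian mixture every point assigned to a component counts fully. How the thesis maps fitted components to news categories is not stated in the abstract, so `predict` here simply returns component indices.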
Keywords/Search Tags:news text classification, T-distribution mixture model, EM algorithm, LDA topic model