Research On Classification Of Network News Text Based On T-distribution Mixture Model

Posted on:2024-03-13

Degree:Master

Type:Thesis

Country:China

Candidate:J B Niu

GTID:2568307115453664

Subject:Applied Statistics

Abstract/Summary:

With the development of science and technology, Chinese text classification has become an active research topic. Many text classification algorithms have been proposed, but they still suffer from problems such as low classification accuracy and accuracy that is unbalanced across categories. As in other areas of machine learning, a text classification algorithm learns the characteristics of the data from a training set and then classifies the texts in a test set, so the quality of the training data is crucial to classifier performance. Many real-world data sets contain noise, and text data is no exception; noise in the data set is one reason for the low accuracy of existing text classification algorithms. In natural language processing, text representation is an important step that converts documents into numerical form. Most network news, however, contains problems such as disordered words, misspelled characters, and repeated or missing characters, so the collected text data introduces noise when documents are converted into numerical form.

To improve the classification accuracy on text data, this thesis proposes a text classification method that combines the T-distribution mixture model (TMM) with the LDA topic model. The TMM is adopted as the classification method in preference to the Gaussian mixture model and machine learning text classification methods because its heavy tails make it more resistant to noisy data, while the LDA topic model is used for dimension reduction. The specific research work is as follows:

(1) The preprocessed text data set is converted into numerical form with the TF-IDF method. The resulting text feature matrix is high-dimensional and sparse, so its dimension is reduced to improve the efficiency of the classifier.
(2) Three different dimension reduction methods are combined with the TMM, and the accuracy of the classification results and the efficiency of the algorithm are compared. Dimension reduction based on the LDA topic model is found to be the more suitable combination with the TMM.
(3) The parameters of the TMM are solved with the EM algorithm, with the K-Means algorithm used to improve the efficiency of parameter solving, and on this basis a new text classification method based on the mixture model is constructed.
(4) The effectiveness of the proposed method is verified by comparing its classification results with those of machine learning classification methods and the Gaussian mixture model on data sets of different scales.

The work presents new ideas and methods for the study of network news text classification.
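The noise resistance attributed to the heavy tails above can be illustrated with a small sketch. This is not the thesis's implementation: it fits a single one-dimensional Student-t component by EM, with the degrees of freedom `nu` held fixed at 3 as a simplifying assumption, on made-up toy data. The function name `fit_student_t` is hypothetical. In a full mixture model the same per-point weight would additionally be multiplied by each component's responsibility.

```python
import statistics


def fit_student_t(xs, nu=3.0, iters=200):
    """EM for the location and scale of a 1-D Student-t with fixed nu.

    Assumption: nu (degrees of freedom) is fixed rather than estimated.
    Returns (mu, var, weights); the weights are those computed in the
    last E-step.
    """
    mu = statistics.fmean(xs)
    var = max(statistics.pvariance(xs, mu), 1e-12)
    weights = [1.0] * len(xs)
    for _ in range(iters):
        # E-step: a point far from mu (large scaled squared distance)
        # receives a small weight, which limits its influence.
        weights = [(nu + 1.0) / (nu + (x - mu) ** 2 / var) for x in xs]
        # M-step: weighted mean and weighted scale update.
        mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        var = max(
            sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / len(xs),
            1e-12,
        )
    return mu, var, weights


# Toy data: five points around 0 plus one gross outlier (the "noise").
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 100.0]

gaussian_mean = statistics.fmean(data)  # Gaussian MLE of the mean
t_mu, t_var, t_weights = fit_student_t(data)

print("Gaussian mean:", gaussian_mean)
print("Student-t location:", t_mu)
print("weight of the outlier:", t_weights[-1])
```

On this toy data the Gaussian mean is pulled to 100/6, about 16.7, by the single outlier, while the Student-t location stays close to the centre of the five clean points because the outlier's weight shrinks towards zero.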

Keywords/Search Tags:

news text classification, T-distribution mixture model, EM algorithm, LDA topic model

Related items

1	Research And Application Of Text Classification Model Combining Character Features And Topic Features
2	Research Of Method Based On The Topic Model On News Headlines Classification
3	A Text Classification Algorithm Based On Statistical Manifold Learning
4	Research On Semi-supervised Topic Model For Text Classification
5	Text Classification Algorithm Based On Chinese And English Topic Space
6	Study On Text Classification Based On Finite Mixture Model
7	The Study Of Short Text Classification Based On Ada Boost-GASVM Algorithm And LDA Topic Model
8	Technology Research Of Forestry Information Text Classification Based On Gauss Mixture Model
9	Research On The Text Classification Method Based On Correlated Topic Model
10	Research On Classificational Model Of Text Sentiment Based On Topic