Research On Text Classification Method Based On FCM Clustering

Posted on: 2020-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: G B D Cai
GTID: 2438330590957907
Subject: Statistics
Abstract/Summary:
Cluster analysis is a major branch of multivariate statistical analysis that studies how “birds of a feather flock together.” It developed from taxonomy (or numerical taxonomy), a basic science through which humans understand the world. With the rapid development of information and computer science, the era of big data has arrived, and cluster analysis has become highly computerized and intelligent. Combined with data mining technologies, it automatically gathers complex data into groups without classification criteria being fixed in advance, and it is therefore widely used in text categorization, machine learning, pattern recognition, image analysis, and other fields.

Clustering has two branches: hard clustering and soft clustering. In hard clustering, clusters are well defined and have clear boundaries. Soft clustering, also called fuzzy clustering, can be seen as an extension of hard clustering. Fuzzy C-means (FCM) clustering is the most widely used soft clustering method. The quality of an FCM result depends mainly on the number of clusters and can be evaluated with a validity measure indicator. The validity indicator is therefore commonly used as one criterion for determining the number of clusters, and it plays an important part in research on improving the FCM algorithm. Text categorization is one application field of FCM; improvement research in this field focuses on achieving better classification by improving the performance of the FCM algorithm rather than purely on analyzing the text data itself. This thesis is motivated by these two considerations and makes two contributions.

First, this thesis proposes a new validity measure indicator, XB+, which incorporates more inter-class separation information than the XB index and so supports a better choice of the number of clusters. A simulation compares XB+ with four existing indicators by clustering five data sets of different data types. The results show that the new indicator performs better than the existing indicators.

Second, this thesis proposes an algorithm for text classification, called LDA-FCM, that combines the latent Dirichlet allocation (LDA) algorithm with the FCM algorithm. Text data normally consist of natural language, which a computer cannot directly understand or process. Text data usually need to be vectorized before analysis; however, vectorized text data are sparse and high-dimensional, and natural language itself is ambiguous both in word meaning (for example, polysemy and synonymy, where several words share one meaning) and in category. LDA-FCM reduces, to some extent, both the data dimension and the classification error caused by ambiguity. At the end of the thesis, an empirical study on Chinese text categorization using LDA-FCM is presented, and the classification performance of K-means clustering, FCM, and LDA-FCM is compared. The empirical study shows that LDA-FCM performs better than K-means and FCM.
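To illustrate how a validity indicator can guide the choice of cluster number, the following minimal Python sketch runs standard FCM and scores each candidate cluster number with the classical Xie-Beni (XB) index. It does not implement the thesis's XB+ indicator, whose formula is not given in this abstract. The function names `fcm` and `xie_beni`, the fuzzifier m = 2, the synthetic three-blob data, and the candidate range 2 to 6 are all assumptions made for the example.

```python
import numpy as np


def fcm(X, c, m=2.0, max_iter=200, tol=1e-6, seed=0):
    """Standard fuzzy C-means (illustrative). Returns (centers, U), U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each point sum to 1
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distance from every point to every center, shape (n, c).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Recompute centers so they match the final memberships.
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U


def xie_beni(X, centers, U, m=2.0):
    """Classical XB index: fuzzy compactness / (n * minimum center separation)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    sep = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)  # ignore the distance of a center to itself
    return compactness / (X.shape[0] * sep.min())


# Hypothetical test data: three well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(42)
means = np.array([[0.0, 0.0], [8.0, 1.0], [3.0, 9.0]])
X = np.vstack([mu + 0.6 * rng.standard_normal((60, 2)) for mu in means])

# Smaller XB is better: compact clusters whose centers lie far apart.
scores = {c: xie_beni(X, *fcm(X, c)) for c in range(2, 7)}
best_c = min(scores, key=scores.get)
print(scores, best_c)
```

The candidate with the smallest XB value is taken as the cluster number. An indicator such as XB+ would, by the abstract's description, replace the separation term in the denominator with one that carries more inter-class information.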
Keywords/Search Tags:Fuzzy C-Means Clustering, Validity Indicator, Latent Dirichlet Allocation, Text Classification