Research On Text Classification Method Based On FCM Clustering

Posted on:2020-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:G B D Cai

Full Text:PDF

GTID:2438330590957907

Subject:Statistics

Abstract/Summary:

Cluster analysis is a major branch of multivariate statistical analysis that studies how "birds of a feather flock together." It developed from taxonomy (or numerical taxonomy), a basic science through which humans understand the world. With the rapid development of information and computer science, the era of big data has arrived, giving rise to highly computerized and intelligent cluster analysis. Combined with data mining technologies, it automatically gathers complex data into groups without classification criteria being determined in advance, and is therefore widely used in text categorization, machine learning, pattern recognition, image analysis, and other fields.

There are two branches of clustering: hard clustering and soft clustering. The former is well defined and has clear boundaries. The latter, also called fuzzy clustering, can be seen as an extension of hard clustering. Fuzzy C-means (FCM) clustering is the most widely used soft clustering method. The quality of an FCM result depends mainly on the number of clusters and can be evaluated by a validity measure indicator. The validity indicator is therefore commonly treated as one factor in determining the number of clusters, and it plays an important part in research on improving the FCM algorithm. Text categorization is one application field of FCM; improvement research in this field focuses on achieving better classification by improving the performance of the FCM algorithm rather than on purely analyzing the text data itself.

This thesis is motivated by these two considerations and makes two contributions. First, it proposes a new validity measure indicator, XB+, which incorporates more inter-class separation information than the XB index and so supports a better choice of the number of clusters. A simulation compares XB+ with four existing indicators by clustering five data sets of different data types. The results show that the new indicator outperforms the existing indicators. Second, it proposes an algorithm called LDA-FCM, which combines latent Dirichlet allocation (LDA) with the FCM algorithm, to classify texts. Text data normally consists of natural language that a computer cannot directly understand or process. Text data usually needs to be vectorized before analysis; however, the vectorized data are sparse and high-dimensional, and natural language itself is ambiguous both in word meaning (for example polysemy, and synonymy, where several words share one meaning) and in category. With LDA-FCM, the data dimension and the classification error caused by ambiguity can be reduced to a certain extent. At the end of the thesis, an empirical study on Chinese text categorization using LDA-FCM is presented, and the classification performance of K-means clustering, FCM, and LDA-FCM is compared. The study shows that LDA-FCM performs better than K-means and FCM.
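
To make the role of a validity indicator concrete, the following is a minimal sketch of standard fuzzy C-means together with the classical Xie-Beni (XB) index, used here to pick the number of clusters on a toy data set. The abstract does not give the definition of XB+, so it is not reproduced; the function names (`fcm`, `xie_beni`, `sq_dist`), the toy points, the fuzzifier m = 2, and the candidate cluster numbers are all assumptions made for illustration, not the thesis's implementation.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means; returns (centers, memberships).

    memberships[k][i] is the degree to which points[k] belongs to cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ki ** m.
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            wsum = sum(w)
            centers.append([sum(w[k] * points[k][d] for k in range(n)) / wsum
                            for d in range(dim)])
        # Membership update from squared distances to the new centers.
        shift = 0.0
        for k in range(n):
            d2 = [sq_dist(points[k], centers[i]) for i in range(c)]
            zeros = [i for i in range(c) if d2[i] == 0.0]
            if zeros:
                # The point sits exactly on a center: full membership there.
                new_row = [1.0 / len(zeros) if i in zeros else 0.0
                           for i in range(c)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                total = sum(inv)
                new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[k])))
            u[k] = new_row
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u


def xie_beni(points, centers, u, m=2.0):
    """Xie-Beni index: fuzzy compactness over separation (lower is better)."""
    n, c = len(points), len(centers)
    compactness = sum(u[k][i] ** m * sq_dist(points[k], centers[i])
                      for k in range(n) for i in range(c))
    separation = min(sq_dist(centers[i], centers[j])
                     for i in range(c) for j in range(i + 1, c))
    if separation == 0.0:
        return math.inf  # coincident centers: worst possible separation
    return compactness / (n * separation)


# Toy data (assumed): two tight, well-separated groups in the plane.
blob = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
points = blob + [(x + 10.0, y + 10.0) for x, y in blob]

centers2, u2 = fcm(points, 2)
scores = {c: xie_beni(points, *fcm(points, c)) for c in (2, 3, 4)}
best_c = min(scores, key=scores.get)  # candidate with the smallest XB value
print(best_c, scores)
```

The index divides total fuzzy within-cluster scatter by the smallest squared distance between any two centers, so a candidate cluster number that splits a natural group is penalised through the small separation term. An indicator that uses more inter-class separation information, as the abstract describes for XB+, would change the denominator or add terms to it; the details would have to come from the full thesis.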

Keywords/Search Tags:

Fuzzy C-Means Clustering, Validity Indicator, Latent Dirichlet Allocation, Text Classification

Related items

1	Research On The Key Techniques Of Chinese Text Clustering
2	Classification Algorithm For Social Text Stream
3	Research On Rough Classification Of Academic Papers Based On Topic And Semantic Fingerprint Fusion
4	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow
5	Studies On New Fuzzy Clustering Algorithms And Clustering Validity Problems
6	Design And Implementation Of A Text Recommender System Of Social Network Based On Latent Dirichlet Allocation
7	Aurora Image Classification Based On Multi-Feature Latent Dirichlet Allocation
8	Research And Implementation Of Spark-based Text Classification
9	The Research And Implementation Of Text Classification Based On Meta-information And Optimization