
Research Of Chinese Short Text Clustering Method Based On Improved CHIR-TCFS Algorithm

Posted on: 2019-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Wang
GTID: 2428330548976375
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, China has entered the information and digital era. Text is an important carrier of information, and text mining plays an increasingly important role in many fields. Clustering is one of the most important and common methods in text mining. Intuitively, its purpose is to divide a sample set into several categories according to the degree of similarity among samples: samples with high similarity are grouped into the same category, and samples with low similarity are placed in different categories. Clustering is unsupervised and needs no manually labeled training samples, so it is simpler and more intelligent. It is widely used in text processing, pattern recognition, and other fields, and has attracted growing attention from researchers.

In recent years, short texts have appeared in many settings, such as instant messaging, microblog comments, and business complaints. Short texts are highly condensed, contain few words, cover a wide range of domains, take various forms, and occur in large quantities. Traditional text clustering methods do not achieve satisfactory results on short texts, so short text clustering has gradually become a major challenge in text mining.

Some methods for clustering short texts already exist. Among them, the CHIR-TCFS (CHIR-Text Clustering with Feature Selection) algorithm is a clustering method that selects features using the chi-square test. The algorithm removes the need for supervision during feature selection and achieves good clustering results.

First, to address the deficiencies in the feature selection of this clustering algorithm, this paper improves the feature selection method based on information gain and the feature selection method based on TF-IDF, and proposes a first improved CHIR-TCFS algorithm. The improved algorithm raises the accuracy of feature selection and also remedies the original algorithm's inability to deal effectively with unbalanced data sets. In addition, to address the original algorithm's weakness in choosing initial cluster centers, this paper presents a method that distributes the initial cluster centers evenly based on the sample points. This solves the problem of the original cluster centers lying too close together and improves the accuracy of the samples' initial category labels. The effectiveness of the method is then verified through experiments.

Finally, building on the first improved CHIR-TCFS algorithm, and in order to improve the efficiency of the clustering algorithm so that it still clusters well on larger text sets, we propose a method for computing a similarity index that excludes features with weak characterization ability. We also make full use of the features with stronger characterization ability in the texts, and propose a fast convergence clustering algorithm in which each cluster center approaches its final, actual cluster center more quickly. The comparative experiments designed in this paper show that the fast convergence clustering algorithm has higher efficiency and a better clustering effect.
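As a rough illustration of the chi-square feature selection that CHIR-TCFS builds on, the sketch below scores each term against each cluster with the standard 2x2 chi-square statistic and keeps the highest-scoring terms. It is not the thesis's improved algorithm, and it does not reproduce the CHIR variant itself. The function names `chi_square` and `rank_terms`, the toy documents, and the idea of feeding in current cluster assignments as labels are all assumptions made for this example.

```python
def chi_square(term, label, docs, labels):
    """Plain chi-square statistic between a term's presence and a cluster.

    docs   : list of token lists (one per short text)
    labels : cluster label per document (assumed here to be the current
             cluster assignment, not a human annotation)
    """
    n = len(docs)
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for tokens, doc_label in zip(docs, labels):
        has_term = term in tokens
        in_cluster = doc_label == label
        if has_term and in_cluster:
            a += 1  # term present, document in cluster
        elif has_term:
            b += 1  # term present, document outside cluster
        elif in_cluster:
            c += 1  # term absent, document in cluster
        else:
            d += 1  # term absent, document outside cluster
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # term or cluster never (or always) occurs
    return n * (a * d - b * c) ** 2 / denom


def rank_terms(docs, labels, top_k=3):
    """Rank vocabulary terms by their best chi-square score over all clusters."""
    vocab = sorted({t for tokens in docs for t in tokens})
    clusters = sorted(set(labels))
    scores = {
        t: max(chi_square(t, c, docs, labels) for c in clusters) for t in vocab
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical toy data: four tokenized short texts in two clusters.
docs = [
    ["refund", "late"],
    ["refund", "broken"],
    ["login", "error"],
    ["login", "password"],
]
labels = [0, 0, 1, 1]
print(rank_terms(docs, labels, top_k=2))
```

In an unsupervised setting, a loop of this kind would presumably alternate between clustering and re-scoring terms against the resulting assignments. How the thesis combines this with information gain and TF-IDF is not specified in the abstract.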
Keywords/Search Tags: short text, clustering, CHIR-TCFS, unbalanced dataset, information gain, TF-IDF