Improvement Of K-means Algorithm And Its Application In The Text Data Cluster

Posted on: 2017-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Wang
GTID: 2348330509463450
Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of computer technology, and especially with the recent adoption of "Internet +" projects and cloud platforms across industries, data of all kinds is growing quickly, and a great deal of information lies behind it. Traditional retrieval and analysis methods can no longer meet the need to extract useful information, and older management models do not suit today's data management. In this setting, data mining has become one of the most useful ways to obtain important information quickly. Cluster analysis, a typical descriptive statistical method in unsupervised machine learning, has attracted wide attention. K-means is a partition-based dynamic clustering algorithm. Because it is simple and easy to apply, it has been widely used, but it also has weaknesses: it is sensitive to outliers and works well only for spherical or near-spherical clusters. In addition, its running time and clustering quality depend heavily on the choice of initial points, and there is currently no unified approach to handling outliers or selecting initial points. To address these problems, this thesis completes the following work:

(1) Considering the effect of outliers in experiments, and drawing on the statistical definitions of the standard score and the standard deviation, this thesis proposes a method that uses the standard score and standard deviation to remove strong outliers, since differences between data points are reduced after this standardization. Then, because the K-center algorithm is more robust than K-means, the thesis proposes changing the principle of the traditional distance-product method. Further, because the standard score and standard deviation are commonly used to measure the dispersion of data points, it proposes replacing the point of highest density with the point that has the minimum deviation score. Finally, to improve the applicability of the algorithm, it introduces the outer-class distance and improves the clustering criterion. Simulations on the Iris, Wine, Balance-Scale, and Glass data sets from the UCI database verify the feasibility of the algorithm.

(2) On the application side, under the influence of "Internet +", text data has grown explosively in recent years, and text data, which appears throughout daily life, carries a large amount of information. This thesis therefore focuses on text clustering. Based on the specific characteristics of text data and on existing results in text clustering, it applies the idea of using the standard deviation to delete outliers and select initial points. Experiments show that the improved algorithm performs somewhat better than traditional K-means.
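The outlier-removal idea in item (1) can be illustrated with a minimal sketch. It is not the thesis's actual procedure: the abstract does not give the cutoff or the exact rule, so the threshold `z_threshold=3.0`, the per-feature test, the function names `remove_strong_outliers` and `kmeans`, and the random initialization are all assumptions made here for illustration. The clustering step is plain K-means (Lloyd's iteration), not the improved initial-point selection the thesis describes.

```python
import math
import random
from statistics import mean, pstdev


def remove_strong_outliers(points, z_threshold=3.0):
    """Keep points whose standard score is within z_threshold on every feature.

    The threshold of 3.0 is an assumed, conventional choice; the abstract
    does not state the cutoff used in the thesis.
    """
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    # A constant feature has zero deviation; use 1.0 to avoid dividing by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]

    def is_inlier(p):
        return all(abs(x - m) / s <= z_threshold
                   for x, m, s in zip(p, mus, sigmas))

    return [p for p in points if is_inlier(p)]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means with random initial centers (a baseline, not the improved method)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(mean(c) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, labels
```

Filtering before clustering keeps a single extreme point from pulling a centroid away from its cluster, which is the sensitivity to outliers the abstract refers to. The data is passed as a list of equal-length tuples.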
Keywords/Search Tags: K-means, Standard Score, Standard Deviation, Initial Point, Text Clustering