Research On News Cluster Algorithm Based On Improved K-Means

Posted on: 2020-07-18  Degree: Master  Type: Thesis
Country: China  Candidate: M T Zhang
GTID: 2428330599460289  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of online media, news has become increasingly easy to obtain, but the result is a massive accumulation of data. How to extract latent information from such data and put it to use has become an important research area. Cluster analysis is one of the key methods of data mining and is widely applied in bioinformatics, finance, medicine, and other fields. This thesis focuses on news clustering and proposes a news clustering algorithm based on an improved K-Means. First, the concept of the TI value is proposed, which takes the structural features of news into account. The TI value builds on the TF-IDF value of each content word and incorporates the headline and the lead of the news article. It is used to extract the text feature vector, making the vector more representative and improving the clustering result. Second, to address the high time complexity and unstable clustering results of the maximum distance algorithm, that algorithm is optimized and combined with the TI value to form the TIM_K-Means algorithm. TIM_K-Means uses the TI value to construct the text feature vector and changes how distance is calculated in the maximum distance algorithm, which reduces its time complexity. In addition, isolated point detection is added to the initial center selection so that isolated points are removed during the iterative process, yielding more reasonable initial cluster centers. Third, because the algorithm takes a long time to process massive data, it is parallelized: the MapReduce programming model is used to adapt it to run on the Hadoop platform. Finally, with accuracy and variance as metrics, experiments on text datasets crawled from Tencent News verify the correctness and validity of the TI value and the TIM_K-Means algorithm. Further experiments on a Hadoop cluster built on Alibaba Cloud servers verify the feasibility of the parallelized TIM_K-Means algorithm, using speedup ratio and scalability as metrics.
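
The abstract does not give the TI formula or the details of the modified maximum distance step, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that TI multiplies a term's TF-IDF weight by a boost when the term also appears in the headline or the lead, and that a point counts as isolated when it has too few neighbors within a fixed radius. The names `ti_vector`, `title_boost`, `lead_boost`, `radius`, and `min_neighbors` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Plain TF-IDF for a list of tokenized documents (lists of terms)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors


def ti_vector(tfidf, headline, lead, title_boost=1.0, lead_boost=0.5):
    """Hypothetical TI weighting: boost TF-IDF for headline and lead terms."""
    head, ld = set(headline), set(lead)
    return {
        t: w * (1.0 + title_boost * (t in head) + lead_boost * (t in ld))
        for t, w in tfidf.items()
    }


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_isolated(i, points, radius, min_neighbors):
    """Assumed rule: isolated if fewer than min_neighbors lie within radius."""
    near = sum(
        1
        for j, p in enumerate(points)
        if j != i and euclidean(points[i], p) <= radius
    )
    return near < min_neighbors


def max_distance_centers(points, k, radius, min_neighbors=1):
    """Farthest-first initial centers, skipping isolated points."""
    candidates = [
        i for i in range(len(points))
        if not is_isolated(i, points, radius, min_neighbors)
    ]
    if len(candidates) < k:
        raise ValueError("not enough non-isolated points for k centers")
    centers = [candidates[0]]
    # Distance from each candidate to its nearest chosen center so far.
    min_dist = {i: euclidean(points[i], points[centers[0]]) for i in candidates}
    while len(centers) < k:
        nxt = max(candidates, key=lambda i: min_dist[i])
        centers.append(nxt)
        for i in candidates:
            min_dist[i] = min(min_dist[i], euclidean(points[i], points[nxt]))
    return [points[i] for i in centers]
```

For example, with the points `(0, 0)`, `(0, 1)`, `(10, 10)`, `(10, 11)`, and `(100, 100)`, `k=2`, and `radius=2.0`, the far-away point `(100, 100)` has no neighbors and is skipped, so it cannot be picked as an initial center even though it is the farthest point from the first one. The resulting centers would then seed an ordinary K-Means loop; the MapReduce parallelization described in the abstract is not shown here.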
Keywords/Search Tags: News cluster, TI value, TIM_K-Means, Parallelization