Research On Improvement Of K-means Algorithm For Micro-blogging Information

Posted on: 2018-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhao
GTID: 2348330542987344
Subject: Computer Science and Technology
Abstract/Summary:
Micro-blogging, a social medium that is widely followed and widely used, has become part of daily life in many ways. A micro-blog platform carries tens of thousands of pieces of information, including bloggers' homepage information, hot-topic information, and comments on posts, and this volume can be overwhelming. How to find interesting and valuable information in this mass of micro-blog data has therefore become an active research topic. Text clustering, a technique from the data mining field, offers a good solution. K-means, an unsupervised clustering algorithm, is commonly applied to text clustering because it is fast and clusters effectively. However, research shows that the algorithm has several obvious defects, chiefly the determination of the number of clusters, the choice of initial cluster centers, and the influence of isolated points on the clustering result. To address these problems, this thesis takes micro-blog information as the data set, preprocesses the text, improves the K-means algorithm, and verifies the improvements through extensive experiments. The main contents are as follows:
(1) Construction of the micro-blog data set. To obtain the micro-blog data needed for this work, the author studied web-page fetching techniques and related tools and fetched several thousand records. The data were then preprocessed through word segmentation, stop-word removal, feature selection, and vector representation to produce the data set used in the experiments.
(2) To address the randomness problem of the traditional K-means algorithm, this thesis constructs the distance matrix between the texts and computes its standard deviation, building on a systematic study of the basic principle of the algorithm. The first initial cluster center is selected by analyzing the standard deviation; the remaining initial cluster centers are selected according to distance.
(3) Once the first initial cluster center has been chosen, the remaining centers follow the principle that "the greater the distance, the lower the text similarity": the text object farthest from the first center becomes the second initial cluster center, the text object farthest from the first two centers becomes the third, and so on until all initial cluster centers have been selected.
(4) Based on the mutual information between feature words and categories, this thesis combines the mutual information of the texts with the Euclidean distance and uses the result as the similarity measure, which improves clustering accuracy to a certain extent.
Finally, the thesis summarizes its main contents, improvements, and experimental process, and discusses directions for the development of the K-means algorithm and for future research.
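The initial-center selection described in points (2) and (3) can be sketched roughly as follows. This is a minimal illustration and not the thesis's actual implementation. The abstract does not say how the standard deviation is analyzed, so the sketch assumes the first center is the text whose distances to all other texts have the smallest standard deviation. It also assumes that "farthest from the chosen centers" means the largest minimum distance to any chosen center. The function names `euclidean`, `distance_matrix`, and `select_initial_centers` are hypothetical.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    """Full pairwise distance matrix as a list of lists."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]


def select_initial_centers(vectors, k):
    """Return the indices of k initial cluster centers.

    ASSUMPTION (not stated in the abstract): the first center is the
    point whose distances to the other points have the smallest
    population standard deviation.
    ASSUMPTION: each later center is the point that maximizes its
    minimum distance to the centers chosen so far.
    """
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of vectors")
    dist = distance_matrix(vectors)

    def row_std(i):
        # Spread of the distances from point i to every other point.
        others = [dist[i][j] for j in range(n) if j != i]
        return statistics.pstdev(others) if others else 0.0

    centers = [min(range(n), key=row_std)]
    while len(centers) < k:
        candidates = [i for i in range(n) if i not in centers]
        # Farthest-first step: pick the candidate least similar to
        # (farthest from) the nearest already-chosen center.
        nxt = max(candidates, key=lambda i: min(dist[i][c] for c in centers))
        centers.append(nxt)
    return centers


if __name__ == "__main__":
    # Toy 2-D vectors standing in for text feature vectors.
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(select_initial_centers(points, 3))
```

The returned indices would then seed an ordinary K-means run in place of randomly chosen centers. Because the selection is deterministic, repeated runs on the same data start from the same centers.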
Keywords/Search Tags:Micro-blog information, K-means, initial clustering center, mutual information, Euclidean distance