Research On Improvement Of K-means Algorithm For Micro-blogging Information

Posted on:2018-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhao

Full Text:PDF

GTID:2348330542987344

Subject:Computer Science and Technology

Abstract/Summary:

Micro-blogging, a modern social medium that attracts wide attention and is widely used, has become part of daily life in many ways. Micro-blogs carry tens of thousands of pieces of information, including bloggers' homepage information, hot-topic information, and comments on posts, and this volume can be hard to navigate. How to find interesting and valuable information in this mass of micro-blog data has therefore become a subject of much research. Text clustering theory from the data mining field offers a good solution. K-means, an unsupervised clustering algorithm, is applied to text clustering because it is fast and gives clear clustering results. However, research shows that the algorithm has some obvious defects, mainly concerning the determination of the number of clusters, the choice of initial cluster centers, and the effect of isolated points on the clustering result. To address these problems, this thesis takes micro-blog information as its data set, improves the K-means algorithm after text preprocessing, and verifies the improvements through extensive experiments. The main contents of this thesis are as follows:

(1) Construction of the micro-blog data set. To obtain the micro-blog data set needed for this thesis, the author studied web-page information fetching techniques and related tools, successfully fetched thousands of records, and applied a series of preprocessing operations to them, including word segmentation, stop-word removal, feature selection, and vector representation.

(2) To address the randomness of initial center selection in the traditional K-means algorithm, this thesis constructs the distance matrix between the texts and its standard deviation, based on a systematic understanding of the basic principle of K-means. The first initial cluster center is selected by analyzing the standard deviation, and the remaining initial cluster centers are selected according to distance.

(3) After the first initial cluster center is chosen, following the principle that "the greater the distance, the lower the text similarity", the text object furthest from the first initial cluster center is taken as the second initial cluster center. The text object furthest from the first two cluster centers is then taken as the third initial cluster center, and so on until all the initial cluster centers have been selected.

(4) Based on the mutual information between feature words and categories, this thesis combines mutual information with the Euclidean distance and uses the result as the measure of similarity, which improves clustering accuracy to a certain extent.

Finally, the thesis summarizes its main contents, its improvements, and the experimental process, and discusses the development direction of the K-means algorithm and future research.
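The center-selection idea in points (2) to (4) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions because the abstract does not specify them: the first center is taken to be the text whose distances to the other texts have the smallest standard deviation, "furthest from the centers chosen so far" is read as the largest minimum distance to any chosen center, and mutual information is assumed to enter as per-feature weights on the squared differences. The names `weighted_euclidean`, `distance_matrix`, and `select_initial_centers` are hypothetical.

```python
import math
import statistics


def weighted_euclidean(a, b, weights=None):
    """Euclidean distance between two vectors.

    `weights` are optional per-feature weights (assumed here to be
    mutual-information scores); with no weights this is the plain distance.
    """
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def distance_matrix(vectors, weights=None):
    """Full pairwise distance matrix as a list of rows."""
    n = len(vectors)
    return [
        [weighted_euclidean(vectors[i], vectors[j], weights) for j in range(n)]
        for i in range(n)
    ]


def select_initial_centers(dist, k):
    """Return the indices of k initial centers chosen from a distance matrix."""
    n = len(dist)
    if n < 2 or not 1 <= k <= n:
        raise ValueError("need at least 2 points and 1 <= k <= number of points")

    # Assumption: the first center is the row whose distances to all other
    # points have the smallest (population) standard deviation.
    spreads = [
        statistics.pstdev([dist[i][j] for j in range(n) if j != i])
        for i in range(n)
    ]
    centers = [min(range(n), key=lambda i: spreads[i])]

    # Remaining centers: repeatedly take the point whose nearest chosen
    # center is furthest away (assumed reading of "furthest from the centers").
    while len(centers) < k:
        candidates = [i for i in range(n) if i not in centers]
        nxt = max(candidates, key=lambda i: min(dist[i][c] for c in centers))
        centers.append(nxt)
    return centers


if __name__ == "__main__":
    points = [[0.0], [1.0], [2.0], [10.0]]  # toy one-dimensional "texts"
    d = distance_matrix(points)
    print(select_initial_centers(d, 3))
```

The returned indices would then seed an ordinary K-means run in place of random initial centers. Note that under the smallest-standard-deviation assumption an isolated point can be picked first; a different reading of the standard-deviation criterion would change only the single line that selects the first center.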

Keywords/Search Tags:

Micro-blog information, K-means, initial clustering center, mutual information, Euclidean distance

Related items

1	Improvement And Application Of K-means Algorithm
2	Research On Text Clustering Based On Division And Hierarchy
3	Research And Application Of Topic Detection On Micro-blog
4	The Research And Implementation Of Text Clustering Based On The Platform Of Micro-blog
5	Precise Clustering Algorithm For Chinese Text Based On K-means
6	Research On The Selection Of Initial Cluster Centers In K-means Algorithm
7	Research On Problems Related To The Initial Center Selection In K-means Clustering Algorithm
8	Research On Text Clustering And Its Application In Topic Detection Analysis
9	Research Of Image Segmentation Algorithms Based On FCM Clustering
10	The Study And Development Of Hierarchical-K-means-Based Clustering Algorithm