Research And Its Application Of Web Short Text Clustering Method Based On K-Means Algorithm

Posted on: 2017-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: L S Zhang
GTID: 2348330491957956
Subject: Computer technology
Abstract/Summary:
The Internet has become the primary platform for obtaining information and communicating with others. As networks have developed and spread, short web texts such as comments have flooded the online environment. These short texts carry people's views and opinions on news, online shopping, and other subjects, and mining them is a convenient and comprehensive way to gauge public opinion. Short web texts appear in large numbers and are disorganized and random, so text clustering is needed to extract useful information from this mass of short texts. Because traditional text clustering methods are suited only to ordinary long texts and cannot cluster short web texts effectively, the existing clustering methods must be improved to handle short text. Taking the characteristics of web short texts into account, this thesis proposes a web short text clustering method based on the K-means algorithm, combined with the Bootstrapping algorithm, in a cloud computing environment. The details are as follows:

(1) The thesis motivates the study of web short text clustering by describing the current state and trends of text clustering research in China and abroad. It explains the technologies related to text clustering, discusses current methods for extracting text feature items, introduces the advantages of the Bootstrapping algorithm, and proposes a method that uses Bootstrapping to extract text feature items from network information. Based on a study of text feature item representation, it improves the common TFIDF weighting formula and uses the improved formula to weight web short texts so that text features are highlighted. Based on a study of the classical K-means clustering algorithm, it proposes a web short text clustering method based on K-means that addresses the shortcomings of K-means in dealing with web short texts.

(2) The thesis introduces the advantages of cloud computing, describes the main idea and advantages of the proposed method on the Hadoop platform, and verifies the accuracy and efficiency of the method through experiments.

The main innovations of this thesis are as follows:

(1) The Bootstrapping algorithm is used to extract text feature items from network information. This avoids the limitations of manual attribute selection, so the attributes of the selected phrases are more comprehensive and representative.

(2) The TFIDF formula is improved for weighting text feature items. The improved formula computes weights according to the characteristics of web short texts; by highlighting text features, it improves the accuracy of the clustering results.

(3) In view of the characteristics of web short texts, the way the K-means algorithm selects its initial cluster centers is modified to make the clustering results more accurate.

(4) By combining the method with cloud computing and running it on a Hadoop cluster, the method meets the demands of clustering large amounts of data and improves operating efficiency.
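
The abstract does not give the improved TFIDF formula or the modified rule for choosing initial cluster centers, so the following is only a minimal single-machine sketch of the general pipeline it describes: weight short texts, pick initial centers deterministically, then run K-means. It assumes standard smoothed TF-IDF in place of the thesis's improved formula and a farthest-first seeding heuristic in place of the thesis's center selection. The function names, the sample comments, and the whitespace tokenization are all hypothetical, and the Bootstrapping feature extraction and Hadoop steps are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Standard smoothed TF-IDF (an assumption, NOT the thesis's improved formula).

    docs is a list of non-empty token lists; returns L2-normalized vectors.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [(tf[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def cosine_distance(a, b):
    # Inputs are L2-normalized, so cosine similarity is a plain dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def farthest_first_centers(vectors, k):
    """Stand-in seeding heuristic (an assumption, not the thesis's rule):
    start from the first vector, then repeatedly add the vector that is
    farthest from every center chosen so far."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(
            vectors,
            key=lambda v: min(cosine_distance(v, c) for c in centers)))
    return centers


def kmeans(vectors, k, iterations=20):
    """Plain K-means on normalized vectors using cosine distance."""
    centers = farthest_first_centers(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest center.
        labels = [min(range(k), key=lambda j: cosine_distance(v, centers[j]))
                  for v in vectors]
        # Update step: each center becomes the normalized mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                new_centers.append(centers[j])  # keep an empty cluster's center
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            new_centers.append([x / norm for x in mean])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Hypothetical sample of short web comments, tokenized on whitespace.
comments = [
    "fast delivery good price",
    "good price fast shipping",
    "delivery fast price good",
    "boring plot weak acting",
    "weak plot boring movie",
    "acting weak movie boring",
]
vectors = tfidf_vectors([c.split() for c in comments])
labels = kmeans(vectors, k=2)
print(labels)
```

Because the two groups of sample comments share no vocabulary, the shopping comments and the movie comments end up in separate clusters. Real short texts overlap far more and are much sparser, which is the difficulty the improved weighting and center selection in the thesis are meant to address.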
Keywords: Text clustering, Web short text, K-means algorithm, Cloud computing