
Research On Clustering Of Search Results

Posted on: 2019-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Liu
GTID: 2428330596960812
Subject: Control theory and control engineering
Abstract/Summary:
With the spread and development of Internet technology, the volume of online information is growing exponentially. Faced with this mass of information, search engines have become an important tool for finding what people need. However, a traditional search engine arranges its results in a linear order, so users cannot locate the information they are interested in quickly and accurately. Clustering the search results with a clustering algorithm can therefore help users find what they want rapidly.

This thesis studies several text clustering techniques in depth, including Chinese word segmentation, feature selection, weight calculation, and similarity measures. On this basis, it analyzes the relationship between search results clustering and text clustering. Based on the characteristics of search results clustering, it proposes a search results clustering method built on an improved K-Means algorithm. The partitioning-based K-Means algorithm is a widely applied dynamic clustering algorithm with the advantages of simple implementation and fast convergence. However, it has several weaknesses that make it ill-suited to clustering search results: the number of clusters must be set manually, the initial points are generated randomly, it cannot generate cluster tags, and it cannot perform soft clustering.

To address these shortcomings, this thesis makes the following improvements. First, a density-based max-min distance method is used to find the initial points; then, according to the average similarity of all texts, parameters are set as the termination condition, and the number of clusters is thereby determined. Second, the improved algorithm introduces the notion of neighbors and obtains the neighbors of the initial points. New initial points are then calculated from these neighbors, and outliers among the initial points are excluded. Finally, the feature words in each cluster are filtered, the TF-IDF algorithm is used to calculate cluster-based weights for the feature words, and the cluster tags are selected according to these weights. When calculating the weight of a feature word, the TF-IDF algorithm considers only word frequency and ignores the influence of part of speech (POS) and word length. Therefore, a POS factor and a length factor are introduced into the TF-IDF algorithm.

In the experiments, the Nutch search engine is used to obtain search results, and the text set to be clustered is extracted from them by parsing with Jsoup. Clustering experiments based on the improved K-Means algorithm show that it yields better clustering results than the original K-Means. The clustering results of the TF-IDF algorithm with the length and POS factors are also compared with those of the original TF-IDF algorithm. The results show that the length factor and the POS factor have a positive effect on the clustering results.
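The abstract does not give the seeding procedure in detail, so the following is only a minimal sketch of how max-min distance seeding with a similarity-based stopping rule might look. Several choices are assumptions made for illustration and are not taken from the thesis: cosine distance as the dissimilarity measure, the point with the smallest total distance to all others as a stand-in for the "densest" first seed, and a hypothetical parameter `theta` that stops seeding once the farthest remaining point is closer than `theta` times the mean pairwise distance. The number of seeds returned then serves as the number of clusters.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def max_min_seeds(vectors, theta=0.5):
    """Pick initial centers by max-min distance and return their indices.

    Assumptions (not from the thesis): the first seed is the point with the
    smallest total distance to all others, and seeding stops when the best
    candidate is closer than theta * (mean pairwise distance) to a seed.
    """
    n = len(vectors)
    if n == 0:
        return []
    if n == 1:
        return [0]

    # Full pairwise distance matrix; fine for a few hundred search results.
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    pair_count = n * (n - 1) / 2
    mean_pairwise = sum(dist[i][j] for i in range(n)
                        for j in range(i + 1, n)) / pair_count

    # First seed: the point sitting in the densest region (density proxy).
    first = min(range(n), key=lambda i: sum(dist[i]))
    seeds = [first]
    nearest = dist[first][:]  # distance from each point to its nearest seed

    while len(seeds) < n:
        # Candidate: the point farthest from all seeds chosen so far.
        candidate = max(range(n), key=lambda i: nearest[i])
        if nearest[candidate] == 0.0 or nearest[candidate] < theta * mean_pairwise:
            break  # every remaining point is already close to some seed
        seeds.append(candidate)
        nearest = [min(nearest[i], dist[candidate][i]) for i in range(n)]
    return seeds
```

For example, calling `max_min_seeds` on four two-dimensional vectors that form two tight groups returns two indices, one from each group, so K-Means would start with two clusters without the user supplying the cluster count.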
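The cluster-tag weighting can be illustrated in a similar spirit. The abstract states only that a POS factor and a length factor are introduced into TF-IDF; the multiplicative form below, the factor values in `POS_FACTOR`, the capped `length_factor`, and the smoothed IDF formula are all hypothetical choices for this sketch, not the thesis's actual formula. Tokens are assumed to arrive as `(word, pos)` pairs from a word segmenter.

```python
import math
from collections import Counter

# Hypothetical POS factors: nouns weigh most, verbs less, everything else least.
POS_FACTOR = {"n": 1.0, "v": 0.6}
DEFAULT_POS_FACTOR = 0.3


def length_factor(word, cap=4):
    """Hypothetical length factor: grows with word length, capped at `cap`."""
    return min(len(word), cap) / cap


def cluster_tags(clusters, top_n=3):
    """Rank candidate tags per cluster.

    clusters: list of clusters, each a list of (word, pos) tokens.
    Score (assumed form): tf * idf * POS factor * length factor, where tf is
    the word's share of the cluster's tokens and idf is computed over clusters.
    """
    k = len(clusters)
    counts = [Counter(word for word, _ in cluster) for cluster in clusters]

    # Remember the first POS tag seen for each word.
    pos_of = {}
    for cluster in clusters:
        for word, pos in cluster:
            pos_of.setdefault(word, pos)

    # Cluster frequency: in how many clusters each word appears.
    cluster_freq = Counter()
    for count in counts:
        cluster_freq.update(count.keys())

    tags = []
    for count in counts:
        total = sum(count.values())
        scores = {}
        for word, freq in count.items():
            tf = freq / total
            # Smoothed IDF so that words present in every cluster keep a
            # positive weight instead of dropping to zero.
            idf = math.log((1 + k) / (1 + cluster_freq[word])) + 1.0
            pos_weight = POS_FACTOR.get(pos_of[word], DEFAULT_POS_FACTOR)
            scores[word] = tf * idf * pos_weight * length_factor(word)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags.append(ranked[:top_n])
    return tags
```

With two small clusters such as one dominated by the noun "python" and one by the noun "football", both sharing the verb "run", the shared verb is pushed down by its lower IDF, POS factor, and length factor, and the two nouns come out as the top tags.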
Keywords/Search Tags: Search Results, K-Means Algorithm, The Initial Point, Density, Neighbor