Research On Clustering Of Search Results

Posted on:2019-12-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Liu

GTID:2428330596960812

Subject:Control theory and control engineering

Abstract/Summary:

With the popularity and development of Internet technology, network information is growing exponentially. Confronted with this mass of information, search engines have become an important tool for retrieving it. However, a traditional search engine presents its results in a linear order, so users cannot locate the information they are interested in quickly and accurately. Clustering the search results with a clustering algorithm can therefore help users find what they want rapidly.

This thesis studies several text clustering techniques in depth, including Chinese word segmentation, feature selection, weight calculation, and similarity measurement. On this basis, the relationship between search results clustering and text clustering is analyzed. According to the characteristics of search results clustering, a clustering method for search results based on an improved K-Means algorithm is proposed.

The partition-based K-Means algorithm is a widely applied dynamic clustering algorithm with the advantages of simple implementation and fast convergence. However, it has several weaknesses that make it ill-suited to clustering search results: the number of clusters must be set manually, the initial points are generated randomly, it cannot generate cluster tags, and it cannot perform soft clustering. To address these shortcomings, the following improvements are made. First, a density-based max-min distance method is used to find the initial points; a parameter derived from the average similarity of all texts serves as the termination condition, which determines the number of clusters. Second, the improved algorithm introduces the notion of neighbors and obtains the neighbors of each initial point. New initial points are then calculated from the neighbors of the initial points, and the outliers among the initial points are excluded. Finally, the feature words in each cluster are filtered, the TF-IDF algorithm is used to calculate cluster-based weights for the feature words, and the cluster tags are selected according to these weights. When calculating the weight of feature words, the TF-IDF algorithm considers only word frequency and ignores the influence of part of speech (POS) and word length. Therefore, a POS factor and a length factor are introduced into the TF-IDF algorithm.

In the experiments, the Nutch search engine is used to obtain search results, and the text set to be clustered is extracted from them by parsing with Jsoup. Clustering experiments with the improved K-Means algorithm show that it produces better clustering results than the original K-Means algorithm. The clustering results of the TF-IDF algorithm with the length and POS factors are also compared with those of the original TF-IDF algorithm. The results show that the length factor and the POS factor have a positive effect on the clustering results.
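
The abstract does not give the exact formulas, so the following is only a minimal Python sketch of how the first improvement (density-based max-min selection of initial points, with the average similarity as the stopping threshold) might look. The function names `cosine` and `pick_initial_points`, the use of cosine similarity, and the definition of density as "number of texts at least as similar as the average" are all assumptions made for illustration, not the thesis's actual method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_initial_points(vectors):
    """Return indices of seed texts; the number of seeds is the cluster count.

    Assumed procedure (hypothetical, for illustration only):
      1. Start from the text with the highest density.
      2. Repeatedly add the text whose similarity to its closest seed is
         lowest (the max-min distance rule, expressed with similarities).
      3. Stop once even that text is at least as similar to a seed as the
         average pairwise similarity of all texts.
    """
    n = len(vectors)
    if n == 0:
        return []
    if n == 1:
        return [0]

    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Average similarity over all distinct pairs, used as the threshold.
    pair_count = n * (n - 1) / 2
    avg = sum(sim[i][j] for i in range(n) for j in range(i + 1, n)) / pair_count

    # Density of a text: how many other texts are at least `avg` similar to it.
    density = [
        sum(1 for j in range(n) if j != i and sim[i][j] >= avg)
        for i in range(n)
    ]

    seeds = [max(range(n), key=lambda i: density[i])]
    while len(seeds) < n:
        # Similarity of each remaining text to its most similar seed.
        closest = {
            i: max(sim[i][s] for s in seeds)
            for i in range(n)
            if i not in seeds
        }
        candidate = min(closest, key=closest.get)
        if closest[candidate] >= avg:
            break  # every remaining text is already close to some seed
        seeds.append(candidate)
    return seeds


# Toy input: three texts about fruit, two about programming.
docs = [
    {"apple": 1.0, "fruit": 1.0},
    {"apple": 1.0, "fruit": 1.0, "juice": 1.0},
    {"apple": 1.0, "fruit": 1.0, "pie": 1.0},
    {"python": 1.0, "code": 1.0},
    {"python": 1.0, "code": 1.0, "bug": 1.0},
]
seeds = pick_initial_points(docs)
print(seeds)
```

On this toy input the sketch selects one seed from each of the two topics, so the cluster number is derived from the data rather than set by hand. The neighbor-based refinement of the seeds and the outlier exclusion described in the abstract are not covered here.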

Keywords/Search Tags:

Search Results, K-Means Algorithm, The Initial Point, Density, Neighbor

Related items

1	Stacked Hashing Quantization Algorithm For Nearest Neighbor Search
2	Research On Bitmap Representation And Improvement Of Clustering Algorithm For Search Results
3	Study On Problems To Select Initial Cluster Centers Of The K-means Algorithm
4	Based On The Selection Of The Initial Point Of K-means Clustering Algorithm And Its Application
5	Research On Search Results Clustering Technology For Cloud Search Engine
6	Research On Optimization And Parallel Of K-means Algorithm On Spark
7	Location-Aware Based Neighbor Network Construction Algorithm And P2P Neighbor Search In Maze
8	Research On Hashing Accelerated Approximate Nearest-Neighbor Search
9	Improvement Of K-means Algorithm And Its Application In The Text Data Cluster
10	The Research And Application Of Text Clustering Based On Improved K-means Algorithm