Font Size: a A A

Research On Bitmap Representation And Improvement Of Clustering Algorithm For Search Results

Posted on:2019-10-23Degree:MasterType:Thesis
Country:ChinaCandidate:L S ChenFull Text:PDF
GTID:2428330596466416Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
People are more likely to acquire information through the Internet and search engines because of the rapid development of network technology and search technology,and search results' variety and disorder are the most important factors which influence users to obtain effective information quickly.Recently,most researches on search results clustering are focused on clustering methods and ignored the effect of the text representation method of the search results on the clustering effect and convergence rate.Moreover,as a widely used algorithm in search results clustering,the K-means algorithm's clustering effect and convergence rate are easily affected by the initial clustering centers.Whereas,there are still some problems in the current research achievements of initial clustering centers.Based on these,the thesis mainly studies the text representation method of search results and the initial clustering center of K-means algorithm,and the following jobs have been completed.(1)In view of the problems that vector space model has the disadvantages of time consuming of text similarity calculation and large memory space requirement,the bitmap text representation method which combines the advantages of vector space model and Boolean model is proposed to express the search results.In this method,the search results are expressed as feature vectors with the vector space model firstly,and then the feature weights are converted into corresponding Boolean values.Finally,the Boolean values are stored in units of bits to form a bitmap feature vector.The results demonstrate that the bitmap text representation method can accelerate the computing speed of text similarity and reduce the storage space needed.(2)In view of the problems that random selection of initial clustering centers for K-means algorithm leads to the non-repeatable implementation.At the same time,the algorithm may fall into the local optimal solution and slow down the convergence rate because of the initial clustering centers.An initial clustering center selection algorithm based on pessimistic criterion and k-Nearest neighbor algorithm is presented.This method firstly identifies K farthest data as the candidate initial cluster centers by using the pessimistic criterion,and then finds out the k nearest neighbors of these data respectively,finally,takes the center of each neighbor cluster as the initial clustering centers.The results indicate that initial clustering center selection algorithm based on pessimistic criterion and k-Nearest neighbor algorithm can availably raise the clustering effect and convergence rate of K-means algorithm while guaranteeing the stability of the algorithm.(3)In view of the problems that cluster search engines exist,a prototype system of clustering search engine based on the fruits of this thesis is designed and implemented.The prototype system includes the acquisition of search results,the preprocessing of search results,text representation of search results,initial clustering centers selection,search results clustering and the display of search results.The running results of the prototype system show that applying the fruits of this thesis to search results clustering can effectively improve the efficiency of user acquisition of information and enhance the user experience.At the same time,it also provides a reference value for the future development of similar clustering search engine system.
Keywords/Search Tags:Search Results, Clustering, Bitmap, Pessimistic Criterion, Nearest Neighbor Algorithm
PDF Full Text Request
Related items