
Application And Research Of Web Document Clustering In Search Engine

Posted on: 2010-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: X F Yuan
Full Text: PDF
GTID: 2178360275451085
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of Internet data, search engine technology has been widely researched and a number of excellent search engines have emerged. However, current search engines present the returned results only as a simple linear list. The information users really want may be buried in a long list of results, which is a great inconvenience. This thesis clusters the results returned by the search engine and organizes them into a hierarchical structure in which the similarity between documents in different clusters is as small as possible. Each cluster is given a descriptive label so that users can browse more easily and spend less time finding the results they need.

Based on a study of the main current clustering algorithms, an improved algorithm, STC-I, is devised from the STC algorithm. STC-I addresses two flaws of STC: the dimensionality of the term space is too high, and the correlation between the query keywords and the documents is not calculated. STC-I removes synonyms and near-synonyms to reduce the dimensionality of the document set, thereby reducing the cost of the algorithm. It also calculates the relevance of each document and excludes documents with low correlation from clustering, which improves the clustering. Experiments show that the algorithm improves substantially in both time complexity and clustering accuracy.

Because the main factor in classifying documents is their theme, a clustering method, HTBC, is devised. It extracts keywords from the title and body of each document, trains on the text set to generate word clusters, assigns each keyword to a word cluster, groups word clusters that share the same theme attribute, and finally produces the document clusters. HTBC has four steps: preprocessing, constructing the theme vector, generating the word clusters, and theme clustering. The experimental data show that HTBC outperforms K-Means, AHC and STC in terms of accuracy and recall.

Finally, a search engine system with a clustering module is developed based on the above research. The system includes a Web crawler, an index system, and a retrieval system with a clustering module in which the HTBC algorithm is applied. Analysis of the running system shows that its design is reasonable.
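
The two STC-I preprocessing ideas described in the abstract (collapsing synonyms to shrink the term space, and leaving documents with low query relevance out of clustering) can be illustrated with a minimal sketch. This is not the thesis's implementation. The synonym table, the threshold value, the function names, and the use of cosine similarity over term counts as the relevance measure are all assumptions made for illustration; the abstract does not say which measure STC-I uses.

```python
import math
from collections import Counter

# Hypothetical synonym table: maps each variant to one canonical term.
SYNONYMS = {"automobile": "car", "auto": "car", "photo": "picture", "image": "picture"}


def normalize(text, synonyms=SYNONYMS):
    """Lowercase, split on whitespace, and collapse synonyms to one term."""
    return [synonyms.get(tok, tok) for tok in text.lower().split()]


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_by_relevance(query, snippets, threshold=0.1):
    """Return (snippet, score) pairs whose query similarity meets the threshold.

    The threshold is an assumed value; snippets below it would be left out
    of the clustering step.
    """
    q = Counter(normalize(query))
    kept = []
    for s in snippets:
        score = cosine(q, Counter(normalize(s)))
        if score >= threshold:
            kept.append((s, score))
    return kept
```

With the query "car rental" and the snippets "cheap auto rental deals", "used automobile prices" and "cooking pasta at home", the first two share the canonical term "car" after normalization and are kept, while the third has no terms in common with the query and is dropped before clustering.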
Keywords/Search Tags: Document Clustering, Search Engine, STC, Mutual Information, Theme