Application And Research Of Web Document Clustering In Search Engine

Posted on:2010-07-14

Degree:Master

Type:Thesis

Country:China

Candidate:X F Yuan

GTID:2178360275451085

Subject:Computer application technology

Abstract/Summary:

With the explosive growth of Internet data, search engine technology has been widely researched and a number of excellent search engines have emerged. However, current search engines present the returned results only as a simple linear list. The information users really want may be buried in a huge list of results, which causes great inconvenience. This thesis is devoted to clustering the results returned by a search engine and organizing them into a hierarchical structure, in which the similarity between documents in different clusters is as small as possible. Each cluster is given a descriptive label to make browsing easier and to reduce the time users need to find the results they want. After a study of the current mainstream clustering algorithms, an improved algorithm, STC-I, is devised on the basis of the STC algorithm. STC-I addresses two flaws of STC: the dimensionality of the term space is too high, and the correlation between the query keywords and the documents is not calculated. STC-I removes synonyms and near-synonyms to reduce the dimensionality of the document set, thereby reducing the cost of the algorithm. It also calculates the relevance of each document to the query and excludes documents with low correlation from clustering, which improves the clustering quality. Experiments show that the algorithm is greatly improved in both time complexity and clustering accuracy. Because the main factor in classifying documents is their theme, a clustering method, HTBC, is also devised. It extracts keywords from the title and body of each document, trains on the text set to generate word clusters, assigns each keyword to a word cluster, merges word clusters that share the same theme attribute, and finally performs the clustering. HTBC has four steps: preprocessing, constructing the theme vector, generating the word clusters, and theme clustering. The experimental data show that HTBC outperforms K-Means, AHC and STC in terms of accuracy and recall. Finally, a search engine system with a clustering module is developed on the basis of the above research. The system includes a Web crawler, an indexing system and a retrieval system with a clustering module in which the HTBC algorithm is applied. Analysis of the system in operation shows that its design is reasonable.
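
The two preprocessing ideas attributed to STC-I above (merging synonyms to shrink the term space, and leaving out documents weakly related to the query) can be illustrated with a minimal sketch. It is not the thesis's implementation: the synonym table, the term-overlap relevance measure, the threshold value and all function names are assumptions chosen only for illustration.

```python
import re

# Hypothetical synonym table: each variant maps to one canonical term.
SYNONYMS = {"automobile": "car", "auto": "car", "picture": "image", "photo": "image"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(tokens, synonyms=SYNONYMS):
    """Replace each token by its canonical synonym, reducing the term space."""
    return [synonyms.get(t, t) for t in tokens]


def relevance(query_tokens, doc_tokens):
    """Fraction of distinct query terms found in the document.

    This overlap score is an assumed stand-in for whatever
    query-document correlation the thesis actually computes.
    """
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)


def preprocess(query, snippets, threshold=0.5):
    """Return normalized token lists for snippets relevant enough to cluster."""
    query_tokens = normalize(tokenize(query))
    kept = []
    for snippet in snippets:
        tokens = normalize(tokenize(snippet))
        if relevance(query_tokens, tokens) >= threshold:
            kept.append(tokens)
    return kept


if __name__ == "__main__":
    results = [
        "A picture of a car",
        "Weather report for today",
        "Car rental prices",
    ]
    for tokens in preprocess("auto photo", results):
        print(tokens)
```

Only the snippets that pass the relevance filter would then be handed to the clustering step, so both the vocabulary and the number of documents to cluster are smaller than in the raw result list.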

Keywords/Search Tags:

Document Clustering, Search Engine, STC, Mutual Information, Theme

Related items

1	Design And Implementation Of Intranet Search Engine In Court
2	The Design And Implementation Of Chinese Personal Name Search Engine
3	Document Clustering In Search Engine
4	The Study On Search Strategy And Algorithm Design Of Theme Search Engine
5	Based On The Theme Of The Web Information Extraction And Intelligent Search Technology Research And To Achieve
6	Real Estate-oriented Vertical Search Engine Research And Implementation
7	Based On The Theme Of The News Search Engine Research And Realization
8	Research On Focused Search Engine For Military
9	The Study Of Extracting Feature Words On Chinese Search Engine
10	Application And Research Of Document Cluster In Web Results Of Search Engine