Research On Duplicate Webpage Detection Technology In Search Engine

Posted on:2012-05-24

Degree:Master

Type:Thesis

Country:China

Candidate:R Tang

Full Text:PDF

GTID:2218330344450972

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, everyone can access a vast amount of information. Retrieving the correct, intended information in a timely manner has become a critical problem. Search engines are the tools that provide useful query results. In practice, however, query results often contain heavily duplicated web pages. This duplication not only lowers the query efficiency of a search engine and wastes the memory used to buffer query results, but also significantly degrades the user experience. Duplicate detection, the main topic of this thesis, is therefore a desirable technique for improving the query efficiency and service quality of search engines.

The thesis takes the detection of duplicate web pages in search engines as its research background. It first gives a detailed survey of prior research, summarizing and comparing different algorithms, and then proposes a semantics-based duplicate web page detection algorithm. The algorithm mainly optimizes both feature extraction and feature comparison. More specifically, because words have rich semantic relations, text preprocessing adds the combination of synonyms and liquid words, and feature extraction relies on the semantic context of each word. Compared with conventional algorithms, weighting factors for the position and length of keywords are added. Furthermore, feature comparison uses a binary tree sort, which greatly improves efficiency over the conventional pair-wise comparison algorithm. Two methods are also proposed for ordering the extracted keywords: SORTw(Kd), which sorts by keyword weight, and SORTa(Kd), which sorts by keyword letters.

To verify the efficacy and efficiency of the proposed algorithm, a simulation environment was built on Windows. Simulation results show that the proposed algorithm has relatively high precision, a high recall rate, and low time and space complexity. The approach is promising for practical applications after further improvement.
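
The abstract gives no formulas or implementation details, so the following is only a minimal illustrative sketch of the general pipeline it describes: weight keywords by position and length, order them by weight or alphabetically, hash the ordered keywords with MD5 (listed in the keywords), and look the fingerprint up in a binary tree instead of comparing every pair of pages. The weighting formula, the function names (`keyword_weights`, `sort_w`, `sort_a`, `is_duplicate`), the `FingerprintTree` class, and the parameter `k` are all assumptions made for illustration, not the thesis's actual method. Synonym merging and semantic context are omitted.

```python
import hashlib
import re


def keyword_weights(text):
    """Hypothetical weighting: each occurrence counts more if it is early or long."""
    words = re.findall(r"[a-z]+", text.lower())
    n = len(words)
    weights = {}
    for i, w in enumerate(words):
        position_factor = 1.0 + (n - i) / n  # assumption: earlier words weigh more
        length_factor = len(w)               # assumption: longer words weigh more
        weights[w] = weights.get(w, 0.0) + position_factor * length_factor
    return weights


def sort_w(keywords, weights):
    """Order keywords by descending weight (in the spirit of SORTw), ties alphabetical."""
    return sorted(keywords, key=lambda w: (-weights[w], w))


def sort_a(keywords):
    """Order keywords alphabetically (in the spirit of SORTa)."""
    return sorted(keywords)


def fingerprint(ordered_keywords):
    """MD5 digest of the ordered keyword list, as a 32-character hex string."""
    return hashlib.md5(" ".join(ordered_keywords).encode("utf-8")).hexdigest()


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


class FingerprintTree:
    """Plain (unbalanced) binary search tree keyed by fingerprint strings."""

    def __init__(self):
        self.root = None

    def seen_before(self, key):
        """Insert key; return True if it was already present."""
        if self.root is None:
            self.root = Node(key)
            return False
        node = self.root
        while True:
            if key == node.key:
                return True
            branch = "left" if key < node.key else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(key))
                return False
            node = child


def is_duplicate(tree, text, k=5):
    """Fingerprint the top-k keywords of text and check the tree for a match."""
    weights = keyword_weights(text)
    top = sort_w(list(weights), weights)[:k]
    return tree.seen_before(fingerprint(sort_a(top)))


tree = FingerprintTree()
first = is_duplicate(tree, "Search engines detect duplicate web pages")
second = is_duplicate(tree, "search  ENGINES detect duplicate web pages!")
third = is_duplicate(tree, "Image texture feature extraction methods")
```

In this sketch each new page costs one tree lookup rather than a comparison against every stored page, which is the kind of saving over pair-wise comparison that the abstract refers to. An exact-match fingerprint only catches pages whose selected keywords coincide, so it is a simplification of any semantic matching the thesis performs.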

Keywords/Search Tags:

Duplication Detection, Feature Extraction, Feature Comparisons, MD5, Binary Tree

Related items

1	Research On Feature Description Method Of Image With Noise
2	Research On Local Invariant Feature Extraction Of Images And Its Application
3	Research On Texture Feature Extraction And Automatic Classification Algorithms
4	Research Of Video Object Extraction Based On Support Vector Machine
5	The Research Of Feature Extraction In Facial Images
6	Product Feature Extraction Algorithm Based On Tree Structure
7	Analysis Of The Application In Moving Target Detection Based On Feature Extraction
8	Face Recognition Based On Wavelet Analysis And Feature Fusion
9	Research On Feature Extraction For Texture Image Recognition
10	Research On Detection Of Airport Shelter Based On Multi-Feature In Optical Remote Sensing Images