Research On The Duplicated Web Pages Detection Algorithms Of Search Engine

Posted on:2013-01-02

Degree:Master

Type:Thesis

Country:China

Candidate:F L Yan

Full Text:PDF

GTID:2248330377958336

Subject:Computer software and theory

Abstract/Summary:

With the popularity and rapid development of the Internet, web information is increasing exponentially. Search engines have become efficient tools that help users obtain valuable information from the Web. Because it is so easy to post information on the Internet, many duplicated and near-duplicated web pages exist. These pages cause several problems for search engines: they degrade the user experience, waste crawling and storage resources, lengthen the inverted index lists, and reduce retrieval efficiency. Detecting duplicated and near-duplicated web pages can therefore effectively improve the quality of search engines.

In recent years, major search engine companies and many scholars, at home and abroad, have proposed many duplicated and near-duplicated web page detection algorithms, such as characteristics-based algorithms, the I-Match algorithm, terms-based algorithms, and Digital Syntactic Clustering. This thesis analyzes the advantages and disadvantages of these traditional algorithms and finds that they share a common idea: first extract certain information from the text, and then compute the similarity. The algorithms differ in how they extract information from the text, so their similarity computations differ as well. To improve the efficiency of the similarity computation, some algorithms compress the extracted information. This analysis shows that extracting effective information that accurately represents the text is the key factor in computing the similarity of duplicated web pages.

This thesis analyzes the advantages and disadvantages of two classic duplicated web page detection algorithms and, based on their shortcomings, proposes two improved algorithms.
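The extract-then-compare pattern described above can be illustrated with a minimal Python sketch. It assumes word shingles as the extracted information, a 64-bit BLAKE2b digest as the compression step, and Jaccard resemblance as the similarity measure. The function names, the shingle width `k`, and the choice of hash are hypothetical and are not taken from the thesis.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(shingle):
    """Compress a shingle to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def resemblance(text_a, text_b, k=3):
    """Jaccard similarity of the two sets of shingle fingerprints."""
    a = {fingerprint(s) for s in shingles(text_a, k)}
    b = {fingerprint(s) for s in shingles(text_b, k)}
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)
```

Two pages would then be flagged as near duplicates when `resemblance` exceeds a chosen threshold; the threshold is a tuning parameter that this sketch does not fix.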
The main work of this thesis is as follows:

(1) An improved duplicated web page detection algorithm based on DSC

The DSC (Digital Syntactic Clustering) algorithm is one of the classical algorithms for duplicated web page detection. Its basic idea is to cut the text into a certain number of shingles, from which some shingles are chosen to participate in the similarity comparison. The disadvantage of this algorithm is that the shingles are selected randomly, so it does not make full use of the content features of the text. To address this deficiency, the improved algorithm maintains a term set and uses the set to choose shingles, so that the shingles participating in the similarity comparison take advantage of the structure and content of the text.

(2) An improved duplicated web page detection algorithm based on term matching

The duplicated web page detection algorithm based on term matching first extracts the terms from the text using the TFIDF algorithm and then represents the text as a term Vector Space Model. Finally, the cosine formula is used to determine the similarity. The disadvantage of the TFIDF algorithm is that it does not make full use of the location of terms in the text when computing term weights. By observing web pages, we found that their contents are relatively short and that many of them contain headlines. These headlines are the briefest summaries of the contents. This characteristic is applied in computing the term weights.

(3) Performance evaluation of the improved algorithms

To evaluate the improved algorithms, we implement a prototype search engine system based on Lucene, an open-source indexing and retrieval toolkit. The experimental results show that the improved algorithms achieve better recall and precision in detecting duplicated and near-duplicated web pages than the original algorithms.
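The headline-aware term weighting in item (2) can be sketched in Python as follows. The abstract does not give the exact weighting formula, so this sketch assumes that each headline occurrence of a term simply counts `title_boost` times in the term frequency, and it uses a smoothed IDF. The names `weighted_tf`, `tfidf_vectors`, `cosine`, and `title_boost` are hypothetical.

```python
import math
from collections import Counter


def weighted_tf(title, body, title_boost=2.0):
    """Term frequencies where each headline occurrence adds title_boost."""
    tf = Counter(body.lower().split())
    for term in title.lower().split():
        tf[term] += title_boost
    return tf


def tfidf_vectors(pages, title_boost=2.0):
    """pages: list of (title, body). Returns one {term: weight} dict per page."""
    tfs = [weighted_tf(title, body, title_boost) for title, body in pages]
    # Document frequency: number of pages in which each term appears.
    df = Counter(term for tf in tfs for term in tf)
    n = len(pages)
    # Smoothed IDF (an assumption) so terms found in every page keep weight.
    return [
        {t: f * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, f in tf.items()}
        for tf in tfs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With `title_boost` set to 0 the sketch reduces to plain TFIDF over the body text, which makes it easy to compare the two weightings on the same page set.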

Keywords/Search Tags:

Search Engine, Duplicated Web Pages Detection, Digital Syntactic Clustering, Terms, Lucene

Related items

1	Research And Implementation On Removing Duplicated Web Pages Of Search Engine System
2	Research On Vertical Search Engine
3	Research On Key Techniques Of Vertical Search Engine Based On Lucene
4	Research On Results Merging Algorithm In Meta Search Engine
5	Research On Results Merging Technology In Meta Search Engine
6	The Design And Implementation Of Vertical Search Engine Based On Duplicated Web Pages Elimination
7	Research And Implementation On Removing Duplicated Web Pages Algorithm Of Search Engine
8	Research And Implementation Of The Small-scale Search Engine Based On Lucene
9	Research On The Algorithm For Chinese Duplicated Web Pages Detection
10	The Design And Implementation Of Lucene-Based Digital Product Vertical Search Engine