Research On The Duplicated Web Pages Detection Algorithms Of Search Engine

Posted on: 2013-01-02  Degree: Master  Type: Thesis
Country: China  Candidate: F L Yan  Full Text: PDF
GTID: 2248330377958336  Subject: Computer software and theory
Abstract/Summary:
With the popularity and rapid development of the Internet, web information is increasing exponentially. Search engines have become efficient tools that help users obtain valuable information from the Web. Because it is so easy to post information on the Internet, there are many duplicated and near-duplicated web pages. These pages cause a number of problems for search engines: they degrade the user experience, waste crawling and storage resources, lengthen the inverted index lists, and reduce retrieval efficiency. Detecting duplicated and near-duplicated web pages can therefore effectively improve the quality of search engines.
In recent years, major search engine companies and many scholars, at home and abroad, have proposed many algorithms for detecting duplicated and near-duplicated web pages, such as characteristics-based algorithms, the I-Match algorithm, terms-based algorithms, Digital Syntactic Clustering, and so on. This thesis analyzes the advantages and disadvantages of these traditional algorithms and finds that they share a common idea: first extract certain information from the text, and then compute the similarity. The algorithms differ in how they extract information from the text, so their similarity computations differ as well. To improve the efficiency of the similarity computation, some algorithms compress the extracted information. This analysis shows that extracting effective information that accurately represents the text is the key factor in computing the similarity of duplicated web pages.
This thesis analyzes the advantages and disadvantages of two classic duplicated web page detection algorithms and, based on their shortcomings, proposes two improved algorithms.
The main work of this thesis is as follows:
(1) An improved duplicated web page detection algorithm based on DSC. The DSC (Digital Syntactic Clustering) algorithm is one of the classical algorithms for duplicated web page detection. Its basic idea is to cut the text into a certain number of shingles, from which some shingles are chosen to participate in the similarity comparison. The disadvantage of this algorithm is that the shingles are selected randomly, so it does not make full use of the content features of the text. To address this deficiency, the improved algorithm maintains a term set and uses it to choose shingles, so that the shingles participating in the similarity comparison take advantage of the structure and content of the text.
(2) An improved duplicated web page detection algorithm based on terms matching. The duplicated web page detection algorithm based on terms matching first extracts the terms from the text using the TFIDF algorithm, and the text is then represented as a term vector in the Vector Space Model. Finally, the cosine formula is used to determine similarity. The disadvantage of the TFIDF algorithm is that it does not make full use of the location of terms in the text when computing term weights. Through observation of web pages, we found that their contents are relatively short and that many of them contain headlines. These headlines are the briefest summaries of the contents. This characteristic is applied when computing the weights of terms.
(3) Performance evaluation of the improved algorithms. To evaluate the improved algorithms, we implemented a prototype search engine system based on Lucene, an open-source indexing and retrieval toolkit. The experimental results show that the improved algorithms achieve better recall and precision in detecting duplicated and near-duplicated web pages than the original algorithms.
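The two ideas in items (1) and (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, whose details the abstract does not give. The selection rule (keep a shingle if it contains at least one term from the term set), the smoothed IDF formula, the `title_boost` multiplier, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def shingles(tokens, w=3):
    """Return the set of all contiguous w-token shingles of a token list."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def select_by_terms(shingle_set, term_set):
    """Keep shingles that contain at least one term from term_set.

    Assumed selection rule; the abstract only says a term set guides the choice.
    """
    return {s for s in shingle_set if term_set.intersection(s)}


def resemblance(a, b):
    """Jaccard resemblance |a & b| / |a | b| of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def title_weighted_tfidf(body, title, doc_freq, n_docs, title_boost=2.0):
    """TF-IDF weights in which terms appearing in the headline are boosted.

    doc_freq maps a term to the number of documents containing it.
    The smoothed IDF and the multiplicative boost are assumptions.
    """
    tf = Counter(body) + Counter(title)
    in_title = set(title)
    weights = {}
    for term, count in tf.items():
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        boost = title_boost if term in in_title else 1.0
        weights[term] = count * idf * boost
    return weights


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    a = "search engines detect duplicated web pages quickly".split()
    b = "search engines detect duplicated web pages slowly".split()
    terms = {"duplicated"}  # hypothetical term set
    print(resemblance(shingles(a), shingles(b)))
    print(resemblance(select_by_terms(shingles(a), terms),
                      select_by_terms(shingles(b), terms)))
```

In this toy example the two token lists differ only in their final word, so restricting the comparison to shingles that contain a chosen term removes the differing shingle and raises the resemblance. How the term set is built and how strongly headline terms should be weighted would need tuning on real data.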
Keywords/Search Tags: Search Engine, Duplicated Web Pages Detection, Digital Syntactic Clustering, Terms, Lucene