
The Study of Search Engine Replica Detection Methods

Posted on: 2008-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: L Ning
Full Text: PDF
GTID: 2178360215980821
Subject: Computer application technology
Abstract/Summary:
Search engines have become the main means by which computer users access information. Compared with traditional means of access, they retrieve information more quickly, more conveniently, and more comprehensively, and have become an indispensable part of the electronic age. However, because of the special nature of networked electronic information, websites contain a great deal of duplicated information: pages whose content is identical, or roughly the same, under different URLs. For the efficiency of both users and search engines, the study of web replica detection is therefore necessary.

Web replica detection consists of two main parts. The first is processing of the original page, chiefly noise purification and extraction of the page's theme. The second, which receives more emphasis here, is replica detection on the page content itself. Many research institutions are studying web page similarity and have proposed a number of mature similarity detection methods, which fall into three main categories: URL analysis, link analysis, and content analysis.

This thesis is divided into four chapters. Chapter 1 introduces the background and main tasks of the subject. Chapter 2 describes existing page purification methods and the use of JTidy for purification: JTidy's parseDOM method is invoked on an InputStream of the page to create a DOM tree; the tree is then traversed with simple statements using the standard DOM API, and the content between the specific tags of interest is extracted to create the indexed file.

Chapter 3 presents web similarity detection methods. Existing methods are discussed, and a content-based similarity detection method using Bloom filters is proposed. After noise purification, the theme of the page is extracted; CDC (content-defined chunking) then divides each document into a set of content blocks. All content blocks of a page are hashed, so each page owns a Bloom filter, and the Bloom filters of all pages are preserved in storage. When a new page is captured, its Bloom filter is computed by the same steps and compared bit by bit with the stored filters; if more than 70% of the set bits agree, the page is judged a replica.

Chapter 4 analyzes and evaluates web replica detection. It presents the data tables produced by applying the algorithm to document similarity analysis, and examines the influence of the similarity threshold on the number of similar pages, the influence of keyword popularity on the number of similar documents, the time to build a Bloom filter, and the response time of similarity detection.
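As an illustration of the Chapter 2 purification step, the following is a minimal sketch, not the thesis's actual code. It assumes a local input file page.html, and uses the p tag purely as a stand-in for the "specific labels" whose content is extracted; the abstract does not say which tags are used.

import java.io.FileInputStream;
import java.io.InputStream;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class PagePurifier {

    // Recursively collect the text beneath a node (DOM Level 1 calls only).
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // parseDOM repairs the (often malformed) HTML read from the
        // InputStream and returns a DOM tree.
        try (InputStream in = new FileInputStream("page.html")) {
            Document doc = tidy.parseDOM(in, null);

            // Traverse the tree with standard DOM API calls and extract
            // the content between the tags of interest.
            NodeList paragraphs = doc.getElementsByTagName("p");
            StringBuilder theme = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                collectText(paragraphs.item(i), theme);
                theme.append('\n');
            }
            System.out.print(theme); // text that would go into the indexed file
        }
    }
}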
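The Chapter 3 pipeline can likewise be sketched, with heavy simplifications: a simple running hash stands in for the Rabin fingerprint that drives CDC in the thesis, FILTER_BITS and CHUNK_MASK are assumed toy parameters not given in the abstract, and isReplica encodes one plausible reading of the 70% bit-agreement rule (shared set bits over the union of set bits).

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BloomReplicaDetector {
    static final int FILTER_BITS = 1 << 16; // assumed filter size
    static final int CHUNK_MASK = 0x3F;     // assumed mask: ~64-byte average chunks

    // Simplified content-defined chunking: a boundary is declared whenever
    // the running hash's low bits are all zero, so chunk edges follow the
    // content rather than fixed offsets.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int hash = 0, start = 0;
        for (int i = 0; i < bytes.length; i++) {
            hash = hash * 31 + (bytes[i] & 0xFF);
            if ((hash & CHUNK_MASK) == 0) {
                chunks.add(new String(bytes, start, i + 1 - start, StandardCharsets.UTF_8));
                start = i + 1;
                hash = 0; // restart the hash at each boundary
            }
        }
        if (start < bytes.length) {
            chunks.add(new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    // Hash every content block into the page's Bloom filter
    // (two derived bit positions per block).
    static BitSet bloomFilter(List<String> blocks) {
        BitSet bits = new BitSet(FILTER_BITS);
        for (String b : blocks) {
            int h = b.hashCode();
            bits.set(Math.floorMod(h, FILTER_BITS));
            bits.set(Math.floorMod(h * 0x9E3779B9 + 1, FILTER_BITS));
        }
        return bits;
    }

    // One reading of the 70% rule: the pages are replicas when the set bits
    // shared by both filters cover more than 70% of the union of set bits.
    static boolean isReplica(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        return either.cardinality() > 0
                && (double) both.cardinality() / either.cardinality() > 0.70;
    }

    public static void main(String[] args) {
        String base = "Search engines have become the main means of access to information. ".repeat(20);
        BitSet stored = bloomFilter(chunk(base + "Last updated 2008."));
        BitSet fresh  = bloomFilter(chunk(base + "Last updated 2009."));
        System.out.println("replica? " + isReplica(stored, fresh));
    }
}

The point of content-defined boundaries is that a small edit changes only the chunks it touches, so two near-duplicate pages still share most content blocks and their Bloom filters remain largely identical.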
Keywords/Search Tags: replica detection, Bloom filter, CDC, Rabin fingerprint, purification