
The Study of Search Engine Replica Detection Methods

Posted on: 2008-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: L Ning
Full Text: PDF
GTID: 2178360215980821
Subject: Computer application technology
Abstract/Summary:
Search engines have become the main means by which computer users access information. Compared with traditional means of access, they retrieve information more quickly, more conveniently, and more comprehensively, and have become an indispensable part of the electronic age. However, because of the special nature of networked electronic information, websites contain a great deal of duplicated information: pages whose content is identical, or roughly the same, under different URLs. For the efficiency of both users and search engines, the study of web replica detection is therefore necessary.

Web replica detection consists of two main parts. The first is processing of the original page, chiefly noise purification and extraction of the page's theme. The second, which receives more emphasis here, is replica detection on the page content itself. Many research institutions are studying web page similarity and have proposed a number of mature similarity detection methods, which fall into three main categories: URL analysis, link analysis, and content analysis.

This thesis is divided into four chapters. Chapter 1 introduces the background and main tasks of the subject. Chapter 2 describes existing page purification methods and the use of JTidy for purification: JTidy's parseDOM method is invoked on an InputStream of the page to create a DOM tree; the tree is then traversed with simple statements using the standard DOM API, and the content between the specific tags of interest is extracted to create the indexed file.

Chapter 3 presents web similarity detection methods. Existing methods are discussed, and a content-based similarity detection method using Bloom filters is proposed. After noise purification, the theme of the page is extracted; CDC (content-defined chunking) then divides each document into a set of content blocks. All content blocks of a page are hashed, so each page owns a Bloom filter, and the Bloom filters of all pages are preserved in storage. When a new page is captured, its Bloom filter is computed by the same steps and compared bit by bit with the stored filters; if more than 70% of the set bits agree, the page is judged a replica.

Chapter 4 analyzes and evaluates web replica detection. It presents the data tables produced by applying the algorithm to document similarity analysis, and examines the influence of the similarity threshold on the number of similar pages, the influence of keyword popularity on the number of similar documents, the time to build a Bloom filter, and the response time of similarity detection.
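As an illustration of the Chapter 2 purification step, the following is a minimal sketch, not the thesis's actual code. It assumes a local input file page.html, and uses the p tag purely as a stand-in for the "specific labels" whose content is extracted; the abstract does not say which tags are used.

import java.io.FileInputStream;
import java.io.InputStream;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class PagePurifier {

    // Recursively collect the text beneath a node (DOM Level 1 calls only).
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // parseDOM repairs the (often malformed) HTML read from the
        // InputStream and returns a DOM tree.
        try (InputStream in = new FileInputStream("page.html")) {
            Document doc = tidy.parseDOM(in, null);

            // Traverse the tree with standard DOM API calls and extract
            // the content between the tags of interest.
            NodeList paragraphs = doc.getElementsByTagName("p");
            StringBuilder theme = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                collectText(paragraphs.item(i), theme);
                theme.append('\n');
            }
            System.out.print(theme); // text that would go into the indexed file
        }
    }
}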
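The Chapter 3 pipeline can likewise be sketched, with heavy simplifications: a simple running hash stands in for the Rabin fingerprint that drives CDC in the thesis, FILTER_BITS and CHUNK_MASK are assumed toy parameters not given in the abstract, and isReplica encodes one plausible reading of the 70% bit-agreement rule (shared set bits over the union of set bits).

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BloomReplicaDetector {
    static final int FILTER_BITS = 1 << 16; // assumed filter size
    static final int CHUNK_MASK = 0x3F;     // assumed mask: ~64-byte average chunks

    // Simplified content-defined chunking: a boundary is declared whenever
    // the running hash's low bits are all zero, so chunk edges follow the
    // content rather than fixed offsets.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int hash = 0, start = 0;
        for (int i = 0; i < bytes.length; i++) {
            hash = hash * 31 + (bytes[i] & 0xFF);
            if ((hash & CHUNK_MASK) == 0) {
                chunks.add(new String(bytes, start, i + 1 - start, StandardCharsets.UTF_8));
                start = i + 1;
                hash = 0; // restart the hash at each boundary
            }
        }
        if (start < bytes.length) {
            chunks.add(new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    // Hash every content block into the page's Bloom filter
    // (two derived bit positions per block).
    static BitSet bloomFilter(List<String> blocks) {
        BitSet bits = new BitSet(FILTER_BITS);
        for (String b : blocks) {
            int h = b.hashCode();
            bits.set(Math.floorMod(h, FILTER_BITS));
            bits.set(Math.floorMod(h * 0x9E3779B9 + 1, FILTER_BITS));
        }
        return bits;
    }

    // One reading of the 70% rule: the pages are replicas when the set bits
    // shared by both filters cover more than 70% of the union of set bits.
    static boolean isReplica(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        return either.cardinality() > 0
                && (double) both.cardinality() / either.cardinality() > 0.70;
    }

    public static void main(String[] args) {
        String base = "Search engines have become the main means of access to information. ".repeat(20);
        BitSet stored = bloomFilter(chunk(base + "Last updated 2008."));
        BitSet fresh  = bloomFilter(chunk(base + "Last updated 2009."));
        System.out.println("replica? " + isReplica(stored, fresh));
    }
}

The point of content-defined boundaries is that a small edit changes only the chunks it touches, so two near-duplicate pages still share most content blocks and their Bloom filters remain largely identical.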
Keywords/Search Tags: replica detection, Bloom filter, CDC, Rabin fingerprint, purification