
Research and Implementation of Duplicate Webpage Removal in a Search Engine System

Posted on: 2012-12-31  Degree: Master  Type: Thesis
Country: China  Candidate: J J Niu  Full Text: PDF
GTID: 2178330332995469  Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer hardware, software, and Internet technology, the volume and variety of online information have grown quickly; the Internet has become one of the largest, most diverse, and most comprehensive information resource libraries in human history. However, when users look for information on the Internet, they usually know only their search keywords, not specific URLs, so they rely on search engines to find what they need. Search engines make it convenient for users to retrieve information and save them time, and they are widely welcomed; many powerful search engines have emerged, such as Baidu for Chinese and Google for multiple languages. However, some sites pursue commercial interests and reprint large numbers of articles from other sources to raise their hit counts. Popular articles are reposted across blogs and forums, and after major public events and hot topics, many sites publish and reproduce the same reports. As a result, a search engine may return many links with different URLs but identical content, which degrades the user experience: users must sift through large numbers of identical results, and duplicate pages also inflate the storage requirements of the index database. Removing duplicate webpages is therefore a way to improve the practicality and efficiency of a search engine.

This thesis first implements main-content extraction using a max text block algorithm based on HTML tags. Building on that, it proposes a duplicate-removal algorithm based on keywords and feature strings, develops an experimental system to validate the algorithm, and demonstrates the algorithm's effectiveness through analysis and discussion of the experimental results. The main work of the thesis is as follows:

1. Theoretical research: analyze the operating principles and key technologies of search engines, trace the development of near-duplicate detection from plain text to webpages, and introduce several classical algorithms.
2. Main-content extraction: detecting duplicate webpages differs from detecting duplicate text, because page noise such as navigation bars, advertisements, and copyright notices must first be removed to obtain the main content. Based on the largest main text block among HTML tags, and taking various types of webpages into account, an algorithm is designed to extract the main content of a page.
3. Algorithm improvement: on top of the extracted main content, three classical duplicate-removal algorithms are examined: the signature-based algorithm, the feature-sentence-based algorithm, and the KCC algorithm. Drawing on their advantages, a duplicate-detection algorithm based on keywords and feature strings is proposed. The algorithm is simple and efficient, can effectively recognize the slight changes pages undergo when they are reproduced, and improves duplicate-removal accuracy.
4. Design and implementation: a simple stand-alone search engine system is implemented on the open-source Lucene framework, with the keyword-and-feature-string algorithm embedded in its duplicate-detection module. The system can crawl webpages as needed, detect duplicate pages, build an index, and return relevant results for the user's query keywords.
5. Experimental analysis: the duplicate-removal algorithm is embedded in the search engine system and run on a crawled data set of 900 pages containing duplicates; analysis of the experimental results demonstrates the effectiveness of the improved algorithm.
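The max text block idea in item 2 can be sketched briefly. The snippet below is only an illustration of the general approach, not the thesis's exact algorithm: it scores each block-level HTML element by the length of the text directly inside it and keeps the longest one, on the assumption that navigation bars, ads, and copyright notices carry relatively little text. The tag set and scoring rule are assumptions chosen for the sketch.

```python
from html.parser import HTMLParser

# Tags treated as text-block containers (an assumption for this sketch).
BLOCK_TAGS = {"div", "td", "article", "section", "p"}

class MaxTextBlockParser(HTMLParser):
    """Keep the block element whose directly contained text is longest."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one text buffer per currently open block element
        self.best = ""    # longest text block seen so far

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.stack[-1].append(text)  # credit text to the innermost block

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            text = " ".join(self.stack.pop())
            if len(text) > len(self.best):
                self.best = text

def extract_main_text(html):
    parser = MaxTextBlockParser()
    parser.feed(html)
    return parser.best
```

On a typical page, the article body is by far the longest text block, so navigation and copyright blocks are discarded automatically; a production extractor would also need to handle pages whose content is split across many small blocks.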
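A keyword-and-feature-string duplicate check in the spirit of item 3 could look like the following sketch. The keyword count (k=10), the feature-string construction (a hash of the sorted top keywords), and the Jaccard similarity threshold are all assumptions for illustration, not the parameters the thesis uses.

```python
import hashlib
import re
from collections import Counter

def keywords(text, k=10):
    """Top-k most frequent words, used as the page's keyword set."""
    words = re.findall(r"\w+", text.lower())
    return {w for w, _ in Counter(words).most_common(k)}

def feature_string(text, k=10):
    """Deterministic fingerprint built from the sorted keyword set."""
    joined = "|".join(sorted(keywords(text, k)))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def is_duplicate(a, b, threshold=0.8):
    # An exact fingerprint match catches verbatim reprints; keyword-set
    # overlap (Jaccard similarity) tolerates the slight edits that
    # reproduced pages often contain.
    if feature_string(a) == feature_string(b):
        return True
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return False
    return len(ka & kb) / len(ka | kb) >= threshold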
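For the index-and-search step in item 4, the thesis builds on Apache Lucene; as a rough illustration of what an inverted index does, a toy version (not Lucene's API) can be sketched as follows, mapping each term to the set of documents that contain it and answering keyword queries with AND semantics.

```python
import re
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents containing every query term.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result
```

In the thesis's system, duplicate detection runs before indexing, so only one copy of each reproduced page reaches a structure like this.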
Keywords/Search Tags: search engine, duplicate webpage removal, webpage noise removal, max text block