Research Of Web Page Purification And Replicas Detection In Search Engine

Posted on: 2009-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: F F Zhu
GTID: 2178360308478566
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has led to explosive growth in digital information. Search engines, which help people find the information they need within this mass of data, are becoming steadily more important. However, web page noise and replicas introduce a great deal of redundant information into retrieval results, which reduces the quality of service of a search engine. Removing web page noise and replicas quickly and accurately is therefore a problem that urgently needs solving.

In this thesis, we study web page purification and replica detection in search engines in depth. We first analyze the effect of web page noise and replicas on search engines, and then introduce the concepts and features related to them. We also study the classical algorithms for web page purification and replica detection and analyze their advantages and shortcomings.

On the one hand, we propose a new web page purification algorithm based on tree edit distance, which exploits the fact that pages within a web site have similar structure and layout. The algorithm uses tree edit distance and a strict top-down mapping principle to detect the site template, which can then be removed from new web pages as noise by a simple procedure. Our experiments show that the algorithm preserves the integrity of web page content while removing web page noise effectively.

On the other hand, we propose a new fingerprint-based algorithm for detecting and removing web page replicas. To eliminate the interference of web page noise, web page purification is integrated into the algorithm. The algorithm makes full use of the content and structural features of web pages and combines them with fingerprint technology to remove duplicate pages. Our experiments show that the algorithm achieves a higher recall rate while maintaining a high accuracy rate.
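
As a rough illustration of the fingerprint idea, the sketch below compares two already-purified page texts. The abstract does not specify the fingerprint scheme, so everything here is an assumption made for illustration: the word-shingle size `k`, the use of MD5 as the hash, Jaccard overlap as the similarity measure, the threshold of 0.8, and the function names `shingles`, `fingerprint`, `similarity` and `is_replica`. It is not the algorithm proposed in the thesis, and it ignores the structural features the thesis also uses.

```python
import hashlib
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) of text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=4):
    """Hash every shingle to a 64-bit integer; the set of hashes is the fingerprint.

    MD5 is used only as a convenient, stable hash (an assumption of this
    sketch), not for any security purpose.
    """
    return {
        int(hashlib.md5(s.encode("utf-8")).hexdigest()[:16], 16)
        for s in shingles(text, k)
    }


def similarity(fp_a, fp_b):
    """Jaccard similarity of two fingerprints, in the range [0.0, 1.0]."""
    if not fp_a and not fp_b:
        return 1.0  # two empty pages are treated as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def is_replica(text_a, text_b, threshold=0.8):
    """Flag two purified page texts as replicas if their fingerprints overlap enough."""
    return similarity(fingerprint(text_a), fingerprint(text_b)) >= threshold
```

Raising the hypothetical `threshold` would trade recall for accuracy, which is the balance the experiments in the abstract refer to. Running purification first matters because shared navigation bars and footers would otherwise inflate the overlap between unrelated pages from the same site.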
Keywords/Search Tags: Web page noise, Web page purification, Web page replicas, Replicas detection