Preprocessing Of Web Pages

Posted on: 2009-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Wan
Full Text: PDF
GTID: 2178360242981288
Subject: Computer software and theory
Abstract/Summary:
Preprocessing of web pages is one of the important steps in a search engine's workflow: it processes the web pages fetched by the crawler and then feeds the processed pages to the indexer. The quality of preprocessing heavily influences the quality of indexing and the user experience.

Web page preprocessing includes web page normalization, near-duplicate detection, and noise removal.

Web page normalization includes automatic encoding detection, encoding unification, and other conversions; a sketch of encoding detection appears below.

Nowadays, there are three ways to detect duplicate web pages:
1) Content-based methods: pages are compared based on their content.
2) Link-based methods: pages are compared based on their in-links.
3) Anchor-based methods: pages are compared based on their anchor text. Anchor text summarizes the targeted page and provides clear category information.

There are two kinds of duplicates:
1) Full duplicates: two pages are completely identical, such as mirror pages.
2) Near-duplicates: two pages' bodies are the same, but their templates and fonts differ.

This paper discusses only content-based near-duplicate detection, which is the dominant approach in current use.

The development of near-duplicate detection is as follows. The brute-force method needs O(t²d²) time, where t is the number of tokens and d is the number of documents. In 1997, Andrei Z. Broder et al. proposed the shingling method, which needs only O(s²d²), where s is the number of shingles; in the same year, Broder proposed another method that needs O(d log d). In 2002, Abdur Chowdhury et al. proposed the I-Match algorithm, which needs O(d log d). In 2003, Dennis Fetterly et al. proposed another shingle-based method, which needs O(n) and is the fastest algorithm. In 2006, Monika Henzinger proposed a method with better precision, which R. J. Bayardo et al. improved in 2007. Sketches of shingling and I-Match appear below.

Web page noise is removed in three ways:
1) Structure-based methods: a DOM tree or one of its variants represents the page, and useful information is extracted with heuristic rules.
2) Template-based methods: templates are extracted from a set of pages, and useful information is extracted against these templates.
3) Vision-based methods: useful information is extracted from the page's visual features, such as font and location.

In the absence of a topic, traditional noise removal algorithms judge which content blocks are noise using heuristic rules based on structure, vision, etc. But in a focused-crawling environment, where a clear topic is present, we can achieve higher precision in a different way. I propose a noise removal algorithm based on the focused topic: after constructing a variant of the page's DOM (Document Object Model) tree, the content block tree, noise segments are judged by a trained classifier, and this paper proposes a formula to decide whether a node is noise (a sketch of the pipeline's shape appears below). Experimental results demonstrate that the precision of our method is 87%, much better than the previous method's precision of 42%.
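As an illustration of the normalization step, the following is a minimal sketch of automatic encoding detection followed by conversion to a uniform encoding (UTF-8). It assumes the third-party chardet package; the thesis does not specify which detector is used.

```python
# Minimal sketch of web page normalization: detect the source encoding and
# re-encode the raw bytes as UTF-8. Assumes the third-party `chardet` package;
# the thesis does not name a particular detector.
import chardet

def normalize_encoding(raw: bytes) -> str:
    """Detect the encoding of a fetched page and return its text, decoded."""
    guess = chardet.detect(raw)              # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
    encoding = guess["encoding"] or "utf-8"  # fall back to UTF-8 when detection fails
    return raw.decode(encoding, errors="replace")

# Example: a page encoded in GB2312 (common for Chinese pages around 2009)
page_bytes = "网页预处理".encode("gb2312")
print(normalize_encoding(page_bytes))
```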
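To make the shingling idea concrete, here is a minimal sketch of Broder-style w-shingling with Jaccard resemblance. The shingle width w = 4 and whitespace tokenization are illustrative choices, not the parameters of the cited papers.

```python
# Minimal sketch of Broder-style shingling: represent each document by its set
# of contiguous w-token shingles and compare documents by Jaccard resemblance.
# w = 4 is an illustrative choice; the cited papers tune this parameter.

def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """Return the set of contiguous w-token shingles of `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "preprocessing of web pages is one of the important steps"
doc2 = "preprocessing of web pages is one of the key steps"
print(f"{resemblance(doc1, doc2):.2f}")  # high resemblance -> near-duplicates
```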
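For contrast with shingling, the sketch below shows the I-Match idea: each document is reduced to a single fingerprint over an idf-filtered vocabulary, so duplicates can be grouped by fingerprint in O(d log d) (or O(d) with a hash map). The idf band below assumes idf values normalized to [0, 1] and is an illustrative choice, not the paper's setting.

```python
# Minimal sketch of the I-Match idea: fingerprint each document by hashing its
# set of "mid-idf" terms; documents with equal fingerprints are duplicates.
# Assumes idf values normalized to [0, 1]; the band (0.2, 0.8) is illustrative.
import hashlib

def imatch_fingerprint(doc: str, idf: dict[str, float],
                       lo: float = 0.2, hi: float = 0.8) -> str:
    """SHA-1 over the sorted unique terms whose idf falls inside [lo, hi]."""
    terms = {t for t in doc.lower().split() if lo <= idf.get(t, 0.0) <= hi}
    return hashlib.sha1(" ".join(sorted(terms)).encode("utf-8")).hexdigest()

def find_duplicates(docs: list[str], idf: dict[str, float]) -> dict[str, list[int]]:
    """Group document indices by fingerprint; groups of size > 1 are duplicates."""
    groups: dict[str, list[int]] = {}
    for i, d in enumerate(docs):
        groups.setdefault(imatch_fingerprint(d, idf), []).append(i)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}
```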
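Finally, the sketch below illustrates only the shape of the proposed topic-based noise removal: walk a content block tree and prune blocks that a decision rule judges as noise. The ContentBlock structure, the topic_overlap feature, and the fixed threshold are hypothetical placeholders; the thesis's actual content block tree, trained classifier, and scoring formula are not reproduced here.

```python
# Illustrative sketch of topic-aware noise removal on a content block tree.
# The feature and the fixed-threshold rule below are hypothetical placeholders;
# the thesis uses a trained classifier and a dedicated noise-scoring formula.
from dataclasses import dataclass, field

@dataclass
class ContentBlock:
    text: str
    children: list["ContentBlock"] = field(default_factory=list)

def topic_overlap(text: str, topic_terms: set[str]) -> float:
    """Fraction of the block's tokens that belong to the focused topic."""
    tokens = text.lower().split()
    return sum(t in topic_terms for t in tokens) / len(tokens) if tokens else 0.0

def remove_noise(block: ContentBlock, topic_terms: set[str],
                 threshold: float = 0.1) -> ContentBlock | None:
    """Prune leaf blocks whose topic overlap falls below the threshold.
    A real system would apply the trained classifier here instead."""
    block.children = [c for c in (remove_noise(c, topic_terms, threshold)
                                  for c in block.children) if c]
    if not block.children and topic_overlap(block.text, topic_terms) < threshold:
        return None  # judged noise: drop the block
    return block
```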
Keywords/Search Tags: Preprocessing