Extraction Algorithm, Based On Visual Features Of The Web Page

Posted on: 2007-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Wang
Full Text: PDF
GTID: 2208360212455799
Subject: Computer technology
Abstract/Summary:
Today the Web has become the largest source of information for people. Most of the information on a Web page is useful, but some of it is noise that interferes with reading the page and retrieving information from it.

Information Extraction (IE) technology was developed to recover the actual information. Meanwhile, information search technology was developed to help people find useful data in a volume of information too large to read in full. Both technologies, however, depend on analyzing or indexing the correct content. Most current Web pages contain many ad banners, navigation links, contact details, and the like, which reduce the performance and validity of both technologies.

To retrieve the correct information from a Web page, this thesis proposes a new algorithm, the Vision-based Web Page Information Extraction Algorithm (VWPIEA). After analyzing HTML tags, the DOM tree, and how people read Web pages, we conclude that HTML tags fall into two kinds: block nodes and inline nodes. A series of steps (filtering invalid HTML tags, vision-based collapsing and filtering, and parameter filtering) then exposes the real content of the page. In addition, a template concept is introduced: a user can design a template by hand and apply it to match certain kinds of Web pages, or embed it in an application to retrieve several content blocks, which makes the algorithm more flexible. In a set of tests we obtained satisfactory results: nearly 100% correct, with good performance. Apart from template matching, the algorithm is fully automatic.

This thesis is composed of seven chapters. Chapter one introduces the problems of current information extraction and content searching; chapter two describes the current state of Web page analysis technology; chapter three presents the model of VWPIEA in mathematical terms; chapter four describes the process of VWPIEA and how it works. To...
Keywords/Search Tags: information extraction, VWPIEA, Vision-based Web Page Information Extraction Algorithm, virtual text node, DOM
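
The pipeline named in the abstract (filter invalid tags, separate block nodes from inline nodes, collapse, then filter by parameters) can be illustrated with a rough Python sketch. This is not the thesis's VWPIEA: the abstract does not give the tag lists, the collapsing rules, or the parameters, so the sets BLOCK_TAGS and INVALID_TAGS, the thresholds min_chars and max_link_ratio, and the names BlockCollector and extract_content are all assumptions made for illustration. Link density stands in here for the unspecified "parameter filtering", and no visual features are computed.

```python
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list the thesis's actual categories.
BLOCK_TAGS = {"body", "div", "p", "table", "tr", "td", "ul", "ol", "li",
              "h1", "h2", "h3"}
INVALID_TAGS = {"script", "style", "noscript", "iframe"}


class BlockCollector(HTMLParser):
    """Collapse inline text into the enclosing block node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside an invalid tag
        self._link_depth = 0   # > 0 while inside <a>

    def flush(self):
        # Close the current block and normalize its whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in INVALID_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()       # a block node starts a new block
        elif tag == "a":
            self._link_depth += 1
        # any other tag is treated as inline and stays in the current block

    def handle_endtag(self, tag):
        if tag in INVALID_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return             # drop text inside invalid tags
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_content(html, min_chars=40, max_link_ratio=0.5):
    """Return blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [text for text, link_chars in parser.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio]


sample = """
<html><body>
<script>var x = 1;</script>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></div>
<div><p>The Web has become a very large information source, and <b>most</b> pages mix
real content with navigation links and banners.</p></div>
<div>Contact us</div>
</body></html>
"""

for block in extract_content(sample):
    print(block)
```

On this sample, the navigation block fails both the length and the link-ratio checks and the contact block fails the length check, so only the paragraph is kept. Real pages would likely need tuned thresholds and a fuller block-tag list.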