Extraction Algorithm, Based On Visual Features Of The Web Page

Posted on:2007-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y P Wang

Full Text:PDF

GTID:2208360212455799

Subject:Computer technology

Abstract/Summary:

Today the Web has become the largest information source for people. Most of the information on a Web page is useful, but some of it is noise, and that noise interferes with reading the page and retrieving information from it.

To recover the real information, Information Extraction (IE) technology has been developed. Meanwhile, information search technology has been developed to help people find useful data in a volume of information too large to read in full. Both technologies share a precondition: they must analyze or index the correct content. However, most current Web pages contain many ad banners, navigation links, contact details and similar elements, which reduce the performance and validity of both technologies.

To retrieve the correct information from a Web page, this thesis proposes a new algorithm, the Vision-based Web Page Information Extraction Algorithm (VWPIEA). After analyzing HTML tags, the DOM tree and the way people read Web pages, we conclude that HTML tags can be divided into two kinds: block nodes and inline nodes. After a series of steps (filtering invalid HTML tags, vision-based collapsing and filtering, and parameter filtering), the real content of the Web page is exposed.

In addition, a template concept is introduced. A user can design a template manually and apply it to match certain kinds of Web pages, or embed it in the user's own application to retrieve several content blocks. This makes the algorithm more flexible. A set of tests gave satisfactory results: nearly 100% correctness and good performance. Apart from template matching, the algorithm is fully automatic.

This thesis consists of seven chapters. Chapter one introduces the problems of current information extraction and content search; chapter two describes the current state of Web page analysis technology; chapter three explains the VWPIEA model in mathematical terms; chapter four describes the VWPIEA process and how it works. To...
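The pipeline named in the abstract (split tags into block and inline nodes, drop invalid tags, then filter blocks by parameters) can be illustrated with a minimal sketch. This is not the VWPIEA implementation: the abstract does not give the tag sets, the vision-based collapsing rules or the parameters. The sets `INVALID_TAGS` and `BLOCK_TAGS`, the class `BlockCollector`, the function `extract_content` and the thresholds `min_chars` and `max_link_density` are all assumptions made for illustration, and the visual collapsing step is replaced here by a simple rule that folds inline nodes into their enclosing block.

```python
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list the ones the thesis uses.
INVALID_TAGS = {"script", "style", "noscript", "iframe"}
BLOCK_TAGS = {"body", "div", "p", "table", "tr", "td", "ul", "ol", "li",
              "h1", "h2", "h3", "section", "article"}


class BlockCollector(HTMLParser):
    """Collects text per block node; inline nodes are folded into their block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (text, link_chars) pairs
        self._pieces = []     # raw text pieces of the block being built
        self._link_chars = 0  # characters of that text that sit inside <a>
        self._skip_depth = 0  # > 0 while inside an invalid tag
        self._link_depth = 0  # > 0 while inside <a>

    def flush(self):
        # Close the current block: join pieces and normalize whitespace.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in INVALID_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()  # a block boundary ends the previous block
        elif tag == "a":
            self._link_depth += 1
        # every other tag is treated as an inline node and ignored

    def handle_endtag(self, tag):
        if tag in INVALID_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return  # text inside invalid tags is discarded
        self._pieces.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_content(html, min_chars=40, max_link_density=0.5):
    """Return the text blocks that pass the (assumed) parameter filters."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any text after the last block boundary
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) < min_chars:
            continue  # too short to be main content
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text, likely navigation
        kept.append(text)
    return kept
```

Calling `extract_content(page_source)` on a page with a navigation bar, a script and one article paragraph would be expected to keep only the paragraph. The thresholds would need tuning per site, which is roughly the role the abstract gives to templates.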

Keywords/Search Tags:

information extraction, VWPIEA, Vision-based Web Page Information Extraction Algorithm, virtual text node, DOM

Related items

1	Research On Vision-based Web Page Information Extraction Technology
2	Research On Specialty Knowledge Retrieval Method Based On Web Information Extraction
3	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
4	Research On Multi-page Special Web Page Text Extraction And Merging Technology
5	Visual Web Page Information Extraction And Text Feature Word Extraction Technology Research
6	Design And Implementation Of Web Information Extraction Rules
7	Research On Internet Public Opinion Information Extraction And Classification
8	Research On Web Article Automatic Extraction Method Based On Page Segmentation
9	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website
10	Research On Web Page Classification And Information Collection