Analysis Of Deep Web Page's Structure And Its Rich-Content Extraction

Posted on:2012-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:L Tang

Full Text:PDF

GTID:2218330362454321

Subject:Computer software and theory

Abstract/Summary:

With the rapid development and widespread use of the Internet, network resources are increasing dramatically, especially Deep Web resources, which traditional search engines cannot reach. The study of the Deep Web has therefore become a hot topic in Web data management. Current research on Deep Web data extraction is concerned only with extracting data records and data items; it pays little attention to the overall structure of Deep Web pages or to locating the area of a page that contains the data records. Both aspects are important. This thesis therefore focuses on extracting the content structure of Deep Web pages and the rich-content area of the result page returned by a Deep Web query. This work is useful for the semantic Deep Web, for the extraction of Deep Web data records and data items, for Web information retrieval, for text processing, and so on. The main contents of this thesis are as follows:

① Tag and vision features of Deep Web pages. An analysis of numerous Deep Web pages shows that they share tag features and vision features. Based on these features, a novel approach that combines tag information with vision information is proposed to analyze Deep Web pages from both the subjective and the objective aspects. The tag and vision information of a Deep Web page are represented by a Tag-Tree and a Visual-Attribute Tree, respectively. The experimental results show that this method outperforms methods that rely on only one kind of information.

② Extracting the content structure of Deep Web pages. The content structure of a Deep Web page is represented by a tree called the Visual-Block-Tree. Its root block represents the whole page, each block corresponds to a rectangular region of the Web page, and leaf blocks are blocks that cannot be segmented further. Extracting the content structure takes two phases: the noise filtering phase and the visual block clustering phase. The TVS algorithm, a similarity algorithm that performs well, is proposed for the visual block clustering phase.

③ Extracting the rich-content area of Deep Web pages. The blocks that differ between the Visual-Block-Tree of the query page and that of its result page are obtained with the TVS algorithm, and the rich-content area is extracted from these blocks. Finally, the experimental results demonstrate the feasibility and effectiveness of this approach.
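
The idea of comparing the block tree of a query page with that of its result page can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the TVS algorithm, so the `similarity` function below is a stand-in that simply averages tag equality and rectangle overlap. The `VisualBlock` class, the 0.8 threshold, the "largest differing block" heuristic, and the sample page layouts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height); assumed layout


@dataclass
class VisualBlock:
    """Hypothetical node of a block tree: a tag plus a page rectangle."""
    tag: str
    rect: Rect
    children: List["VisualBlock"] = field(default_factory=list)

    def leaves(self) -> List["VisualBlock"]:
        """Return the leaf blocks (blocks with no children), left to right."""
        if not self.children:
            return [self]
        found: List["VisualBlock"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def area(rect: Rect) -> int:
    return rect[2] * rect[3]


def overlap_ratio(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def similarity(a: VisualBlock, b: VisualBlock) -> float:
    """Stand-in score (NOT the TVS algorithm): mean of tag match and overlap."""
    tag_score = 1.0 if a.tag == b.tag else 0.0
    return 0.5 * tag_score + 0.5 * overlap_ratio(a.rect, b.rect)


def differing_blocks(query_root: VisualBlock, result_root: VisualBlock,
                     threshold: float = 0.8) -> List[VisualBlock]:
    """Leaf blocks of the result page with no similar leaf in the query page."""
    query_leaves = query_root.leaves()
    return [block for block in result_root.leaves()
            if all(similarity(block, q) < threshold for q in query_leaves)]


def rich_content_candidate(query_root: VisualBlock, result_root: VisualBlock,
                           threshold: float = 0.8) -> Optional[VisualBlock]:
    """Assumed heuristic: pick the largest differing block, or None."""
    diff = differing_blocks(query_root, result_root, threshold)
    return max(diff, key=lambda block: area(block.rect), default=None)


# Made-up layouts: a query page with a search form, and its result page.
query_page = VisualBlock("body", (0, 0, 800, 450), [
    VisualBlock("div", (0, 0, 800, 100)),     # header
    VisualBlock("form", (0, 100, 800, 300)),  # search form
    VisualBlock("div", (0, 400, 800, 50)),    # footer
])
result_page = VisualBlock("body", (0, 0, 800, 1050), [
    VisualBlock("div", (0, 0, 800, 100)),      # same header
    VisualBlock("table", (0, 100, 800, 900)),  # data records
    VisualBlock("div", (0, 1000, 800, 50)),    # footer, pushed down
])

candidate = rich_content_candidate(query_page, result_page)
print(candidate.tag if candidate else None)
```

Because this stand-in score depends on position, the footer that the result list pushes down is also reported as a differing block; the sketch tolerates this only because it keeps the largest such block. A real similarity measure would need to handle shifted but otherwise unchanged blocks.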

Keywords/Search Tags:

Deep Web page, content structure, rich-content area, Visual-Block-Tree, TVS algorithm

Related items

1	Research On WEB Page Structure And Data Extraction Technology
2	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
3	A Framework Of Web Page Analysis And Content Extraction Based On Coordinate Tree
4	Research On Content Extraction In HTML Web Pages Based Multi-Features
5	The Research And Implementation On Content Extraction In Web Pages Based Page Segmentation
6	The Design Of Visual Website Content Management System
7	Study On Web Content Extraction And Semantic Recognition
8	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM
9	Web Page Metadata Extraction Method Based On Visual Block Recognition
10	The Designation And Implementation Of Business Insight System Base On Web Content