
Analysis Of Deep Web Page's Structure And Its Rich-Content Extraction

Posted on: 2012-02-23 | Degree: Master | Type: Thesis
Country: China | Candidate: L Tang | Full Text: PDF
GTID: 2218330362454321 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development and widespread use of the Internet, network resources are increasing dramatically, especially Deep Web resources, which traditional search engines cannot reach. The Deep Web has therefore become a hot topic in Web data management. Current research on Deep Web data extraction focuses only on extracting data records and data items, and pays little attention to the overall structure of Deep Web pages or to locating the area of a page that contains the data records. Both aspects are important. This thesis therefore focuses on extracting the content structure of Deep Web pages and the rich-content area of the result page returned by a Deep Web query. This work is useful for the semantic Deep Web, the extraction of Deep Web data records and data items, Web information retrieval, text processing, and related tasks. The main contents of this thesis are as follows:

① Tag and vision features of Deep Web pages. An analysis of numerous Deep Web pages shows that they have characteristic tag and vision features. Based on these features, a novel approach that combines tag information and vision information is proposed to analyze Deep Web pages from both the subjective and objective aspects. The tag and vision information of a Deep Web page are represented by a Tag-Tree and a Visual-Attribute Tree, respectively. The experimental results show that this method performs better than methods that rely on only one kind of information.

② Extracting the content structure of Deep Web pages. The content structure of a Deep Web page is represented by a tree called the Visual-Block-Tree: the root block represents the whole page, each block corresponds to a rectangular region of the Web page, and leaf blocks are blocks that cannot be segmented further. Extracting the content structure takes two phases: noise filtering and visual block clustering. The TVS algorithm, a similarity algorithm that performs well, is proposed for the visual block clustering phase.

③ Extracting the rich-content area of Deep Web pages. The blocks that differ between the Visual-Block-Trees of the query page and its result page are obtained with the TVS algorithm, and the rich-content area is extracted from these blocks. Finally, the experimental results demonstrate the feasibility and effectiveness of this approach.
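
The tree comparison described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the TVS algorithm, so the `similarity` function below is a hypothetical stand-in that mixes tag equality with rectangle overlap. The class name `VisualBlock`, the 0.5/0.5 weights, the 0.8 threshold, and the toy page layouts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class VisualBlock:
    """One node of a Visual-Block-Tree: a tag plus a rectangular page region."""
    tag: str
    rect: Rect
    children: List["VisualBlock"] = field(default_factory=list)


def leaves(block: VisualBlock) -> List[VisualBlock]:
    """Collect the leaf blocks (blocks with no children) in document order."""
    if not block.children:
        return [block]
    out: List[VisualBlock] = []
    for child in block.children:
        out.extend(leaves(child))
    return out


def overlap_ratio(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def similarity(a: VisualBlock, b: VisualBlock) -> float:
    """Hypothetical stand-in for TVS: equal weight on tag match and overlap."""
    tag_score = 1.0 if a.tag == b.tag else 0.0
    return 0.5 * tag_score + 0.5 * overlap_ratio(a.rect, b.rect)


def differing_blocks(query_root: VisualBlock, result_root: VisualBlock,
                     threshold: float = 0.8) -> List[VisualBlock]:
    """Leaf blocks of the result page with no similar leaf in the query page."""
    query_leaves = leaves(query_root)
    return [b for b in leaves(result_root)
            if all(similarity(b, q) < threshold for q in query_leaves)]


def bounding_rect(blocks: List[VisualBlock]) -> Rect:
    """Smallest rectangle that covers all the given blocks."""
    x0 = min(b.rect[0] for b in blocks)
    y0 = min(b.rect[1] for b in blocks)
    x1 = max(b.rect[0] + b.rect[2] for b in blocks)
    y1 = max(b.rect[1] + b.rect[3] for b in blocks)
    return (x0, y0, x1 - x0, y1 - y0)


# Toy query page: a header and a search form.
query_page = VisualBlock("body", (0, 0, 800, 300), [
    VisualBlock("div", (0, 0, 800, 100)),
    VisualBlock("form", (0, 100, 800, 200)),
])

# Toy result page: the same header and form, plus two data records.
result_page = VisualBlock("body", (0, 0, 800, 500), [
    VisualBlock("div", (0, 0, 800, 100)),
    VisualBlock("form", (0, 100, 800, 200)),
    VisualBlock("table", (0, 300, 800, 200), [
        VisualBlock("tr", (0, 300, 800, 100)),
        VisualBlock("tr", (0, 400, 800, 100)),
    ]),
])

diff = differing_blocks(query_page, result_page)
rich_area = bounding_rect(diff)
print([b.tag for b in diff], rich_area)
```

In this toy layout the header and form appear in both pages and are filtered out, so only the two record rows remain, and their bounding rectangle serves as the candidate rich-content area. A real system would build the trees from rendered pages and apply noise filtering first, as the abstract describes.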
Keywords/Search Tags:Deep Web page, content structure, rich-content area, Visual-Block-Tree, TVS algorithm