Research On Data Extraction And Schema Recognition On Deep Web

Posted on: 2009-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
GTID: 2178360308479280
Subject: Computer software and theory
Abstract/Summary:
Deep Web refers to data sources that are stored in databases and cannot be reached through hyperlinks, only through dynamically generated web pages. As the number of Web databases grows, querying the Deep Web is gradually becoming a main way to acquire information, which makes automatically acquiring Deep Web data sources for large-scale integration all the more important.

Existing data extraction approaches for the Deep Web focus on the data rather than its structure and ignore the result schema. Many methods also cannot process complex data. In addition, Deep Web data sources tend to change, for example in page structure and result schema, which reduces the accuracy of the original extraction methods. These issues make data extraction considerably harder and are the main subject of this thesis.

To resolve these problems, this thesis proposes a complete and effective method that supports both data extraction and schema recognition. The content includes the following. To extract data, a novel clustering-based algorithm is adopted, which combines DOM structure information with visual information and remains effective when faced with complex data and noise. In addition, a labeling-based schema recognition method is proposed, which uses two-phase label assignment and LCS-based label matching to increase precision. Finally, a simple extraction rule model is defined to reduce the time cost of data extraction and to address the maintenance problem.

Experiments show that the proposed methods perform well on precision and recall. They solve the data extraction and schema recognition problem and also provide theoretical support for data integration on the Deep Web.
Keywords/Search Tags: Deep Web, Page Parsing, Data Extraction, Schema Recognition, Wrapper
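
The abstract mentions label matching based on LCS but does not give details. The following is only a minimal sketch of what such a matching step might look like, assuming that LCS means longest common subsequence over characters. The function names, the normalized similarity score, and the 0.6 threshold are hypothetical choices for illustration and are not taken from the thesis.

```python
from typing import Iterable, Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 2 * LCS / (len(a) + len(b)), case-insensitive."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def best_label(candidate: str, known_labels: Iterable[str],
               threshold: float = 0.6) -> Optional[str]:
    """Return the known label most similar to candidate, or None if below threshold.

    The threshold value is an assumption made for this sketch.
    """
    best = max(known_labels, key=lambda k: lcs_similarity(candidate, k), default=None)
    if best is None or lcs_similarity(candidate, best) < threshold:
        return None
    return best


if __name__ == "__main__":
    schema_labels = ["Title", "Publication Date", "Price"]
    print(best_label("Pub. Date", schema_labels))
```

Under these assumptions, an abbreviated label found on a result page, such as "Pub. Date", is mapped to the schema attribute whose name shares the longest character subsequence with it, and labels with no sufficiently similar attribute are left unmatched.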