
Research On Wrapper Adaptation In Web Data Integration

Posted on: 2012-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: W Luo
Full Text: PDF
GTID: 2218330338961604
Subject: Computer software and theory
Abstract/Summary:
There is a tremendous amount of information available online, and the Web has become a huge communication platform. Much of the information embedded in Web pages, however, is formatted to be read easily by human users rather than by computer applications. Extracting information from semi-structured Web pages requires a program called a wrapper, which is generated from the structural information of the pages. Wrapper maintenance is important because Web sources often change in ways that prevent wrappers from extracting data correctly. Wrapper maintenance is the task of repairing a broken wrapper, and we divide it into two subproblems: wrapper verification and wrapper reinduction. Wrapper verification determines whether a wrapper correctly processes a given page. When a Web source changes its format, the wrapper can no longer extract the correct data; the verification system detects this and either notifies the operator or automatically launches a wrapper-repair process. The reinduction algorithm then repairs the extraction rules so that the wrapper works on the changed pages. The data extracted by a wrapper during normal operation contains a great deal of information, and we propose a method that makes good use of this information to solve the wrapper maintenance problem. When the source changes, a novel algorithm based on a series of heuristics uses these data to generate a new set of labeled examples, from which a new wrapper is then generated using induction techniques. Experiments on real Web sites show that the proposed approach can effectively maintain wrappers so that they continue to extract the desired data accurately.
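The verification and reinduction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not list its heuristics, so the sketch assumes a single one (exact string matching of previously extracted values against the changed page) and swaps in a simple left/right delimiter inducer for the induction step. All function names (`extract`, `verify`, `relabel`, `reinduce`), the `min_overlap` threshold, and the `context` window are hypothetical choices made for illustration.

```python
import os.path


def extract(page, left, right):
    """Apply a left/right delimiter wrapper: return each string between them."""
    if not left or not right:
        raise ValueError("delimiters must be non-empty")
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        # Resume at the right delimiter (not after it), because the right
        # delimiter may overlap the next item's left delimiter.
        pos = end
    return values


def verify(page, wrapper, known_values, min_overlap=0.5):
    """Assumed heuristic: the wrapper is healthy if enough of what it
    extracts was already seen in earlier, trusted extractions."""
    extracted = extract(page, *wrapper)
    if not extracted:
        return False
    hits = sum(1 for value in extracted if value in known_values)
    return hits / len(extracted) >= min_overlap


def relabel(page, known_values):
    """Locate previously extracted values in the changed page and return
    their (start, end) spans as new labeled examples."""
    spans = []
    for value in known_values:
        index = page.find(value)
        if index != -1:
            spans.append((index, index + len(value)))
    return sorted(spans)


def reinduce(page, spans, context=20):
    """Induce new delimiters: the longest common suffix of the text before
    the examples and the longest common prefix of the text after them."""
    if not spans:
        return None
    lefts = [page[max(0, start - context):start] for start, _ in spans]
    rights = [page[end:end + context] for _, end in spans]
    left = os.path.commonprefix([text[::-1] for text in lefts])[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        return None
    return (left, right)


# Illustrative pages (made up for this sketch).
old_page = "<ul><b>Alice</b><b>Bob</b><b>Carol</b></ul>"
new_page = ('<ul><li class="n">Alice</li><li class="n">Dave</li>'
            '<li class="n">Carol</li></ul>')

old_wrapper = ("<b>", "</b>")
known = extract(old_page, *old_wrapper)       # data from normal operation

if not verify(new_page, old_wrapper, known):  # the format change is detected
    new_wrapper = reinduce(new_page, relabel(new_page, known))
    print(new_wrapper)
    print(extract(new_page, *new_wrapper))
```

In this toy case only two of the three known values survive the page change, yet they are enough to induce delimiters that also recover the new value. A realistic system would need fuzzier matching and several pages per source, since exact matching fails when values themselves change.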
Keywords/Search Tags: Web data integration, Web data extraction, Wrapper, Wrapper maintenance