
Research On Global Schema Construction In Web Data Integration

Posted on: 2012-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: X X Xu
Full Text: PDF
GTID: 2218330338962750
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computers and the Internet, popular entities have more and more instances on the Web, which has become a huge source of widely distributed data. As industry demand for information grows, integrating this large volume of heterogeneous Web data is difficult, and Web data integration has been proposed to address the problem. A Web data integration system uses data extraction and data fusion to generate data with a unified structure and clear meaning, which can support users in intelligence analysis and business decisions.

In a Web data integration system, a Web data object in a Web page is called a Web entity instance. Web entity instances from different data sources differ considerably in data schema: on one hand, for the same Web entity, different instances often contain different attributes; on the other, for the same attribute, different instances often use different labels. Moreover, because Web entities are dynamic, new Web entity instances containing new attributes and labels keep appearing on the Web. Integrating data with so many schema differences is therefore difficult. A Web data integration system needs a global schema for all Web entity instances to eliminate the differences between them and provide a uniform, normative schema.

This paper studies methods for constructing a global schema for Web entities in a Web data integration system. Its contributions are as follows:

(1) Based on the features of Web entity instances in Web pages and the global schema information in the Web data integration system, we propose a novel SVM-based approach to identify the main data region of a Web page. This approach can effectively identify the main data region in both structured and unstructured pages. The result provides the necessary support for Web entity attribute extraction.

(2) Based on the features of Web entity attributes in Web pages and the global schema information in the Web data integration system, we propose a novel AdaBoost-based approach to extract Web entity attributes from the main data region of a Web page. The result provides the Web entity schema and attribute information needed to construct a global schema for Web entities.

(3) Based on the dynamic nature of Web entities, we propose a novel SVM-based approach to construct a global schema for Web entities. This approach can effectively establish the mapping between a Web entity schema and the global schema, and enrich the global schema based on that mapping. When new Web entity instances containing new attribute labels appear on the Web, this approach can enrich the global schema in a timely manner, providing a complete and effective global schema for the other components of the Web data integration system.

(4) We also study the reciprocity between the Web entity global schema on one side and Web page main data region identification and Web entity attribute extraction on the other: on one hand, main data region identification and attribute extraction provide more accurate data support for the global schema; on the other, an increasingly rich global schema improves the accuracy of main data region identification and attribute extraction. The experiments in this paper validate this reciprocity. In addition, the Web entity global schema construction prototype system that we designed and implemented validates our research results from a practical point of view.
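The map-then-enrich idea in contribution (3) can be illustrated with a minimal sketch. This is not the method of the paper: the SVM classifier is replaced here by plain string similarity from Python's standard library, purely for illustration. The function names, the dictionary layout of the global schema, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_and_enrich(global_schema, local_labels, threshold=0.8):
    """Map local attribute labels to global attributes, enriching the schema.

    global_schema: dict from a global attribute name to the set of labels
    already known for it (hypothetical layout, assumed for this sketch).
    Returns a dict from each local label to its global attribute.
    """
    mapping = {}
    for label in local_labels:
        best_attr, best_score = None, 0.0
        for attr, known in global_schema.items():
            # Compare against the attribute name and every known label.
            score = max(similarity(label, k) for k in known | {attr})
            if score > best_score:
                best_attr, best_score = attr, score
        if best_attr is not None and best_score >= threshold:
            # Known attribute: record the new label variant.
            global_schema[best_attr].add(label)
        else:
            # Unknown attribute: enrich the global schema with it.
            best_attr = label
            global_schema[label] = {label}
        mapping[label] = best_attr
    return mapping


schema = {"price": {"price", "cost"}, "title": {"title", "name"}}
result = map_and_enrich(schema, ["Price", "Titles", "publisher"])
```

In this sketch a label close enough to a known one is attached to the existing global attribute, and any other label becomes a new global attribute, so the schema grows as new instances arrive. A learned classifier such as the SVM named in the abstract would replace the similarity threshold.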
Keywords/Search Tags: Web data integration, Web entity, attribute label, local schema, global schema, main data region