Research On Web Data Extraction For Web Data Integration

Posted on: 2011-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Ding
Full Text: PDF
GTID: 1118360305450915
Subject: Computer software and theory
Abstract/Summary:
The explosive growth and popularity of the World Wide Web have produced a huge number of information sources on the Internet. Effectively accessing and integrating Web data for further analysis and mining is of great significance, and with Web-scale heterogeneous data now available, Web data integration has become a research hot spot. Effectively extracting the semi-structured data contained in Web pages is a key issue in Web data integration and the foundation of any Web data integration system, so accurate extraction of semi-structured data from Web pages has become a hot topic of current research. However, Web data are massive, heterogeneous, autonomous, and rich in relationships, so the following problems still need to be resolved. (1) The schema of Web data changes often; how to build the Web entity model effectively, so that it can guide Web data extraction and integration, is one problem. (2) How to extract the target data accurately and understand them semantically is another. (3) How to effectively establish relationships between a newly discovered Web entity and existing Web entities is a third.

This dissertation targets Web data integration and focuses on the above problems. Its innovative work mainly includes the following aspects:

(1) Because the schemas of Web data change often, an approach based on conditional random fields is proposed for enriching the schema of a Web entity, so that the schema can be built dynamically.

Most existing approaches build the schema of a Web entity once and for all and cannot enrich it dynamically. This dissertation proposes an approach that builds the schema of a Web entity dynamically.
The approach makes full use of the data accumulated in the Web data integration system to identify new labels that appear in target Web pages. The schema of the Web entity is then enriched dynamically with the newly discovered labels. Experimental results show that the proposed approach builds the schema of a Web entity effectively. At the same time, the mapping between the schema of the Web entity and the attribute labels in target pages is also obtained, which improves the efficiency of Web data integration.

(2) Because a large amount of data accumulates in a Web data integration system, a Web data extraction approach based on ensemble learning is proposed to improve the extraction accuracy of target data.

Most existing approaches can use only the characteristics of Web pages to identify data elements and attribute labels. When the structure of the target Web page is complex, however, the quality of the training examples decreases, and with it the extraction accuracy of the corresponding wrapper. This dissertation proposes a Web data extraction approach based on ensemble learning. The approach uses both the characteristics of sampled pages and the potential characteristics of the data accumulated in the Web data integration system to identify the data elements and attribute labels in Web pages. Once the data elements and attribute labels in the sampled pages have been discovered, training examples are obtained and the corresponding wrapper is learned from them. Finally, the target data are extracted from the target pages.
Experimental results show that the proposed approach effectively improves the extraction accuracy of target data.

(3) Because Web data elements exhibit two-dimensional sequence characteristics and correlations, an approach based on 2DCC-CRFs is proposed to improve the semantic annotation accuracy of Web objects.

Previous CRFs are limited for semantic annotation of Web objects and cannot deal efficiently with the long-distance dependencies between Web data elements. This dissertation proposes a 2DCC-CRF model that exploits both the long-distance and the short-distance dependencies between Web data elements to improve the annotation accuracy of Web objects. First, the structured information and the characteristics of records from an external database are used to detect candidate long-distance dependencies between Web data elements. Then, two types of correlative edges are generated to describe the candidate long-distance dependencies. Finally, the classic two-dimensional conditional random fields (2DCRFs) model is extended by adding the correlative edges. Experimental results show that the proposed approach significantly improves the semantic annotation accuracy of Web objects and lays a foundation for further integration of Web data.

(4) Because Web data are rich in relationships, an approach for discovering relationships between a newly discovered Web entity and existing Web entities is proposed to enrich the Web entity model.

Most existing research focuses on named entity relation extraction, and little work addresses automatically discovering relationships between Web entities. This dissertation proposes an approach based on several heuristic rules to discover the relationships between a newly discovered Web entity and existing Web entities.
First, the data accumulated in the Web data integration system are used to detect candidate relationships between the newly discovered Web entity and existing Web entities. Then, the candidate relationships are evaluated against two criteria. Finally, the candidate relationships with high confidence are returned to an expert, who makes the final decision on them, including the relationship type, comment, and so on. Experimental results show that the proposed approach effectively solves the problem of relationship discovery between Web entities; it can be used to enrich the Web entity model and lays a foundation for further integration of Web data.
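
The three-step relationship discovery procedure described in (4) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not name the two evaluation criteria or the heuristic rules, so the sketch assumes value overlap as the detection heuristic and uses two stand-in criteria, `support` (number of distinct shared values) and `coverage` (shared values divided by distinct values of the new attribute). The function names, thresholds, and sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    new_attr: str         # attribute of the newly discovered entity
    existing_entity: str  # name of an entity already in the model
    existing_attr: str    # attribute of that existing entity
    support: int          # assumed criterion 1: distinct shared values
    coverage: float       # assumed criterion 2: shared / distinct new values


def column_values(records, attr):
    """Collect the distinct non-empty values of one attribute."""
    return {r[attr] for r in records if r.get(attr)}


def discover_candidates(new_records, existing, min_support=2, min_coverage=0.5):
    """Detect and filter candidate relationships by value overlap.

    `existing` maps an entity name to its accumulated records (dicts).
    Both thresholds are illustrative, not taken from the dissertation.
    """
    new_attrs = {a for r in new_records for a in r}
    candidates = []
    for new_attr in sorted(new_attrs):
        new_vals = column_values(new_records, new_attr)
        if not new_vals:
            continue
        for entity, records in existing.items():
            for attr in sorted({a for r in records for a in r}):
                shared = new_vals & column_values(records, attr)
                support = len(shared)
                coverage = support / len(new_vals)
                if support >= min_support and coverage >= min_coverage:
                    candidates.append(
                        Candidate(new_attr, entity, attr, support, coverage)
                    )
    # Highest-confidence candidates first, ready for expert review.
    return sorted(candidates, key=lambda c: (-c.coverage, -c.support))


# Hypothetical accumulated data and a newly discovered "Paper" entity.
existing = {
    "Author": [
        {"name": "A. Smith", "affiliation": "Univ X"},
        {"name": "B. Jones", "affiliation": "Univ Y"},
        {"name": "C. Wu", "affiliation": "Univ X"},
    ]
}
new_papers = [
    {"title": "P1", "written_by": "A. Smith"},
    {"title": "P2", "written_by": "B. Jones"},
    {"title": "P3", "written_by": "D. Lee"},
]

candidates = discover_candidates(new_papers, existing)
for c in candidates:
    print(c)
```

In this sketch the only surviving candidate links `written_by` of the new entity to `Author.name`. The final step in the text, where an expert confirms the relationship and assigns its type and comment, is left to a human and is not modelled here.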
Keywords/Search Tags: Web data integration, Web data extraction, Web data annotation, wrapper, conditional random fields