Automatic Deep Web Data Extraction And Integration Using Conditional Probabilistic Graphical Models

Posted on: 2008-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J B Huang
GTID: 1118360242978290
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
A tremendous amount of structured data is hidden in the Deep Web, where it can be obtained only through dynamic web pages generated in response to queries submitted to web query interfaces. Because web pages are poorly structured and the Deep Web is both unstable and very large, integrating this data automatically and using it effectively is a challenge. Probabilistic graphical learning is an active research topic in machine learning and has been applied widely and successfully in data mining, information extraction, information retrieval, and related areas. This dissertation proposes several enhanced models of the conditional random field (CRF), a type of probabilistic undirected graphical model, together with practical approaches for solving some challenging problems in Deep Web data extraction and integration. The main contributions are as follows:

Deep Web resources are very sparsely distributed, which makes locating them especially challenging. To crawl Deep Web query interfaces quickly within relevant websites, a CRF model is trained on samples of users' navigation paths, exploiting a variety of features around the hyperlinks. A link-scoring algorithm based on reinforcement learning and the trained CRF model then assigns a priority to each hyperlink on the pages the crawler is currently visiting. The aim is to guide the crawler along optimal paths leading to target pages. Experimental results indicate that the proposed crawler clearly outperforms other form crawlers.

Two maximum-entropy classifiers are proposed to identify, automatically and accurately, the query forms of online databases and their topic categories, respectively. The classifier that distinguishes query forms uses only the structural features of web forms, while the other incorporates both the content and the structural features found in the context of web forms.
Experiments indicate that the topic classifiers achieve high accuracy.

Because the elements of a web form are not necessarily laid out linearly, a hierarchical sequential CRF (HSCRF) model is proposed to better capture dependencies across hierarchically laid-out information. Methods for model-parameter estimation and label inference in an HSCRF model are presented. Experimental results indicate that the proposed model achieves good performance on schema matching between heterogeneous web query interfaces.

An improved approach for finding data regions embedded in an HTML web page is presented. A policy that combines web page clustering with cross-page data region analysis is then proposed to identify the dynamic web data regions. Experimental results show the effectiveness of this approach. Moreover, an improved sequence labeling model, the Mixed Skip-Chain CRF, is used to integrate web records extracted from multiple sites into a relational database. The model can be trained on a mixed sample set that includes labeled samples and unlabeled database records, thereby reducing the dependence on manually labeled training data. It also provides a novel way to incorporate long-distance dependencies between different state variables. Experimental results show that the proposed model significantly improves the accuracy of attribute labeling.

A finite-state conditional random field model for edit sequences between strings is presented. Unlike generative models, this model is trained on both positive and negative instances of string pairs. Finally, a Support Vector Machine trained on selected samples is used to classify each record pair as duplicate or non-duplicate. Experimental results on a range of datasets show that the proposed approach improves duplicate detection accuracy over traditional techniques and copes well with noisy data.
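
As a rough illustration of the edit-sequence idea described in the abstract above, the sketch below computes a unit-cost Levenshtein edit sequence between two strings and turns it into simple operation-frequency features. This is not the dissertation's finite-state CRF: the function names `edit_sequence` and `pair_features`, the unit costs, and the feature set are assumptions chosen for illustration, and a learned model would replace the fixed costs with trained parameters.

```python
from collections import Counter
from typing import Dict, List, Tuple

OPS = ("match", "substitute", "insert", "delete")


def edit_sequence(a: str, b: str) -> Tuple[int, List[str]]:
    """Unit-cost Levenshtein alignment (illustrative stand-in, not a CRF).

    Returns the edit distance and one minimal sequence of operations
    that turns string `a` into string `b`.
    """
    n, m = len(a), len(b)
    # dist[i][j] = minimum number of edits turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Walk back from the bottom-right cell to recover one optimal edit sequence.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("match" if cost == 0 else "substitute")
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()
    return dist[n][m], ops


def pair_features(a: str, b: str) -> Dict[str, float]:
    """Hypothetical feature map: relative frequency of each edit operation."""
    _, ops = edit_sequence(a, b)
    counts = Counter(ops)
    total = max(len(ops), 1)  # avoid division by zero for two empty strings
    return {op: counts[op] / total for op in OPS}
```

Feature vectors of this kind could, in principle, be passed to a classifier such as an SVM to label record pairs as duplicate or non-duplicate; the abstract does not specify the features actually used.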
Keywords/Search Tags: Deep Web, Information Extraction, Data Integration, Conditional Random Fields, Probabilistic Graphical Learning Models