
Research On Deep Web Data Extraction And Refining Methods

Posted on: 2015-05-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xin
Full Text: PDF
GTID: 1228330467974280
Subject: Computer application technology

Abstract/Summary:
The evolution of the Web has made it the largest information repository, increasingly resembling an encyclopedia. Modern applications seek to integrate its high-quality data to obtain useful information. Compared with the Surface Web, Deep Web resources offer wider domain coverage, a larger volume of information, and better information quality and structure, providing reliable data for integration and intelligent applications. However, because Deep Web data are heterogeneous, autonomous, and dynamic, integration results can be redundant, inaccurate, and fragmented, so the raw data must be thoroughly refined. To improve the quality of the integrated data, this dissertation proposes four solutions: a data record extraction method based on Markov Logic Networks, a concept extraction method based on an entity topic model, a duplicate record detection and refining method based on active transfer learning, and a conflict resolution method based on source reliabilities with time decay. The main achievements and contributions are as follows (illustrative sketches of each appear after the abstract):

(1) This dissertation proposes a novel, general record extraction model based on Markov Logic Networks (MLNs) to tackle the inefficient crawling caused by nested and discontinuous query result records. The model first devises a vision-tree-based extraction strategy to extract discontinuous records from multiple data regions. It then incorporates both site-level and page-level knowledge to extract diverse attributes (especially nested and detail-page attributes) from different pages, ensuring the integrity of each record. Finally, it employs MLNs to integrate all the functional evidence, detect data record nodes and attribute nodes, and align semantic labels accordingly. Consequently, the model can tolerate potentially incomplete and contradictory records.

(2) The value of data lies in the concepts it expresses, so extracting the concept behind each data record helps computers better understand and process the data. Given the distinctive features of Deep Web data, this dissertation proposes a record entity topic model to conceptualize records. The model holds that the entity distribution influences the distribution of words within a record, so words can be regarded as drawn from distributions over topic-entity pairs. It further assumes that the words in each record are associated with topics generated either from the record's own topic distribution or from the overall topic distribution. Through inference, the model learns and discovers the concept behind each data record.

(3) Because the Deep Web is autonomous, dynamic, and query-dependent, extracted data records exhibit a high duplication rate. To identify the same record across different references, this dissertation proposes a novel multi-source active transfer learning algorithm for entity resolution that further refines the data. The algorithm jointly trains classifiers with relatively low sample complexity while handling the class imbalance of real-world samples. It transfers the common feature space and adjusts weight vectors to reduce labeling cost. As a result, it trains classifiers with stable precision and recall using fewer data instances from all sources.

(4) Deep Web data are updated frequently, and data copying occurs constantly among different sources.
To refine conflicting or incomplete values in the integrated data, this dissertation presents an algorithm that discovers the truth by taking into account the impact of time decay on the reliability of each data source. The algorithm learns agreement decay and disagreement decay while computing similarities over the potentially conflicting data sets gathered in a preprocessing classification stage. Relying on the time-decay adjustment in the similarity formulas, it obtains a more accurate reliability matrix for each source and infers the true value of each attribute by iteratively updating the clustering.
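The sketches below illustrate the core mechanism of each contribution under stated assumptions; none reproduces the dissertation's actual implementation. For contribution (1), the MLN step amounts to a log-linear combination of grounded evidence formulas. A minimal Python sketch, with hypothetical predicate names and weights, of how such weighted evidence could score a candidate record node in the vision tree:

```python
import math

# Hypothetical evidence predicates and weights for a candidate vision-tree
# node; the names and values are illustrative, not from the dissertation.
EVIDENCE_WEIGHTS = {
    "repeating_sibling_structure": 1.8,  # site-level layout regularity
    "aligned_visual_block": 1.2,         # vision-tree alignment cue
    "contains_detail_attribute": 0.9,    # page-level knowledge
    "inside_known_data_region": 1.5,
}
BIAS = -2.0  # prior weight against labeling an arbitrary node a record

def record_node_probability(evidence):
    """Log-linear (MLN-style) combination of evidence.

    Each satisfied formula contributes its weight to the score; the
    probability that the node is a data record node is the logistic
    function of the total, as in a two-state Markov Logic Network
    whose ground atoms are scored independently (a simplification of
    joint MLN inference).
    """
    score = BIAS + sum(w for name, w in EVIDENCE_WEIGHTS.items()
                       if evidence.get(name))
    return 1.0 / (1.0 + math.exp(-score))

# A visually aligned node inside a known data region with repeated siblings.
node = {
    "repeating_sibling_structure": True,
    "aligned_visual_block": True,
    "inside_known_data_region": True,
}
print(f"P(record node) = {record_node_probability(node):.3f}")
```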
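For contribution (2), the abstract describes a generative story in which words are drawn from distributions over topic-entity pairs, with a switch between the record's own and the overall topic distribution. A toy generative sketch, assuming symmetric Dirichlet priors and illustrative sizes (all hyperparameters hypothetical; the dissertation's inference procedure is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and symmetric Dirichlet hyperparameters.
V, K, E = 50, 4, 6           # vocabulary, topics, entities
alpha, beta = 0.5, 0.1
lam = 0.7                    # P(word uses the record's own topic distribution)

phi = rng.dirichlet([beta] * V, size=(K, E))   # word dist. per (topic, entity)
theta_global = rng.dirichlet([alpha] * K)      # overall topic distribution

def generate_record(entities, n_words=20):
    """Generate one record under the sketched topic-entity pair model."""
    theta_r = rng.dirichlet([alpha] * K)       # record's own topic distribution
    words = []
    for _ in range(n_words):
        # Switch between the record's own topics and the overall topics.
        theta = theta_r if rng.random() < lam else theta_global
        z = rng.choice(K, p=theta)             # topic
        e = rng.choice(entities)               # entity mentioned by the record
        w = rng.choice(V, p=phi[z, e])         # word from the (topic, entity) pair
        words.append((w, z, e))
    return words

print(generate_record(entities=[0, 2, 5])[:5])
```

Inference would invert this process (e.g. by Gibbs sampling) to recover the topic-entity structure, i.e. the concept, behind each observed record.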
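Contribution (3) pairs transfer across sources with active querying; the transfer step is hard to compress, but the uncertainty-sampling loop that keeps labeling cost low can be shown. A minimal sketch using scikit-learn, with simulated pair-similarity features and oracle labels (all hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated similarity features for candidate duplicate pairs pooled from
# several sources (e.g. title, price, and date similarity) plus oracle labels.
X_pool = rng.random((500, 3))
y_pool = (X_pool.mean(axis=1) > 0.6).astype(int)

# Seed the training set with a few pairs of each class.
labeled = list(np.where(y_pool == 1)[0][:5]) + list(np.where(y_pool == 0)[0][:5])
clf = LogisticRegression()

for _ in range(20):                      # active-learning rounds
    clf.fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    margin = np.abs(proba - 0.5)         # uncertainty-sampling criterion
    margin[labeled] = np.inf             # never re-query a labeled pair
    query = int(np.argmin(margin))       # most ambiguous candidate pair
    labeled.append(query)                # "ask the oracle", grow training set

print("pairs labeled:", len(labeled))
```

In the dissertation's multi-source setting, the learned weight vectors would additionally be transferred across sources; here a single pooled classifier stands in for that step.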
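For contribution (4), the core loop of time-decayed truth discovery can be sketched as an alternation between value confidence and source reliability, with older observations down-weighted exponentially. The claims, decay rate, and initial reliabilities below are hypothetical, and the dissertation's learned agreement/disagreement decays are replaced by a single fixed rate:

```python
import math
from collections import defaultdict

# Hypothetical claims: (source, object, value, observation age in days).
claims = [
    ("s1", "price:item42", 9.99, 1),
    ("s2", "price:item42", 9.99, 30),
    ("s3", "price:item42", 7.50, 2),
    ("s1", "stock:item42", "yes", 1),
    ("s3", "stock:item42", "no", 90),
]
DECAY = 0.02                                     # illustrative decay per day
reliability = {s: 0.8 for s, _, _, _ in claims}  # uniform initial reliabilities

for _ in range(10):          # alternate value confidence <-> source reliability
    # 1. Value confidence: time-decayed vote of the supporting sources.
    support = defaultdict(float)
    for s, obj, val, age in claims:
        support[(obj, val)] += reliability[s] * math.exp(-DECAY * age)
    # 2. Normalize per object so confidences are comparable.
    totals = defaultdict(float)
    for (obj, val), c in support.items():
        totals[obj] += c
    confidence = {(obj, val): c / totals[obj]
                  for (obj, val), c in support.items()}
    # 3. Source reliability: mean confidence of the values the source claims.
    votes = defaultdict(list)
    for s, obj, val, age in claims:
        votes[s].append(confidence[(obj, val)])
    reliability = {s: sum(v) / len(v) for s, v in votes.items()}

# Pick the highest-confidence value per object as the inferred truth.
truth = {}
for (obj, val), c in confidence.items():
    if obj not in truth or c > confidence[(obj, truth[obj])]:
        truth[obj] = val
print(truth)
print({s: round(r, 3) for s, r in reliability.items()})
```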
Keywords/Search Tags: Deep Web Data Record Extraction, Concept Extraction, Data Refining, Active Transfer Learning, Duplicate Record Detection, Truth Refining