
Research On Deep Web Data Extraction And Refining Methods

Posted on: 2015-05-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xin
Full Text: PDF
GTID: 1228330467974280
Subject: Computer application technology

Abstract/Summary:
The evolution of the Web has made it the largest information repository, increasingly resembling an encyclopedia. Modern applications seek to integrate its high-quality data to obtain useful information. Compared with the Surface Web, Deep Web resources offer wider domain coverage, a larger volume of information, and better information quality and structure, providing reliable data for integration and intelligent applications. However, because Deep Web data are heterogeneous, autonomous, and dynamic, integration results can be redundant, inaccurate, and fragmented, so the raw data must be thoroughly refined. To improve the quality of the integrated data, this dissertation proposes four solutions: a data record extraction method based on Markov Logic Networks, a concept extraction method based on an entity topic model, a duplicate record detection and refining method based on active transfer learning, and a conflict resolution method based on source reliabilities with time decay. The main achievements and contributions are as follows (illustrative sketches of each appear after the abstract):

(1) This dissertation proposes a novel, general record extraction model based on Markov Logic Networks (MLNs) to tackle the inefficient crawling caused by nested and discontinuous query result records. The model first devises a vision-tree-based extraction strategy to extract discontinuous records from multiple data regions. It then incorporates both site-level and page-level knowledge to extract diverse attributes (especially nested and detail-page attributes) from different pages, ensuring the integrity of each record. Finally, it employs MLNs to integrate all the functional evidence, detect data record nodes and attribute nodes, and align semantic labels accordingly. Consequently, the model can tolerate potentially incomplete and contradictory records.

(2) The value of data lies in the concepts it expresses, so extracting the concept behind each data record helps computers better understand and process the data. Given the distinctive features of Deep Web data, this dissertation proposes a record entity topic model to conceptualize records. The model holds that the entity distribution influences the distribution of words within a record, so words can be regarded as drawn from distributions over topic-entity pairs. It further assumes that the words in each record are associated with topics generated either from the record's own topic distribution or from the overall topic distribution. Through inference, the model learns and discovers the concept behind each data record.

(3) Because the Deep Web is autonomous, dynamic, and query-dependent, extracted data records exhibit a high duplication rate. To identify the same record across different references, this dissertation proposes a novel multi-source active transfer learning algorithm for entity resolution that further refines the data. The algorithm jointly trains classifiers with relatively low sample complexity while handling the class imbalance of real-world samples. It transfers the common feature space and adjusts weight vectors to reduce labeling cost. As a result, it trains classifiers with stable precision and recall using fewer data instances from all sources.

(4) Deep Web data are updated frequently, and data copying occurs constantly among different sources.
To refine conflicting or incomplete values in the integrated data, this dissertation presents an algorithm that discovers the truth by taking into account the impact of time decay on the reliability of each data source. The algorithm learns agreement decay and disagreement decay while computing similarities over the potentially conflicting data sets gathered in a preprocessing classification stage. Relying on the time-decay adjustment in the similarity formulas, it obtains a more accurate reliability matrix for each source and infers the true value of each attribute by iteratively updating the clustering.
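The sketches below illustrate the core mechanism of each contribution under stated assumptions; none reproduces the dissertation's actual implementation. For contribution (1), the MLN step amounts to a log-linear combination of grounded evidence formulas. A minimal Python sketch, with hypothetical predicate names and weights, of how such weighted evidence could score a candidate record node in the vision tree:

```python
import math

# Hypothetical evidence predicates and weights for a candidate vision-tree
# node; the names and values are illustrative, not from the dissertation.
EVIDENCE_WEIGHTS = {
    "repeating_sibling_structure": 1.8,  # site-level layout regularity
    "aligned_visual_block": 1.2,         # vision-tree alignment cue
    "contains_detail_attribute": 0.9,    # page-level knowledge
    "inside_known_data_region": 1.5,
}
BIAS = -2.0  # prior weight against labeling an arbitrary node a record

def record_node_probability(evidence):
    """Log-linear (MLN-style) combination of evidence.

    Each satisfied formula contributes its weight to the score; the
    probability that the node is a data record node is the logistic
    function of the total, as in a two-state Markov Logic Network
    whose ground atoms are scored independently (a simplification of
    joint MLN inference).
    """
    score = BIAS + sum(w for name, w in EVIDENCE_WEIGHTS.items()
                       if evidence.get(name))
    return 1.0 / (1.0 + math.exp(-score))

# A visually aligned node inside a known data region with repeated siblings.
node = {
    "repeating_sibling_structure": True,
    "aligned_visual_block": True,
    "inside_known_data_region": True,
}
print(f"P(record node) = {record_node_probability(node):.3f}")
```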
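For contribution (2), the abstract describes a generative story in which words are drawn from distributions over topic-entity pairs, with a switch between the record's own and the overall topic distribution. A toy generative sketch, assuming symmetric Dirichlet priors and illustrative sizes (all hyperparameters hypothetical; the dissertation's inference procedure is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and symmetric Dirichlet hyperparameters.
V, K, E = 50, 4, 6           # vocabulary, topics, entities
alpha, beta = 0.5, 0.1
lam = 0.7                    # P(word uses the record's own topic distribution)

phi = rng.dirichlet([beta] * V, size=(K, E))   # word dist. per (topic, entity)
theta_global = rng.dirichlet([alpha] * K)      # overall topic distribution

def generate_record(entities, n_words=20):
    """Generate one record under the sketched topic-entity pair model."""
    theta_r = rng.dirichlet([alpha] * K)       # record's own topic distribution
    words = []
    for _ in range(n_words):
        # Switch between the record's own topics and the overall topics.
        theta = theta_r if rng.random() < lam else theta_global
        z = rng.choice(K, p=theta)             # topic
        e = rng.choice(entities)               # entity mentioned by the record
        w = rng.choice(V, p=phi[z, e])         # word from the (topic, entity) pair
        words.append((w, z, e))
    return words

print(generate_record(entities=[0, 2, 5])[:5])
```

Inference would invert this process (e.g. by Gibbs sampling) to recover the topic-entity structure, i.e. the concept, behind each observed record.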
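Contribution (3) pairs transfer across sources with active querying; the transfer step is hard to compress, but the uncertainty-sampling loop that keeps labeling cost low can be shown. A minimal sketch using scikit-learn, with simulated pair-similarity features and oracle labels (all hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated similarity features for candidate duplicate pairs pooled from
# several sources (e.g. title, price, and date similarity) plus oracle labels.
X_pool = rng.random((500, 3))
y_pool = (X_pool.mean(axis=1) > 0.6).astype(int)

# Seed the training set with a few pairs of each class.
labeled = list(np.where(y_pool == 1)[0][:5]) + list(np.where(y_pool == 0)[0][:5])
clf = LogisticRegression()

for _ in range(20):                      # active-learning rounds
    clf.fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    margin = np.abs(proba - 0.5)         # uncertainty-sampling criterion
    margin[labeled] = np.inf             # never re-query a labeled pair
    query = int(np.argmin(margin))       # most ambiguous candidate pair
    labeled.append(query)                # "ask the oracle", grow training set

print("pairs labeled:", len(labeled))
```

In the dissertation's multi-source setting, the learned weight vectors would additionally be transferred across sources; here a single pooled classifier stands in for that step.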
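For contribution (4), the core loop of time-decayed truth discovery can be sketched as an alternation between value confidence and source reliability, with older observations down-weighted exponentially. The claims, decay rate, and initial reliabilities below are hypothetical, and the dissertation's learned agreement/disagreement decays are replaced by a single fixed rate:

```python
import math
from collections import defaultdict

# Hypothetical claims: (source, object, value, observation age in days).
claims = [
    ("s1", "price:item42", 9.99, 1),
    ("s2", "price:item42", 9.99, 30),
    ("s3", "price:item42", 7.50, 2),
    ("s1", "stock:item42", "yes", 1),
    ("s3", "stock:item42", "no", 90),
]
DECAY = 0.02                                     # illustrative decay per day
reliability = {s: 0.8 for s, _, _, _ in claims}  # uniform initial reliabilities

for _ in range(10):          # alternate value confidence <-> source reliability
    # 1. Value confidence: time-decayed vote of the supporting sources.
    support = defaultdict(float)
    for s, obj, val, age in claims:
        support[(obj, val)] += reliability[s] * math.exp(-DECAY * age)
    # 2. Normalize per object so confidences are comparable.
    totals = defaultdict(float)
    for (obj, val), c in support.items():
        totals[obj] += c
    confidence = {(obj, val): c / totals[obj]
                  for (obj, val), c in support.items()}
    # 3. Source reliability: mean confidence of the values the source claims.
    votes = defaultdict(list)
    for s, obj, val, age in claims:
        votes[s].append(confidence[(obj, val)])
    reliability = {s: sum(v) / len(v) for s, v in votes.items()}

# Pick the highest-confidence value per object as the inferred truth.
truth = {}
for (obj, val), c in confidence.items():
    if obj not in truth or c > confidence[(obj, truth[obj])]:
        truth[obj] = val
print(truth)
print({s: round(r, 3) for s, r in reliability.items()})
```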
Keywords/Search Tags: Deep Web Data Record Extraction, Concept Extraction, Data Refining, Active Transfer Learning, Duplicate Record Detection, Truth Refining