
Research On Modeling And Algorithms For Key Problems Of Deep Web Data Integration And Mining

Posted on: 2014-02-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y N Li    Full Text: PDF
GTID: 1228330398998476    Subject: Computer application technology
Abstract/Summary:
The Web has been rapidly "deepened" by the tremendous number of online Web databases (WDBs), whose potentially unlimited, high-quality information is hidden behind their only entry point: the searchable form, or query interface. Since the Deep Web (most of the content held in WDBs) is an important yet largely unexplored frontier for information search, virtual data integration and mining over Deep Web sources has become an emerging research problem that has received great attention. Several challenges nevertheless remain, such as the automatic discovery and recognition of the query interfaces of domain-specific WDBs and the matching and integration of query-interface schemas, owing to the unstructured nature of the interfaces and the 4V properties of big data on the Web: Volume, Variety, Velocity and Value. To address these challenging problems and to overcome the limitations of existing algorithms, which are largely non-model-based, heuristic, serial and driven by trial-and-error correction, the author has made a systematic study of modeling techniques and of effective and efficient algorithms for the above key issues. The main work and contributions are outlined as follows:

1) Aiming at the unresolved problem of automatically and effectively discovering and recognizing the entry points (i.e., forms) of domain-specific WDBs, the problem is first converted into a multi-objective optimization problem. Then, following a divide-and-conquer strategy, a series of novel and effective strategies and algorithms is employed, including a two-step page classifier, a link-scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria. On this basis, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework that addresses the limitations of existing solutions. Theoretical analysis and experimental results show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness. Moreover, to further improve the performance of the E-FFC, a new architecture for an intelligent agent-based crawler for domain-specific WDBs, iCrawler, is proposed.

2) To overcome the above-mentioned limitations of schema matching algorithms for the query interfaces of domain-specific WDBs, a new negative correlation metric and a semantic similarity metric are first presented. Then, based on a careful selection of three matchers for schema matching, a constructed ontology tree and the combination rule of Dempster-Shafer (D-S) evidence theory, a new efficient and effective schema matching algorithm for the query interfaces of domain-specific WDBs is introduced.
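The abstract does not detail the three matchers or the masses they assign, but the D-S combination rule they are fused with is standard. The following minimal Python sketch shows that rule for two evidence sources; the matcher names and mass values are purely hypothetical, not taken from the thesis.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two basic probability assignments (mass functions) with
    Dempster's rule. Each mass function maps frozenset hypotheses to masses
    summing to 1. Raises if the evidence is totally conflicting (K == 1)."""
    combined = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc          # mass falling on the empty set (K)
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence, cannot combine")
    norm = 1.0 - conflict                # renormalise by 1 - K
    return {h: v / norm for h, v in combined.items()}

# Hypothetical example: two matchers judging whether two interface attributes
# match; the full frame THETA expresses a matcher's ignorance.
MATCH, NOMATCH = frozenset({"match"}), frozenset({"nomatch"})
THETA = MATCH | NOMATCH
linguistic_matcher = {MATCH: 0.6, NOMATCH: 0.1, THETA: 0.3}
semantic_matcher = {MATCH: 0.7, NOMATCH: 0.2, THETA: 0.1}
print(dempster_combine(linguistic_matcher, semantic_matcher))
```

Combining further matchers amounts to applying the same rule repeatedly, since Dempster's rule is associative and commutative.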
3) Considering the flaws of current algorithms for integrating the query interfaces of domain-specific WDBs, a constraint matrix is defined that accurately describes three types of constraints among the attributes of a query interface (hierarchical, grouping and precedence constraints) together with their strengths, and it is proved that the schema tree of a query interface corresponds to exactly one constraint matrix and vice versa. The problem of integrating domain-specific query interfaces is then transformed into one of integrating constraint matrices, and a multi-objective optimization model is set up for it. To solve this optimization model effectively, strategies for extending and merging constraint matrices are designed, and a novel, efficient algorithm applicable to large-scale automatic integration of domain-specific query interfaces is developed. Finally, the proposed algorithm is evaluated by experiments on a real query-interface data set. Theoretical analysis and experimental results show that it outperforms existing state-of-the-art integration algorithms for the query interfaces of domain-specific WDBs.

4) Motivated by overcoming the drawbacks of existing MLCS (multiple longest common subsequence) algorithms and by tackling this NP-hard problem, it is first observed that the extensive comparison and deletion of multi-dimensional dominant points in leading MLCS algorithms is the bottleneck of their time efficiency. To this end, a novel general parallel MLCS algorithm is devised, based on a proposed fast non-dominated sorting method combined with a forward-comparing operation, dynamic data-block partitioning and multithread assignment strategies. Comprehensive experiments on benchmark sets of both random and biological sequences validate its performance: theoretical and experimental results show that the proposed algorithm is efficient and effective, and superior to existing state-of-the-art parallel MLCS algorithms. More importantly, the performance bottlenecks of dominant-point-based algorithms are further revealed. A new generalized problem-solving model, ICSG-PCC, is then proposed by introducing the concepts of the irredundant common subsequence graph (ICSG), the parallel collection and the parallel collection chain (PCC). Finally, based on the ICSG-PCC model, a novel MLCS algorithm is presented that uses forward and reverse parallel topological sorting. The proposed algorithm is evaluated by experiments on benchmark sets of both random and biological strings. Theoretical analysis and experimental results show that its time and space complexities depend only on, and are linear in, the number of matched pairs of the input strings, which indicates that it not only outperforms existing dominant-point-based algorithms but can also tackle this NP-hard problem efficiently.
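For background on the bottleneck named in contribution 4: dominant-point-based MLCS algorithms repeatedly test whether one d-dimensional match point dominates another and discard the dominated ones. The sketch below shows that dominance test and the naive pairwise filter whose cost the thesis's fast non-dominated sorting is designed to avoid; it is not the thesis's algorithm, and the point values are illustrative only.

```python
def dominates(p, q):
    """True if match point p dominates q in the MLCS sense: p occurs no later
    than q in every sequence dimension and strictly earlier in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def non_dominated(points):
    """Naive O(n^2 * d) filter keeping only non-dominated (Pareto-minimal)
    match points; this pairwise comparison is the classical bottleneck."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Match points for three sequences (d = 3); each coordinate is the position of
# a common character in one sequence (illustrative values only).
pts = [(1, 2, 4), (2, 1, 5), (3, 3, 3), (2, 4, 6)]
print(non_dominated(pts))   # (2, 4, 6) is dropped: it is dominated by (1, 2, 4)
```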
Keywords/Search Tags: Deep Web, data integration and mining, model, algorithm, query interface, MLCS