
Study On Data Extraction Model And Algorithm For Complex Data Sources

Posted on: 2006-01-03    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X B Deng    Full Text: PDF
GTID: 1118360155960418    Subject: Computer software and theory
Abstract/Summary:
With the fast development of information technology, electronic documents online have become a huge information source. Users would like to apply mature database technology to query, analyze and report data of interest from this source, which has spurred research on data integration systems for such data sources. Data integration is a method that combines data from different data sources and provides users with a unified view of these data.

The basis of a data integration system is the data extraction problem, which can be stated simply as: given a data source S, determine a mapping M from S to a data repository R that searches for data objects in S using a data extraction model, a set of extraction rules and an extraction algorithm, and populates R with the data objects found in S using a database scheme, a set of mapping rules and a populating algorithm (in this thesis, the data extraction model, extraction rules, database scheme and mapping rules are called metadata). A computer program that realizes the mapping M is called a wrapper. Data integration systems often generate wrappers with wrapper generation toolkits. As data sources vary widely in their complexity, accurately building wrappers for them is the key to realizing data integration systems. This has become a hot research topic, and it is also the core of this thesis.

The motivation of this thesis is to extract data objects from various biological data sources and populate a biological data integration system with them, aiming at a convenient query and analysis platform for biological scientists. From the standpoint of data extraction, biological data sources are huge, demand high extraction accuracy, and exhibit complexities that paralyze present wrapper generation toolkits. First, data objects in biological data sources often take the form of a hierarchically nested structure whose components may be missing, repeated, ordered and/or unordered; this results in the structural complexity of data objects. Second, biological data sources often delimit their data objects with sets of nonstandard segmentation tags, many data objects have no proper tags, and some pre-announced tags occasionally appear inside the contents of data objects (i.e., the data sources may contain noisy components); this gives rise to the complexity of extraction rules. Third, before the database is populated, extracted data objects often need to undergo extended operations; this brings about the complexity of mapping rules.

Starting from an analysis of the problems that present wrapper generation toolkits encounter with complex biological data sources, this thesis presents two data extraction models and algorithms for complex data sources, and then designs and realizes the ReDE and L-tree wrapper generation toolkits. The main contributions of this thesis include the following:

(1) A new data extraction method for complex data sources without noise

Present wrapper generation toolkits often require many metadata to be generated and maintained manually. To overcome this flaw, we realize an automatic metadata generation method that derives metadata from a regular expression (RE) by taking full advantage of the dependencies among metadata. This method lowers the workload of generating and maintaining metadata and keeps the metadata consistent.
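To make the idea of deriving metadata from a single RE concrete, the following minimal Python sketch obtains both the target columns and the extraction rule from one RE with named capturing groups. It is an illustration only, not the ReDE toolkit itself; the record format, field names and helper functions are assumptions made for this example.

    # Illustrative sketch only: not the thesis's ReDE toolkit. The record
    # format, field names and helper functions are assumptions of the example.
    import re

    # A hand-written RE describing one record of a hypothetical flat-file source.
    RECORD_RE = re.compile(
        r"ID\s+(?P<accession>\S+)\n"
        r"DE\s+(?P<description>.+)\n"
        r"OS\s+(?P<organism>.+)\n"
    )

    def derive_metadata(pattern):
        # The named groups double as the columns of the target relation, so the
        # database scheme and the extraction rule stay consistent because both
        # are derived from the same expression.
        return {"columns": sorted(pattern.groupindex, key=pattern.groupindex.get)}

    def extract(text, pattern):
        # The same RE serves as the extraction rule: every match is one record.
        return [m.groupdict() for m in pattern.finditer(text)]

    sample = "ID  P12345\nDE  Example protein\nOS  Homo sapiens\n"
    print(derive_metadata(RECORD_RE))  # {'columns': ['accession', 'description', 'organism']}
    print(extract(sample, RECORD_RE))

Because everything comes from the one RE, a change to that RE propagates to both the extraction rule and the schema, which is the consistency property described above.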
As the conventional RE matching algorithm cannot be used directly to solve data extraction problems, we design a new data extraction algorithm based on conventional RE matching. This algorithm adopts conventional RE matching as its basic building block, and recursively segments, extracts and assembles data objects in the data source using an RE parse tree. We also give a cost model to analyze the efficiency of the algorithm, and then discuss, and verify with experiments, its scalability and time complexity.

(2) A new RE ambiguity checking algorithm

In constructing REs, users often intentionally introduce some beneficial ambiguity to simplify RE construction, and may also unconsciously leave some harmful ambiguity that degrades extraction accuracy. However, present disambiguation methods for REs cannot distinguish between these two kinds of ambiguity. To tackle this problem, we study the segmenting ambiguity problem in REs in the following steps: 1) give formal definitions of segmentation features in REs; 2) propose a method that calculates segmentation features in REs based on a set of theorems; and 3) give a formal definition of harmful ambiguity in REs and design a checking algorithm for it based on segmentation features, aiming to help users debug REs.

(3) A new data extraction method for complex data sources with noise

Present data extraction models often lack the expressive power needed for complex data sources with noise. To overcome this drawback, by describing data sources with extended regular expressions (EREs) of our own design, we propose a data extraction model named the data extraction tree (DE-tree) for complex data sources with noise, give a method for constructing locators for data objects,...
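As a rough illustration of the "recursively segment, extract and assemble" strategy built on conventional RE matching, here is a minimal Python sketch. The node classes (Field, Record, Repeat) and the sample record format are assumptions made for this example; they stand in for, and are not, the DE-tree or L-tree structures defined in the thesis.

    # Minimal sketch of recursive segment-extract-assemble over a small,
    # hand-built parse tree; every node uses ordinary RE matching at its level.
    # The node classes and the sample format are assumptions of this example.
    import re
    from dataclasses import dataclass

    @dataclass
    class Field:
        name: str
        pattern: str                      # RE capturing one atomic value
        def extract(self, text):
            m = re.search(self.pattern, text)
            return {self.name: m.group(1) if m else None}

    @dataclass
    class Record:
        children: list
        def extract(self, text):
            result = {}
            for child in self.children:   # assemble the extracted pieces
                result.update(child.extract(text))
            return result

    @dataclass
    class Repeat:
        name: str
        item_pattern: str                 # RE that segments one repeated item
        child: Record
        def extract(self, text):
            chunks = re.findall(self.item_pattern, text)   # segment
            return {self.name: [self.child.extract(c) for c in chunks]}

    # A hypothetical record: one description line and repeated feature lines.
    tree = Record([
        Field("description", r"DE\s+(.+)"),
        Repeat("features", r"FT\s+.+", Record([Field("feature", r"FT\s+(.+)")])),
    ])

    sample = "DE  Example protein\nFT  domain 1..50\nFT  signal 1..20\n"
    print(tree.extract(sample))
    # {'description': 'Example protein',
    #  'features': [{'feature': 'domain 1..50'}, {'feature': 'signal 1..20'}]}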
Keywords/Search Tags: Data Extraction, Data Extraction Model, Extraction Algorithm, Complex Data Source, DE-tree, L-tree, Noise