Font Size: a A A

Research On The Techniques Of Entity Identity On XML Data

Posted on:2014-05-31Degree:DoctorType:Dissertation
Country:ChinaCandidate:X M LiuFull Text:PDF
GTID:1268330392472596Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Recently, dirty data has been widely viewed in various fields of information society,caused many problems and made great economic loss. There have been many researcheforts on data usability in our country and abroad. Entity identity is defined based onthe concepts of data entity and real entity. A data entity describes some real entity and isthe representation of the real entity in database. A database satisfies the requirements ofentity identity, if and only if there are no two data entities describing the same real entity.Data usability is one of the most popular research problems in data processing area. Therehave been many works of entity identity on relational data, however, most of them canneither applied nor extended to non-relational data. In fact, there are only few works ondata usability of non-relational data.Focusing on XML data which is a kind of widely used non-relational data, to en-hance the techniques of data usability management, this thesis studies the problems ofXML entity extraction, XML entity matching and matching results resolution, and devel-ops the related techniques on XML entity identity. The main works can be summarizedas follows.First, the entity extraction problem on XML data is firstly defined and studied, and arule-based method KEE is proposed to extract entities from XML data automatically. InXML data, there are no obvious entities, and no previous works consider entity extractionproblem, therefore, it is one of the fundamental problems in XML entity identity research.The KEE method proposed utilizes queries to represent XML entities which is concise andavoids enumerating the entities one by one. Users are allowed to provide small amountsof entities using XML key rules, and KEE will find out other interesting entities. Usingquery relaxation techniques, KEE can extract entities correctly with the existing of hetero-geneous structures in XML data. Utilizing the idea of sharing computation, an automatonbased implementation is introduced. Theoretical analysis and experimental results showthat the proposed method can extract XML entities efectively and efciently.Second, the XML entity matching problem is studied, and, to improve the efciencyof entity matching procedure, a hash based method is proposed. Given the entity similar-ity function, entity matching problem aims to discover all entity pairs which are similarenough, and is one of the fundamentals in detecting errors of entity identity. For XML entity matching, both structural and content information should be considered, however,previous works only focus on content information and can not match XML entities prop-erly. Using locality sensitive hashing techniques, the method proposed partitions givenentities into groups such that similar entities will be put into the same group with highprobabilities, only computes similarity scores for entities within the same group, and im-proves the efciency of matching. For diferent application settings, three kinds of entitysimilarity measuring functions are designed, and they are proved to be locality-sensitive.Based on the corresponding locality-sensitive hashing functions, algorithms of matchingXML entities for the three kinds of functions are designed. Theoretical analysis and ex-perimental results show that the proposed method can match XML entities efectively andefciently.Third, the problem of resolving entity matching results is studied to discover mean-ingful resolutions. Two formal definitions of resolving problem are utilized and the cor-responding theoretical analysis and algorithms are provided. The goal of entity matchingis to solve entity identification problem. The resolving problem of entity matching resultsaims to transform entity matching results to entity identification results. Based on theideas of merging multiple matching algorithms and using extra information on matchingresults, two kinds of resolving problems are formally defined. For the first one, the re-solving problem is proved to be NP-complete, and an L-reduction based approximationalgorithm is designed. For special cases, an algorithm with better approximating ratio isgiven. For the second one, the problem is proved to be NP-complete, and an LP based ap-proximation algorithm is designed. For special cases, a randomized algorithm with betterapproximating ratio is given, and to process large XML data, four heuristic algorithms arealso designed. Experimental results show that the proposed algorithms are efcient andefective.Forth, considering the drawbacks of proposed methods for entity extraction and re-solving problem, two related optimization problems are studied. The rules used by entityextraction method are hard to obtain sometimes. Therefore, to investigate the feasibilityof automatic methods for discovering rules, XML query learning problem is formally de-fined. Considering four diferent query classes, the query learning problems are studiedfrom theoretical aspects. For tractable cases, polynomial time algorithms are given, whilefor intractable cases, the computational complexity analysis and approximability analysisare provided. Given the observation that the approximation algorithms for the first resolv- ing problem are not suitable for practical applications, using parameterized complexitytheory, it is studied that whether the problem is fixed-parameter tractable. Given threeparameterized versions, two of them are proved to be fixed-parameter tractable, and thelast one is proved to be fixed-parameter intractable.
Keywords/Search Tags:entity identity, XML data, entity extraction, entity matching, resolving algo-rithm
PDF Full Text Request
Related items