Research On The Techniques Of Entity Identity On XML Data

Posted on:2014-05-31

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X M Liu

Full Text:PDF

GTID:1268330392472596

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Recently, dirty data has been widely viewed in various fields of information society,caused many problems and made great economic loss. There have been many researcheforts on data usability in our country and abroad. Entity identity is defined based onthe concepts of data entity and real entity. A data entity describes some real entity and isthe representation of the real entity in database. A database satisfies the requirements ofentity identity, if and only if there are no two data entities describing the same real entity.Data usability is one of the most popular research problems in data processing area. Therehave been many works of entity identity on relational data, however, most of them canneither applied nor extended to non-relational data. In fact, there are only few works ondata usability of non-relational data.Focusing on XML data which is a kind of widely used non-relational data, to en-hance the techniques of data usability management, this thesis studies the problems ofXML entity extraction, XML entity matching and matching results resolution, and devel-ops the related techniques on XML entity identity. The main works can be summarizedas follows.First, the entity extraction problem on XML data is firstly defined and studied, and arule-based method KEE is proposed to extract entities from XML data automatically. InXML data, there are no obvious entities, and no previous works consider entity extractionproblem, therefore, it is one of the fundamental problems in XML entity identity research.The KEE method proposed utilizes queries to represent XML entities which is concise andavoids enumerating the entities one by one. Users are allowed to provide small amountsof entities using XML key rules, and KEE will find out other interesting entities. Usingquery relaxation techniques, KEE can extract entities correctly with the existing of hetero-geneous structures in XML data. Utilizing the idea of sharing computation, an automatonbased implementation is introduced. Theoretical analysis and experimental results showthat the proposed method can extract XML entities efectively and efciently.Second, the XML entity matching problem is studied, and, to improve the efciencyof entity matching procedure, a hash based method is proposed. Given the entity similar-ity function, entity matching problem aims to discover all entity pairs which are similarenough, and is one of the fundamentals in detecting errors of entity identity. For XML entity matching, both structural and content information should be considered, however,previous works only focus on content information and can not match XML entities prop-erly. Using locality sensitive hashing techniques, the method proposed partitions givenentities into groups such that similar entities will be put into the same group with highprobabilities, only computes similarity scores for entities within the same group, and im-proves the efciency of matching. For diferent application settings, three kinds of entitysimilarity measuring functions are designed, and they are proved to be locality-sensitive.Based on the corresponding locality-sensitive hashing functions, algorithms of matchingXML entities for the three kinds of functions are designed. Theoretical analysis and ex-perimental results show that the proposed method can match XML entities efectively andefciently.Third, the problem of resolving entity matching results is studied to discover mean-ingful resolutions. Two formal definitions of resolving problem are utilized and the cor-responding theoretical analysis and algorithms are provided. The goal of entity matchingis to solve entity identification problem. The resolving problem of entity matching resultsaims to transform entity matching results to entity identification results. Based on theideas of merging multiple matching algorithms and using extra information on matchingresults, two kinds of resolving problems are formally defined. For the first one, the re-solving problem is proved to be NP-complete, and an L-reduction based approximationalgorithm is designed. For special cases, an algorithm with better approximating ratio isgiven. For the second one, the problem is proved to be NP-complete, and an LP based ap-proximation algorithm is designed. For special cases, a randomized algorithm with betterapproximating ratio is given, and to process large XML data, four heuristic algorithms arealso designed. Experimental results show that the proposed algorithms are efcient andefective.Forth, considering the drawbacks of proposed methods for entity extraction and re-solving problem, two related optimization problems are studied. The rules used by entityextraction method are hard to obtain sometimes. Therefore, to investigate the feasibilityof automatic methods for discovering rules, XML query learning problem is formally de-fined. Considering four diferent query classes, the query learning problems are studiedfrom theoretical aspects. For tractable cases, polynomial time algorithms are given, whilefor intractable cases, the computational complexity analysis and approximability analysisare provided. Given the observation that the approximation algorithms for the first resolv- ing problem are not suitable for practical applications, using parameterized complexitytheory, it is studied that whether the problem is fixed-parameter tractable. Given threeparameterized versions, two of them are proved to be fixed-parameter tractable, and thelast one is proved to be fixed-parameter intractable.

Keywords/Search Tags:

entity identity, XML data, entity extraction, entity matching, resolving algo-rithm

PDF Full Text Request

Related items

1	Research On Key Techniques Of Entity Resolution For Big Data Integration
2	Theory And Key Techniques Of Entity Retrieval
3	English Entity Answer Extraction And Home Find
4	Research On Web Entity Activity And Entity Relationship Extraction
5	Research On Key Techniques Of Entity Search For Deep Web
6	Non-ferrous Metal Retrieval Key Technology Research Entity In The Field
7	Domain-Oriented Web Entity Expansion And Robust Optimization Of The Wrapper
8	Research On Joint Extraction Of Entity Relations By Fusing Entity Local Information
9	Research On Entity Alignment Method Of Knowledge Graph In Domain Of Social Insurances And Housing Fund
10	Research And Application Of CRF Named Entity And Entity Relationships Based On Recognition