
Duplicates Cleansing Based On Semantic Association

Posted on:2012-12-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:L HuangFull Text:PDF
GTID:1118330335955064Subject:Computer system architecture
Abstract/Summary:
With the rapid development of Internet technology and the continuous growth of data resources, information is becoming more and more complex, and managing such huge volumes of data is increasingly difficult; studies in this field are correspondingly numerous. High-quality data improves the efficiency of data processing and analysis, achieving twice the result with half the effort, and data quality is therefore one of the most important concerns in data management. Data cleansing based on semantic association exploits the semantic associations among data to help detect dirty data. This research mainly addresses: (1) the recall of duplicate detection, (2) entity disambiguation, and (3) the efficiency of duplicate detection.

Owing to the diversity of data formats, missing property values, and imprecise records in heterogeneous literature databases, duplicate records arise when such databases are integrated. Duplicate records lower the efficiency of information retrieval and therefore must be eliminated. This thesis proposes an approach named Length Filtering and Dynamic Weighting (LFDW) for duplicate-record cleansing, which proceeds in three steps. First, length filtering: record pairs whose lengths differ greatly are sifted out. Second, duplicate records are detected by dynamically weighting properties; in particular, since the author name is an important property of a literature record and one author may write his or her name in different styles, a fuzzy name-matching method is adopted to identify the same author across different name styles. Finally, to improve the performance of duplicate detection, a dynamic sliding-window algorithm is adopted when comparing records.

Owing to homonyms, abbreviations, and similar phenomena, name ambiguity is widespread on the web and in electronic documents.
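The three LFDW steps described above (length filtering, dynamic property weighting, and a sliding-window comparison) can be illustrated with a minimal sketch. The `SequenceMatcher`-based similarity, the field names, the 0.5 length-ratio cutoff, and the thresholds are illustrative assumptions, not the dissertation's actual parameters:

```python
from difflib import SequenceMatcher

def length_filter(a: str, b: str, max_ratio: float = 0.5) -> bool:
    """Step 1: skip pairs whose lengths differ too much to be duplicates."""
    la, lb = len(a), len(b)
    return abs(la - lb) / max(la, lb, 1) <= max_ratio

def similarity(a: str, b: str) -> float:
    """Character-level similarity as a stand-in for fuzzy matching."""
    return SequenceMatcher(None, a, b).ratio()

def lfdw_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Step 2: weighted similarity; fields missing in either record are
    dropped and the remaining weights renormalised (dynamic weighting)."""
    present = [f for f in weights if rec1.get(f) and rec2.get(f)]
    total = sum(weights[f] for f in present)
    if total == 0:
        return 0.0
    return sum(weights[f] * similarity(rec1[f], rec2[f]) for f in present) / total

def detect_duplicates(records, weights, length_field="title",
                      threshold=0.85, window=5):
    """Step 3: compare each record only against its neighbours
    inside a sliding window, sorted-neighborhood style."""
    records = sorted(records, key=lambda r: r.get(length_field, ""))
    dupes = []
    for i, r1 in enumerate(records):
        for r2 in records[i + 1 : i + 1 + window]:
            if not length_filter(r1.get(length_field, ""),
                                 r2.get(length_field, "")):
                continue
            if lfdw_score(r1, r2, weights) >= threshold:
                dupes.append((r1, r2))
    return dupes
```

Dropping missing fields and renormalising, rather than using fixed weights, is what makes the weighting "dynamic": a record with no author value is still comparable on its remaining properties.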
For example, when heterogeneous literature databases with different naming conventions are integrated, different authors may be mistaken for the same author, and vice versa. Name ambiguity thus makes data redundant or even dirty and lowers the precision of information retrieval. This thesis presents an approach named Semantic Association based Name Disambiguation (SAND) to resolve person-name ambiguity. The basic idea of SAND is to explore the semantic associations of name entities and to cluster the entities according to those associations; name entities in the same cluster are then regarded as the same entity.

Duplicate detection is a hot topic in the study of heterogeneous data integration and information retrieval, and the efficiency and precision of detection are the goals of this study. This thesis introduces a duplicate-detection method based on semantic links among data and proposes a novel approach named Most Possible Duplicates Partition (MPDP) to detect duplicates efficiently. The main principle of MPDP is to partition the data into most-possible-duplicate parts, within which the probability of duplicates is higher. Unlike the classical Sorted Neighborhood Method (SNM), MPDP does not sort the data into a particular order; instead, an effective partition method using the semantic links among entities is given.

With the rapid development of semantic web technology, the explosion of linked data has become a challenging problem. Since linked data come from different sources that may overlap, they can contain duplicates, and these duplicates may cause ambiguity and even errors in reasoning; yet little attention has been paid to this problem. This thesis studies the problem and gives a solution named K-radius Subgraph Comparison (KSC). The method is based on a hierarchical graph model and combines similarity computation with comparison of 'context' to detect duplicates in linked data.

The recall of the proposed method is 97%.
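The core idea of KSC, comparing two linked-data entities by the overlap of their k-radius neighbourhood subgraphs, can be sketched as follows. The adjacency-dict graph representation, the Jaccard overlap measure, and the 0.6 threshold are illustrative assumptions, not the dissertation's actual model:

```python
from collections import deque

def k_radius_subgraph(graph: dict, node: str, k: int) -> set:
    """Collect all nodes reachable from `node` within k hops:
    the node's k-radius 'context'."""
    seen, frontier = {node}, deque([(node, 0)])
    while frontier:
        cur, dist = frontier.popleft()
        if dist == k:
            continue
        for nxt in graph.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {node}

def context_similarity(graph: dict, a: str, b: str, k: int = 2) -> float:
    """Jaccard overlap of the two k-radius contexts."""
    ca = k_radius_subgraph(graph, a, k)
    cb = k_radius_subgraph(graph, b, k)
    if not ca and not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)

def is_duplicate(graph: dict, a: str, b: str,
                 k: int = 2, threshold: float = 0.6) -> bool:
    """Two entities with sufficiently similar contexts are
    flagged as candidate duplicates."""
    return context_similarity(graph, a, b, k) >= threshold
```

In real RDF data the edges would carry predicates and the comparison would weigh them; this sketch keeps only the structural intuition that duplicates share most of their surrounding context.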
The average precision is 96.6%, and the F-measure is 95.2%. The number of duplicates correctly detected per unit time is 1.008. All of these measures are higher than those of the traditional methods: the proposed method outperforms them in both recall and precision, improves the accuracy and efficiency of duplicate detection noticeably, and is simpler and faster. It is also an effective approach for entity disambiguation.
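The partitioning idea behind MPDP described earlier, grouping records that share a semantic link into one most-possible-duplicate block instead of globally sorting them as SNM does, can also be sketched minimally. The token-based link function and the union-find blocking shown here are illustrative assumptions, not the dissertation's actual partition method:

```python
from collections import defaultdict

def mpdp_partition(records, link_key):
    """Partition records into blocks: any two records sharing at least
    one semantic link (here, a token produced by `link_key`) land in
    the same block. Pairwise comparison then runs only inside blocks."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # First record seen with each token "owns" it; later records
    # carrying the same token are merged into its block.
    token_owner = {}
    for i, rec in enumerate(records):
        for tok in link_key(rec):
            if tok in token_owner:
                union(i, token_owner[tok])
            else:
                token_owner[tok] = i

    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[find(i)].append(rec)
    return list(blocks.values())
```

Because blocking avoids the global sort, records that a lexicographic order would place far apart (e.g. "Huang, L." and "L. Huang") can still end up in the same candidate block.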
Keywords/Search Tags:Semantic association, Duplicates cleansing, Name disambiguation, Entity identity, Semantic context, RDF data cleansing, K-radius subgraph comparison