Data Cleansing In The Detection Of Similar Records

Posted on:2011-06-20

Degree:Master

Type:Thesis

Country:China

Candidate:M J Xie

Full Text:PDF

GTID:2178360308464805

Subject:Software engineering

Abstract/Summary:

With the rapid development of information technology, enterprises generate more and more data. However, this data does not by itself yield the information they need, so we often face the frustrating situation of a data explosion without knowledge. The value of data depends on its quality, and decisions based on poor data are unreliable. Large, messy data sets have become the bottleneck for data applications. As a result, data cleaning has become a hot topic, since it is the main means of improving data quality. The main tasks of data cleaning are filling in missing values and detecting similar records. This paper focuses on detecting similar records.

This paper first introduces the significance of similar record detection and the current state of research on it at home and abroad, then elaborates the basic principles of detecting similar records and the motivation for using clustering algorithms to do so. It also outlines the main methods for detecting similar records and the main clustering algorithms in use at home and abroad. Similar record detection is based on matching and comparing the semantics of records, and formalizing the records is the key step. Formalization means extracting the information that represents the semantics of a record and then transforming the record, according to fixed rules, into a representation that a computer can process. This paper uses the vector space model to formalize records: each dimension of the vector represents one field of the record.

Similar records lie very close together, and the DBSCAN clustering algorithm has a shortcoming in this setting: it tends to merge similar records into one large cluster. This paper therefore focuses on applying Chinese text clustering to similar record detection. It proposes an improvement to the DBSCAN algorithm that introduces the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS) and uses it to build an inverted index that partitions the original data. Each partition is then clustered to detect similar records. The paper details the design and implementation of the algorithm.
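
The pipeline described in the abstract (formalize each record field by field, partition the data with an inverted index, then cluster inside each partition) can be illustrated with a minimal Python sketch. This is not the thesis's improved algorithm, whose details the abstract does not give: it uses textbook DBSCAN, and every name in it (tokenize, record_distance, detect_similar, the eps and min_pts defaults) is hypothetical. Whitespace splitting stands in for ICTCLAS word segmentation, and the per-field distance (1 minus Jaccard similarity of token sets) is an assumption chosen only for illustration.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: whitespace splitting stands in for a real word
    # segmenter such as ICTCLAS, which Chinese text would need.
    return text.lower().split()

def build_inverted_index(records):
    """Map each token to the set of record ids whose fields contain it."""
    index = defaultdict(set)
    for rid, fields in enumerate(records):
        for field in fields:
            for tok in tokenize(field):
                index[tok].add(rid)
    return index

def field_distance(a, b):
    """1 - Jaccard similarity of the token sets of two field values."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def record_distance(r1, r2):
    """One dimension per field: average the per-field distances."""
    return sum(field_distance(a, b) for a, b in zip(r1, r2)) / len(r1)

def dbscan(ids, records, eps, min_pts):
    """Textbook DBSCAN restricted to the given record ids."""
    def neighbors(p):
        # A point is its own neighbor, so min_pts counts the point itself.
        return [q for q in ids
                if record_distance(records[p], records[q]) <= eps]

    visited, assigned, clusters = set(), set(), []
    for p in ids:
        if p in visited:
            continue
        visited.add(p)
        seeds = neighbors(p)
        if len(seeds) < min_pts:
            continue  # not a core point; may still join a cluster as a border point
        cluster = {p}
        assigned.add(p)
        queue = list(seeds)
        while queue:
            q = queue.pop()
            if q not in visited:
                visited.add(q)
                q_neighbors = neighbors(q)
                if len(q_neighbors) >= min_pts:
                    queue.extend(q_neighbors)  # q is a core point: keep expanding
            if q not in assigned:
                cluster.add(q)
                assigned.add(q)
        clusters.append(sorted(cluster))
    return clusters

def detect_similar(records, eps=0.4, min_pts=2):
    """Cluster inside each inverted-index partition; return groups of ids."""
    found = set()
    for ids in build_inverted_index(records).values():
        if len(ids) < 2:
            continue  # a token seen in one record yields no candidates
        for cluster in dbscan(sorted(ids), records, eps, min_pts):
            if len(cluster) > 1:
                found.add(tuple(cluster))
    return sorted(found)

# Made-up sample records: (name, address).
records = [
    ("John Smith", "12 Main Street"),
    ("John Smith", "12 Main St"),
    ("Mary Jones", "9 Oak Avenue"),
    ("Mary K Jones", "9 Oak Avenue"),
    ("Peter Brown", "40 Main Street"),
]
print(detect_similar(records))
```

Because only records that share a token are ever compared, each clustering run sees a small partition rather than the whole data set. In this sketch the partitions overlap, so a real implementation would likely need to merge groups reported from different partitions.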

Keywords/Search Tags:

data cleaning, clustering algorithm, detecting of similar records, ICTCLAS

Related items

1	Research On Data Cleaning Algorithm Based On Clustering
2	Design And Implementation Of Customer Information Cleaning In CRM System
3	Towards Data-Mining: Data Cleaning Based On Clustering Techniques
4	Research And Implementation Of Data Cleansing Based On Clustering Algorithm
5	Some Main Technology's Research Of Data Cleaning
6	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
7	Similar Repetitive Record Detection Method In Uncertainty Database
8	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM
9	Research On Detection Of Approximate Duplicate Records For Massive Data
10	Research On Data Cleaning Of Approximately Duplicated Records