Data Cleansing In The Detection Of Similar Records

Posted on: 2011-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: M J Xie
Full Text: PDF
GTID: 2178360308464805
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, enterprises create more and more data. However, these data often fail to yield the information that is needed, so we face the frustrating situation of a data explosion without knowledge. The value of data depends on its quality, so decisions based on poor data are unreliable. Large, messy data sets have become the bottleneck of data applications. As a result, data cleaning has become a hot topic, since it is the main way to improve data quality. The main tasks of data cleaning are filling in missing values and detecting similar records. This paper focuses on detecting similar records. It first introduces the significance of similar record detection and the current state of research at home and abroad, then elaborates the basic principles of similar record detection and the reasons for applying clustering algorithms to it. It also outlines the main methods for detecting similar records and the main clustering algorithms in use at home and abroad. Similar record detection is based on matching and comparing the semantics of records, and the formalization of records is the key step. Formalization means extracting the information that represents the semantics of a record and then converting the record, according to certain rules, into a form that a computer can process. This paper uses the vector space model to formalize records: each dimension of the vector represents one field of the record.
Similar records lie very close together, and a weakness of the DBSCAN clustering algorithm is that it tends to merge such records into one large cluster. To address this, the paper applies a Chinese text clustering approach to similar record detection and proposes an improvement to the DBSCAN algorithm. It introduces the Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS) and uses it to build an inverted index that partitions the original data into areas. Each area is then clustered to detect similar records. The paper describes the design and implementation of the algorithm in detail.
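The pipeline described above (field-wise record vectors, an inverted index that partitions the data, then clustering inside each partition) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, a lowercase whitespace split stands in for ICTCLAS word segmentation, the distance is one minus the mean per-field Jaccard similarity, and the clustering step is textbook DBSCAN rather than the improved variant the paper proposes.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in for a word segmenter such as ICTCLAS (assumption):
    # a lowercase whitespace split.
    return set(text.lower().split())


def record_distance(a, b):
    # Each field is one dimension of the record vector. The distance is
    # 1 minus the mean per-field Jaccard similarity (an assumed measure).
    sims = []
    for field_a, field_b in zip(a, b):
        ta, tb = tokenize(field_a), tokenize(field_b)
        union = ta | tb
        sims.append(len(ta & tb) / len(union) if union else 1.0)
    return 1.0 - sum(sims) / len(sims)


def build_inverted_index(records):
    # Map each token to the ids of the records that contain it.
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for field in rec:
            for tok in tokenize(field):
                index[tok].add(rid)
    return index


def dbscan(ids, records, eps, min_pts):
    # Textbook DBSCAN restricted to the given record ids.
    ids = sorted(ids)

    def neighbors(p):
        return [q for q in ids
                if record_distance(records[p], records[q]) <= eps]

    labels = {}      # record id -> cluster number, or None for noise
    clusters = []
    for p in ids:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = None          # noise; may become a border point later
            continue
        cluster = {p}
        labels[p] = len(clusters)
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) is not None:
                continue              # already assigned to a cluster
            was_noise = q in labels
            labels[q] = len(clusters)
            cluster.add(q)
            if was_noise:
                continue              # border point: do not expand from it
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:
                queue.extend(q_nbrs)  # q is a core point: keep expanding
        clusters.append(cluster)
    return clusters


def detect_similar_records(records, eps=0.4, min_pts=2):
    # Treat every posting list of the inverted index as one area and
    # cluster it separately; identical groups are collapsed by the set.
    found = set()
    for block in build_inverted_index(records).values():
        if len(block) >= min_pts:
            for cluster in dbscan(block, records, eps, min_pts):
                found.add(frozenset(cluster))
    return found


if __name__ == "__main__":
    sample = [
        ("John Smith", "12 Oak Street"),
        ("Jon Smith", "12 Oak Street"),
        ("Mary Jones", "7 Pine Road"),
    ]
    print(detect_similar_records(sample))
```

Because only records that share at least one token are ever compared, the distance computation stays local to each area instead of covering all record pairs. In this simplified version, different areas can report overlapping groups, so a real implementation would likely need a merging step.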
Keywords/Search Tags: data cleaning, clustering algorithm, similar record detection, ICTCLAS