Research On Data Cleaning Method Based On Optimal Feature Selection

Posted on:2012-07-23

Degree:Master

Type:Thesis

Country:China

Candidate:J E Yang

Full Text:PDF

GTID:2218330368982410

Subject:Computer software and theory

Abstract/Summary:

Society has entered the information age, and making the right decisions on the basis of correct information is crucial to enterprises. Many enterprises have set up their own databases in preparation for mining further information that can support strategic decisions. The data in such a database is collected from multiple independent operational systems, so the original data is often incorrect because of data-entry errors such as inconsistent wording and misspellings, which directly affects the correctness of decisions based on that data. It is therefore necessary to clean the data. One of the key steps is detecting approximately duplicated records: records that in fact refer to the same entity but cannot be identified as duplicates because of differences in written form and spelling. This paper presents the background and significance of the research, reviews the current state of data cleaning in China and abroad, and describes the definition and necessity of data cleaning together with its principles, basic process and methods. It analyses techniques for attribute cleaning, duplicated-record cleaning and preprocessing. It then focuses on the detection of approximately duplicated records and proposes a detection method based on optimal attribute feature selection. Using a chosen key field and a digit position within that field, and following a clustering idea, the method divides the large data set into multiple small data sets according to the zone bit code of the characters. The attribute feature optimal selection method is then applied to each small data set to select its characteristic attributes.
Following that, it applies the field mapping technique to approximately duplicated record detection according to the attribute weights and a valid weight value strategy. To avoid missing records because an improper key field was chosen, a multiple-detection method can be used. Experimental results show that the proposed method is more precise in detection and more time-efficient. On the basis of analysing and studying many data cleaning algorithms and data cleaning architectures, a data cleaning architecture is designed, and the paper elaborates the main functions and cleaning flow of each module of the system architecture.
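
The partition-then-compare idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that the "zone bit code" is the zone number of a character in GB2312, that the attribute weights have already been produced by some feature selection step (here they are simply passed in), and that field similarity is measured with the standard library's `difflib.SequenceMatcher`. The function names, the `id` field and the `0.85` threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def zone_code(ch):
    """Return the GB2312 zone number of a character, or 0 if it has none.

    Assumption: the abstract's "zone bit code" refers to this value.
    """
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return 0  # character is outside GB2312
    if len(raw) != 2:
        return 0  # ASCII, empty string, or other single-byte input
    return raw[0] - 0xA0


def partition_by_zone(records, key_field, position=0):
    """Split records into small blocks keyed by the zone code of one
    character (at `position`) of the chosen key field."""
    blocks = defaultdict(list)
    for rec in records:
        value = rec.get(key_field, "")
        ch = value[position] if len(value) > position else ""
        blocks[zone_code(ch)].append(rec)
    return blocks


def record_similarity(a, b, weights):
    """Weighted average of per-field string similarities.

    `weights` maps field name to weight; it stands in for the output of
    the attribute selection step, which is not reproduced here.
    """
    total = sum(weights.values())
    if not total:
        return 0.0
    score = sum(
        w * SequenceMatcher(None, a.get(f, ""), b.get(f, "")).ratio()
        for f, w in weights.items()
    )
    return score / total


def find_duplicates(records, key_field, weights, threshold=0.85):
    """Compare records only within the same block and return id pairs
    whose weighted similarity reaches the (hypothetical) threshold."""
    pairs = []
    for block in partition_by_zone(records, key_field).values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if record_similarity(block[i], block[j], weights) >= threshold:
                    pairs.append((block[i]["id"], block[j]["id"]))
    return pairs
```

Under these assumptions, the multiple-detection step mentioned in the abstract could be approximated by calling `find_duplicates` once per candidate key field and taking the union of the returned pairs, so that a poor choice of one key field does not hide a duplicate.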

Keywords/Search Tags:

data cleaning, area code, attribution optimal selection, approximately duplicate records, cleaning system architecture

Related items

1	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
2	Research On Data Cleaning Of Approximately Duplicated Records
3	Research On Detection Of Approximate Duplicate Records For Massive Data
4	Similar Repetitive Record Detection Method In Uncertainty Database
5	Some Main Technology's Research Of Data Cleaning
6	Design And Implementation Of Customer Information Cleaning In CRM System
7	Research Of Data Cleaning Method Based On Data Warehouse
8	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM
9	Data Bank Data Warehouse Build Process Of Cleaning And VIP Clients Of The Excavation
10	Study Of Data Cleaning Algorithms Based On Data Warehouse