Research On Data Cleaning Method Based On Optimal Feature Selection

Posted on: 2012-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: J E Yang
Full Text: PDF
GTID: 2218330368982410
Subject: Computer software and theory
Abstract/Summary:
Society has now entered the information age, and making the right decisions based on the right information is crucial to enterprises. Many enterprises have set up their own databases in preparation for mining further information that can support strategic decisions. The data in such a database is collected from multiple independent operational systems, so the original data often contains errors introduced at data entry, such as inconsistent wording and misspellings, which directly affect the correctness of decisions based on that data. Cleaning the data is therefore necessary, and one of the key steps is detecting approximately duplicate records: records that describe the same real-world entity but cannot be identified as identical because they differ in written form or spelling.
This thesis presents the background and significance of the research, surveys the current state of data cleaning in China and abroad, and explains the definition and necessity of data cleaning along with its principles, basic process, and methods. It analyzes techniques for attribute cleaning, duplicate record cleaning, and preprocessing. It then focuses on approximately duplicate record detection and proposes a detection method based on optimal attribute feature selection. Using the chosen key field, the digit positions of that field, and a clustering idea, the method divides the large data set into multiple small data sets based on the zone bit codes of characters. An attribute feature selection method is then applied to each small data set to select its characteristic attributes.
Following that, the method applies the field mapping technique to approximately duplicate record detection according to the attribute weights and a valid weight value strategy. To avoid missing records because of an improperly chosen key field, multiple detection passes can be used. Experimental results show that the proposed method is more precise in detection and more time-efficient. Finally, based on an analysis of many data cleaning algorithms and architectures, a data cleaning system architecture is designed, and the thesis elaborates the main functions and cleaning flow of each module of that architecture.
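The pipeline described in the abstract (partition by zone bit code, compare records inside each partition with weighted fields, repeat with other key fields) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the use of `difflib.SequenceMatcher` as the field similarity, the hand-picked weights, and the 0.85 threshold are all assumptions made for illustration, and the attribute selection step is replaced by a fixed weight table. The zone bit code is computed as GB2312 bytes minus 0xA0, with a fallback for characters outside GB2312.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def zone_bit_code(ch):
    """Return (zone, position) for a GB2312 character, else (0, code)."""
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return (0, ord(ch))          # not representable in GB2312
    if len(raw) == 2:
        return (raw[0] - 0xA0, raw[1] - 0xA0)
    return (0, raw[0])               # single-byte (ASCII) character


def block_key(record, key_field, n_chars=1):
    """Hypothetical blocking key built from the first n_chars of the key field."""
    value = record[key_field].strip().lower()
    key = []
    for ch in value[:n_chars]:
        zone, pos = zone_bit_code(ch)
        # Group Chinese characters by zone; keep other characters distinct.
        key.append((zone, 0) if zone else (0, pos))
    return tuple(key)


def record_similarity(a, b, weights):
    """Weighted average of per-field string similarities (assumed measure)."""
    total = sum(weights.values())
    score = sum(
        w * SequenceMatcher(None, a[f], b[f]).ratio()
        for f, w in weights.items()
    )
    return score / total


def find_near_duplicates(records, key_field, weights, threshold=0.85):
    """Compare records only within the same block; return index pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec, key_field)].append(idx)
    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


def multi_pass(records, key_fields, weights, threshold=0.85):
    """Repeat detection with several key fields and merge the results."""
    found = set()
    for key_field in key_fields:
        found.update(find_near_duplicates(records, key_field, weights, threshold))
    return sorted(found)
```

Blocking keeps the number of pairwise comparisons small, but a typo in the first character of the key field puts a record in the wrong block. Running `multi_pass` with a second key field is one way to recover such records, at the cost of extra comparisons.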
Keywords/Search Tags: data cleaning, area code, attribute optimal selection, approximately duplicate records, cleaning system architecture