
Design And Implementation Of Data Preprocessing System Oriented To Data Mining

Posted on: 2012-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: F G Zhao
Full Text: PDF
GTID: 2178330335951276
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of informatization, most enterprises have accumulated huge amounts of data, and many of them invest considerable effort in applying these data to decisions about their own development. Data mining is used to discover implicit but extremely useful information and to make effective use of the data. However, real databases contain many errors, caused by poorly designed database schemas, improper data management and maintenance, typing errors, and so on. In addition, the exchange of data between enterprises also introduces many inconsistencies. These problems may seriously affect data mining tasks. Therefore, it is important to use data preprocessing techniques to improve data quality.

Firstly, this paper introduces the theory of data preprocessing and gives an overview of all its parts, based on recent references.

Secondly, this paper divides data preprocessing into 6 parts, according to the tasks involved in data preprocessing and their application and research. A data-mining-oriented data preprocessing system is then implemented based on this partition. The system contains two subsystems, the Data Format Transformation Subsystem (DFTS) and the Preprocessing Implementing Environment Subsystem (PIES):

(1) The DFTS can connect to various kinds of data sources, including databases and flat files. It can view, operate on, and transform data sources with a uniform method.
(2) The main functions of the PIES include data quality checking, missing value filling, data normalization, noise smoothing, and duplicate detection.

Finally, this paper studies duplicate detection in detail, including attribute similarity matching, record set partitioning (blocking), and the record matcher. All these techniques are implemented in the system. This paper also improves the suffix array algorithm, which is an efficient strategy for blocking the record set but has a defect in dealing with inconsistencies at the end of records. The improvement merges similar suffix index entries using the sliding window method, which enables the algorithm to handle this problem more accurately.

Using the system for preprocessing can improve data quality. It makes the data more suitable for data mining algorithms and makes the preprocessing process more convenient.
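
The blocking idea summarized in the abstract can be illustrated with a short sketch. The abstract does not give the thesis's actual algorithm, so everything below is an assumption made for illustration: the function names, the minimum suffix length, the maximum block size, the window size, the similarity threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure. The sketch indexes each record's blocking key by its suffixes, then slides a window over the sorted suffixes and merges the blocks of neighbouring suffixes that are sufficiently similar.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def suffixes(key, min_len):
    """Return every suffix of key with at least min_len characters."""
    if len(key) <= min_len:
        return [key]
    return [key[i:] for i in range(len(key) - min_len + 1)]


def build_index(keys, min_len=4, max_block=10):
    """Map each suffix to the ids of the records whose key contains it.

    keys maps a record id to its blocking key (hypothetical input format).
    Blocks larger than max_block are dropped as too unselective.
    """
    index = defaultdict(set)
    for rec_id, key in keys.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)
    return {s: ids for s, ids in index.items() if len(ids) <= max_block}


def merge_similar(index, window=3, threshold=0.85):
    """Merge blocks of similar suffixes that fall in the same sliding window.

    Each suffix is compared with the next window - 1 suffixes in sorted order.
    """
    ordered = sorted(index)
    parent = list(range(len(ordered)))  # union-find over suffix positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            ratio = SequenceMatcher(None, ordered[i], ordered[j]).ratio()
            if ratio >= threshold:
                parent[find(j)] = find(i)

    merged = defaultdict(set)
    for i, s in enumerate(ordered):
        merged[find(i)] |= index[s]
    return list(merged.values())


def candidate_pairs(blocks):
    """Return all record id pairs that share at least one block."""
    pairs = set()
    for ids in blocks:
        pairs.update(combinations(sorted(ids), 2))
    return pairs


keys = {1: "katherine", 2: "katherina", 3: "smith"}
index = build_index(keys)
plain = candidate_pairs(index.values())
merged = candidate_pairs(merge_similar(index))
```

With these sample keys, records 1 and 2 differ only in their last character, so they share no suffix and the plain suffix index yields no candidate pair. After merging, the neighbouring suffixes "atherina" and "atherine" are similar enough to be joined, and the pair (1, 2) becomes a candidate for the record matcher.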
Keywords/Search Tags: Data Preprocessing, Data Cleaning, Data Transformation, Duplicate Detection, Records Blocking