Design And Implementation Of Data Preprocessing System Oriented To Data Mining

Posted on:2012-11-27

Degree:Master

Type:Thesis

Country:China

Candidate:F G Zhao

Full Text:PDF

GTID:2178330335951276

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of informatization, most enterprises have accumulated huge amounts of data, and many of them invest considerable effort in applying these data to decisions about their own development. Data mining discovers implicit but extremely useful information and makes effective use of the data. However, real databases contain many errors, caused by poorly designed database schemas, improper data management and maintenance, typing errors, and so on. In addition, the exchange of data between enterprises introduces many inconsistencies. These problems can seriously affect data mining tasks, so it is important to use data preprocessing techniques to improve data quality.

First, this paper introduces the theory of data preprocessing and, based on recent references, gives an overview of all of its parts.

Second, this paper divides data preprocessing into six parts, according to the tasks involved and their application and research. A data-mining-oriented data preprocessing system is implemented on the basis of this division. The system contains two subsystems, the Data Format Transformation Subsystem (DFTS) and the Preprocessing Implementing Environment Subsystem (PIES):

(1) The DFTS can connect to all kinds of data sources, including databases and flat files, and can view, operate on, and transform them in a uniform way.

(2) The main functions of the PIES include data quality checking, missing value filling, data normalization, noise smoothing, and duplicate detection.

Finally, this paper studies duplicate detection in detail, including attribute similarity matching, record set partitioning (blocking), and the record matcher. All of these techniques are implemented in the system. The paper also improves the suffix array algorithm, which is an efficient strategy for blocking the record set but has a defect in dealing with inconsistencies at the end of records. The improvement merges similar suffix index entries using the sliding window method, which lets the algorithm handle this problem more accurately.

Using the system for preprocessing can improve data quality, make the data more suitable for data mining algorithms, and make the preprocessing process more convenient.
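The blocking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the minimum suffix length, the window size, the similarity threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The sketch indexes every sufficiently long suffix of a record's blocking key, then slides a window over the sorted suffixes and merges the blocks of suffixes that are similar, so that keys differing only near the end can still share a block.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def suffixes(key, min_len):
    """All suffixes of `key` that are at least `min_len` characters long."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def build_index(records, min_len=4):
    """Inverted index: suffix -> set of record ids whose key has that suffix."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)
    return index


def merge_similar(index, window=3, threshold=0.8):
    """Slide a window over the sorted suffixes and union the blocks of
    suffixes whose similarity ratio reaches `threshold` (assumed measure)."""
    keys = sorted(index)
    parent = {k: k for k in keys}  # union-find over suffix strings

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for i, a in enumerate(keys):
        # Compare each suffix with the next (window - 1) suffixes in sort order.
        for b in keys[i + 1:i + window]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(b)] = find(a)

    merged = defaultdict(set)
    for k in keys:
        merged[find(k)] |= index[k]
    return merged


def candidate_pairs(blocks, max_block_size=10):
    """Record pairs sharing a block; oversized blocks are skipped."""
    pairs = set()
    for ids in blocks.values():
        if 1 < len(ids) <= max_block_size:
            ordered = sorted(ids)
            for i, a in enumerate(ordered):
                for b in ordered[i + 1:]:
                    pairs.add((a, b))
    return pairs


# Hypothetical blocking keys; records 1 and 2 differ only in the last character.
records = {1: "johnsmith", 2: "johnsmitt", 3: "marybrown"}
plain_pairs = candidate_pairs(build_index(records))
merged_pairs = candidate_pairs(merge_similar(build_index(records)))
```

With these example keys, the plain suffix index shares no suffix between records 1 and 2, because the difference at the end of the key changes every suffix. After merging, the two records fall into a common block and become a candidate pair for the record matcher. The threshold and window size trade recall against the number of comparisons and would need tuning on real data.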

Keywords/Search Tags:

Data Preprocessing, Data Cleaning, Data Transformation, Duplicate Detection, Records Blocking

Related items

1	Research Of Data Cleansing Algorithms For Duplicate Records Detection Problem
2	Towards Data-Mining: Data Cleaning Based On Clustering Techniques
3	Research On Detection Of Approximate Duplicate Records For Massive Data
4	Research On Duplicate Records Identification Model In Deep Web
5	Data Bank Data Warehouse Build Process Of Cleaning And VIP Clients Of The Excavation
6	Research And Implementation Of Health Big Data Preprocessing Methods
7	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM
8	Study Of Data Cleaning Algorithms Based On Data Warehouse
9	Design And Implementation Of Customer Information Cleaning In CRM System
10	Similar Repetitive Record Detection Method In Uncertainty Database