
Data Quality Control: Research, Design, And Implementation In Data Preprocessing

Posted on: 2005-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Luan
Full Text: PDF
GTID: 2168360152955528
Subject: Computer applications
Abstract/Summary:
Data mining is the process of discovering knowledge from large volumes of data. At present, most papers concentrate on data mining algorithms while neglecting data preprocessing, which produces data that is complete, low in redundancy, and has meaningful relationships between attributes for further analysis. Large amounts of insignificant data can reduce mining efficiency, and outliers can decrease the precision of algorithms. Data preprocessing has therefore become the crux of data mining system implementation. The contribution of this paper is twofold: data warehouse quality control (Extract-Transform-Load, ETL) and a quality control framework for Web site text. The key items are:

(1) Analyzing the characteristics and difficulties of ETL and presenting an ETL architecture.
(2) Investigating data problems that appear in single and multiple DBMS data sources, at both the schema level and the instance level.
(3) Implementing an enterprise data ETL with shell scripts on an RS/6000 system running AIX.
(4) Designing a scalable framework for text preprocessing: two models (the Vector Space Model, VSM, and the Language Model, LM) and a three-phase quality control algorithm.
(5) Analyzing the main modules of the framework, such as word segmentation, language analysis, modeling, and feature selection.

In view of the characteristics of text streams, we put forward two ad hoc strategies for the framework: (a) a high-speed matching strategy based on similarity; (b) an incremental Support Vector Machine (SVM) training strategy. Extensive experimental results show substantial improvements over the existing incremental SVM training method.
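The similarity-based matching strategy in (a) presumably compares incoming text against stored templates under the Vector Space Model. The abstract does not specify the weighting scheme, tokenizer, or threshold, so all of those (and the function names) below are assumptions; this is only a minimal sketch of template matching by cosine similarity over term-frequency vectors:

```python
import math
from collections import Counter

def vectorize(text):
    """Build a term-frequency (bag-of-words) vector for one document.
    Whitespace tokenization is a placeholder; real text streams would
    need proper word segmentation, as item (5) of the framework notes."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

def best_match(query, templates, threshold=0.5):
    """Return the template most similar to the query,
    or None if nothing clears the (assumed) threshold."""
    qv = vectorize(query)
    scored = [(cosine_similarity(qv, vectorize(t)), t) for t in templates]
    score, template = max(scored)
    return template if score >= threshold else None
```

A speed-oriented variant, as the "high-speed" label suggests, would likely precompute the template vectors and norms rather than rebuilding them per query.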
Keywords/Search Tags:Data mining, Data Preprocessing, Data warehouse, ETL, Template matching, SVM