Data quality in data mining and machine learning

Posted on: 2008-05-15

Degree: Ph.D.

Type: Dissertation

University: Florida Atlantic University

Candidate: Van Hulse, Jason

GTID: 1448390005979472

Subject: Statistics

Abstract/Summary:

With advances in data storage and data transmission technologies, and given the increasing use of computers by both individuals and corporations, organizations are accumulating an ever-increasing amount of information in data warehouses and databases. This surge in data, however, has made extracting useful, actionable, and interesting knowledge from it extremely difficult. In response to the challenges of operating in a data-intensive environment, the fields of data mining and machine learning (DM/ML) have provided solutions that help uncover knowledge buried within data.

DM/ML techniques use automated (or semi-automated) procedures to process vast quantities of data in search of interesting patterns. These techniques do not create knowledge; instead, the implicit assumption is that knowledge is present within the data, and that these procedures are needed to uncover interesting, important, and previously unknown relationships. The quality of the data is therefore critical to successful analysis. High-quality data, i.e., data that is (relatively) free from errors and suitable for use in data mining tasks, is a necessary precondition for extracting useful knowledge.

Given the important role played by data quality, this dissertation investigates data quality and its impact on DM/ML. First, we propose several innovative procedures for coping with low-quality data. Another aspect of data quality, the occurrence of missing values, is also explored. Finally, a detailed experimental evaluation of learning from noisy and imbalanced datasets is provided, supplying valuable insight into how class noise in skewed datasets affects learning algorithms.

Keywords/Search Tags:

Data, DM/ML

Related items

1	Seismic Achievement Data ETL Platform Architecture Design And Software System Implementation
2	The Research And Application Of Data Preprocessing In XML Data Warehouse
3	Research On Related Issues Of Unstructured Data
4	The Data Integration, Analysis And Utilization For Hospital Information Based On The Data Warehouse
5	Design And Implementation Of Data Mining Support Subsystem Based On Big Data Of Power
6	Design And Implementation Of Environmental Monitoring Data Management System
7	Research On The Problems And Countermeasures Of Domestic Data Journalism Practice
8	Study On Data Dependency-Based Data Quality Processing Techniques In Data Integration
9	Big Data And Research Of Big Data In Modern Internet Applications
10	Design And Implementation Of The Bayonet Data Integration Platform