Integration of multiple prediction models for centralized and distributed knowledge discovery in databases

Posted on:2003-07-27

Degree:Ph.D

Type:Thesis

University:Temple University

Candidate:Lazarevic, Aleksandar Milorad

GTID:2468390011989620

Subject:Computer Science

Abstract/Summary:

Data mining systems aim to discover patterns and extract useful information from facts recorded in very large databases. One means of acquiring knowledge from databases is to apply machine learning algorithms that compute descriptive representations of the data as well as patterns that the data may exhibit.

In the contemporary machine learning community, it is well known that combining many different predictors can be an effective technique for improving prediction accuracy. All combining methods can be categorized as the results of two parallel lines of study: (1) expert methods and (2) ensemble methods. In this thesis, both kinds of combination methods are investigated in order to further improve prediction accuracy. The first proposed algorithm is developed for regression tasks, where a sequence of clustering steps is followed by local regression. The second proposed algorithm combines specialized classification models through a boosting algorithm. It is designed for heterogeneous databases with attribute instability, where, instead of a single global classifier for each boosting round, there are specialized classifiers responsible for each homogeneous region.

Most of the current generation of machine learning algorithms, however, are computationally complex and require all data to be resident in main memory, which is clearly untenable for many realistic problems and databases. Therefore, in this dissertation we also investigate data mining techniques that scale up to large and physically distributed data sets. An innovative distributed clustering algorithm as well as parallel and distributed boosting algorithms are proposed to solve this problem.

We also propose several novel techniques for reducing large databases through controlled sampling. In addition, we investigate the possibilities for selecting optimal subsets of prediction models by eliminating the most correlated and the least accurate ones.

Finally, we present a software system for data analysis and modeling that provides flexible machine learning tools for supporting an interactive and automated knowledge discovery process in large centralized or distributed databases. Special emphasis is placed on data mining for spatial, business, and medical databases.

All proposed algorithms are evaluated on several publicly available and several synthetic data sets. The experimental results show improvements in the algorithms' performance compared to standard machine learning and data mining methods.
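
The abstract mentions a regression approach in which clustering is followed by local regression, but it does not give the algorithm itself. The following is only a minimal, generic sketch of that idea, not the thesis's method: it assumes scalar inputs, a single plain k-means pass (the thesis speaks of a sequence of clustering steps), and one least-squares line per cluster. All function names are hypothetical.

```python
from statistics import mean


def nearest(x, centers):
    """Index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(xs, k, iters=20):
    """Plain Lloyd's k-means on scalars (an assumed stand-in clustering step)."""
    s = sorted(xs)
    # Deterministic start: evenly spaced quantiles of the sorted inputs.
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centers)].append(x)
        # An empty cluster keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b on one cluster."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def fit_local_models(xs, ys, k):
    """Cluster the inputs, then fit one local regression model per cluster."""
    centers = kmeans_1d(xs, k)
    models = []
    for j in range(k):
        pts = [(x, y) for x, y in zip(xs, ys) if nearest(x, centers) == j]
        # Fall back to a constant model if a cluster received no points.
        models.append(fit_line(*zip(*pts)) if pts else (0.0, mean(ys)))
    return centers, models


def predict(x, centers, models):
    """Route x to its nearest cluster and apply that cluster's line."""
    a, b = models[nearest(x, centers)]
    return a * x + b
```

On data drawn from two regions with different linear trends, `fit_local_models(xs, ys, 2)` fits a separate line to each region, which a single global line could not match.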

Keywords/Search Tags:

Data, Machine learning, Distributed, Mining, Prediction, Methods, Models, Large

Related items

1	Application of Machine Learning and Statistical Learning Methods for Prediction In A Large-Scale Vegetation Ma
2	Research On Distributed Optimization Methods For Large-Scale Machine Learning
3	Research On Application Of Machine Learning And Data Mining In Bioinformatics
4	Scalable Sparse Machine Learning Methods for Big Dat
5	Research And Implementation Of Unified Large Data Mining Service Platform Based On Spark MLlib
6	Application Of Machine Learning Algorithms In Data Mining
7	Research On Intrusion Detection Techniques Based On Machine Learning And Data Mining Methods
8	Research On Machine Learning Based Multi-source Heterogeneous Data Mining For Risk Prediction
9	Machine learning methods and models for ranking
10	Methods for large-scale machine learning and computer vision