
Integration of multiple prediction models for centralized and distributed knowledge discovery in databases

Posted on: 2003-07-27
Degree: Ph.D.
Type: Thesis
University: Temple University
Candidate: Lazarevic, Aleksandar Milorad
Full Text: PDF
GTID: 2468390011989620
Subject: Computer Science
Abstract/Summary:
Data mining systems aim to discover patterns and extract useful information from facts recorded in very large databases. One means of acquiring knowledge from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data.

In the contemporary machine learning community, it is well known that combining many different predictors can be an effective technique for improving prediction accuracy. Combining methods can be categorized as the results of two parallel lines of study: (1) expert methods and (2) ensemble methods. In this thesis, both kinds of combination methods are investigated in order to further improve prediction accuracy. The first proposed algorithm is developed for regression tasks, where a sequence of clustering steps is followed by local regression. The second proposed algorithm combines specialized classification models through a boosting algorithm. It is designed for heterogeneous databases with attribute instability, where instead of a single global classifier for each boosting round, there are specialized classifiers responsible for each homogeneous region.

Most of the current generation of machine learning algorithms, however, are computationally complex and require all data to be resident in main memory, which is clearly untenable for many realistic problems and databases. Therefore, in this dissertation we also investigate data mining techniques that scale up to large and physically distributed data sets. An innovative distributed clustering algorithm as well as parallel and distributed boosting algorithms are proposed to solve this problem.

We also propose several novel techniques for reducing large databases through controlled sampling.
In addition, we investigate the possibilities for selecting optimal subsets of prediction models by eliminating the most correlated and the least accurate ones.

Finally, we present a software system for data analysis and modeling that provides flexible machine learning tools for supporting an interactive and automated knowledge discovery process in large centralized or distributed databases. Special emphasis is placed on data mining for spatial, business, and medical databases.

All proposed algorithms are evaluated on several publicly available and several synthetic data sets. The experimental results show improvements in the algorithms' performance compared to standard machine learning and data mining methods.
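
The idea of selecting a subset of prediction models by eliminating the least accurate and the most correlated ones can be illustrated with a minimal sketch. The abstract does not state the thesis's actual selection criteria, so everything here is an assumption made for illustration: the function names, the use of pairwise agreement rate as a stand-in for correlation, and both thresholds are hypothetical.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of validation examples a model classifies correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def agreement(preds_a, preds_b):
    """Fraction of examples on which two models give the same output.

    Assumption: used as a simple proxy for correlation between models.
    """
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def prune_ensemble(predictions, labels, min_accuracy=0.6, max_agreement=0.9):
    """Return the names of the models kept after pruning.

    predictions: dict mapping model name -> list of predicted labels
    on a shared validation set. Both thresholds are illustrative.
    """
    acc = {name: accuracy(p, labels) for name, p in predictions.items()}

    # Step 1: discard models whose accuracy falls below the threshold.
    kept = [name for name in predictions if acc[name] >= min_accuracy]

    # Step 2: while the most similar remaining pair agrees too often,
    # remove the less accurate member of that pair.
    while len(kept) > 1:
        a, b = max(
            combinations(kept, 2),
            key=lambda pair: agreement(predictions[pair[0]], predictions[pair[1]]),
        )
        if agreement(predictions[a], predictions[b]) <= max_agreement:
            break
        kept.remove(a if acc[a] < acc[b] else b)

    return kept
```

Calling `prune_ensemble` with each model's predictions on a held-out validation set returns the names of the models to retain. A greedy pass like this is only one possible design; it does not guarantee an optimal subset.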
Keywords/Search Tags: Data, Machine learning, Distributed, Mining, Prediction, Methods, Models, Large