
Distributed data mining using stochastic gradient boosting

Posted on: 2005-10-09
Degree: M.Sc
Type: Thesis
University: Queen's University at Kingston (Canada)
Candidate: Pu, Jinyan
Full Text: PDF
GTID: 2458390008981688
Subject: Computer Science
Abstract/Summary:
Data mining is an information extraction process that aims to discover hidden knowledge contained in databases. Traditional data mining techniques, which have been used in a wide variety of applications, assume centralized data storage. However, as the amount of data available worldwide grows faster than processing speed, these techniques are constantly challenged by the demand for efficient processing of massive volumes of data. Such data are also often dispersed geographically. Moreover, in some applications, privacy protection policies prohibit data disclosure and relocation, which renders these techniques unusable. Distributed data mining (DDM) addresses these issues by distributing the learning process to local sites and transferring only the minimum amount of necessary data to a central site for integration. DDM has attracted growing attention since its inception, and a few frameworks have been proposed. Nevertheless, DDM remains an underdeveloped area with a large number of open issues.

In this thesis, we propose a new DDM technique based on the Stochastic Gradient Boosting algorithm. The technique, chaining distributed gradient boosting models, exploits the fact that Stochastic Gradient Boosting builds additive, stagewise models, and it solves the data mining problem for vertically partitioned datasets (i.e., datasets partitioned by attributes). Base models are built locally from different attributes, and their updates to a target function are combined to form a global model. The chaining technique has the advantage of completely eliminating raw data communication.

We show empirically that the chaining technique can achieve high accuracy. A strong inter-site correlation increases the interactions that the base models can capture and consequently yields higher accuracy.
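The chaining idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the data, the regression-stump base learner, the shrinkage value, and the two-site setup are all assumptions chosen for brevity. Each "site" holds a vertical partition (a disjoint subset of attributes); sites take turns fitting a base model to the current residuals, so only the residual vector, never any raw attribute values, passes between sites.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic vertically partitioned data: two hypothetical sites, each
# holding different attributes of the same records.
n = 200
X_site1 = rng.normal(size=(n, 2))   # attributes held at site 1
X_site2 = rng.normal(size=(n, 2))   # attributes held at site 2
y = X_site1[:, 0] + 2.0 * X_site2[:, 1] + 0.1 * rng.normal(size=n)

def fit_stump(X, r):
    """Fit a regression stump (one feature, one threshold) to residuals r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= t
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    _, j, t, lv, rv = best
    return lambda Z: np.where(Z[:, j] <= t, lv, rv)

# Chained boosting: sites alternate adding base models to one additive
# global model.  Only the residual vector r = y - F moves between sites.
nu = 0.3            # shrinkage (learning rate)
F = np.zeros(n)     # current global prediction
sites = [X_site1, X_site2]
models = []         # (site index, base model) pairs making up the chain
for stage in range(30):
    for s, Xs in enumerate(sites):
        r = y - F                  # residuals = negative gradient of squared loss
        h = fit_stump(Xs, r)       # base model built from local attributes only
        F = F + nu * h(Xs)
        models.append((s, h))

mse = ((y - F) ** 2).mean()
print("final training MSE:", round(mse, 3))
```

The key property is visible in the loop body: `fit_stump` sees only the local partition `Xs`, while the shared state is the length-`n` residual vector, which is what makes the chaining communication-light.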
We also explore the relationships among different measurements of the base models and their potential use as weights on the base models for a possible accuracy improvement.
Keywords/Search Tags:Data, Stochastic gradient, Gradient boosting, Base models, Distributed, DDM