
Distributed data mining using stochastic gradient boosting

Posted on: 2005-10-09
Degree: M.Sc
Type: Thesis
University: Queen's University at Kingston (Canada)
Candidate: Pu, Jinyan
Full Text: PDF
GTID: 2458390008981688
Subject: Computer Science
Abstract/Summary:
Data mining is an information extraction process that aims to discover hidden knowledge contained in databases. Traditional data mining techniques, which have been used in a wide variety of applications, assume centralized data storage. However, as the amount of data available worldwide grows faster than processing speed, these techniques are constantly challenged by the demand for efficient processing of massive volumes of data. Such data are also often dispersed geographically. Moreover, in some applications, privacy protection policies prohibit data disclosure and relocation, which renders these techniques unusable. Distributed data mining (DDM) addresses these issues by distributing the learning process to local sites and transferring only the minimum amount of necessary data to a central site for integration. DDM has attracted growing attention since its inception, and a few frameworks have been proposed. Nevertheless, DDM remains an underdeveloped area with a large number of open issues.

In this thesis, we propose a new DDM technique based on the Stochastic Gradient Boosting algorithm. The technique, chaining distributed gradient boosting models, exploits the fact that Stochastic Gradient Boosting builds additive, stagewise models, and it solves the data mining problem for vertically partitioned datasets (i.e., datasets partitioned by attributes). Base models are built locally from different attributes, and their updates to a target function are combined to form a global model. The chaining technique has the advantage of completely eliminating raw data communication.

We show empirically that the chaining technique can achieve high accuracy. A strong inter-site correlation increases the interactions that the base models can capture and consequently yields higher accuracy.
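The chaining idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the data, the regression-stump base learner, the shrinkage value, and the two-site setup are all assumptions chosen for brevity. Each "site" holds a vertical partition (a disjoint subset of attributes); sites take turns fitting a base model to the current residuals, so only the residual vector, never any raw attribute values, passes between sites.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic vertically partitioned data: two hypothetical sites, each
# holding different attributes of the same records.
n = 200
X_site1 = rng.normal(size=(n, 2))   # attributes held at site 1
X_site2 = rng.normal(size=(n, 2))   # attributes held at site 2
y = X_site1[:, 0] + 2.0 * X_site2[:, 1] + 0.1 * rng.normal(size=n)

def fit_stump(X, r):
    """Fit a regression stump (one feature, one threshold) to residuals r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= t
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    _, j, t, lv, rv = best
    return lambda Z: np.where(Z[:, j] <= t, lv, rv)

# Chained boosting: sites alternate adding base models to one additive
# global model.  Only the residual vector r = y - F moves between sites.
nu = 0.3            # shrinkage (learning rate)
F = np.zeros(n)     # current global prediction
sites = [X_site1, X_site2]
models = []         # (site index, base model) pairs making up the chain
for stage in range(30):
    for s, Xs in enumerate(sites):
        r = y - F                  # residuals = negative gradient of squared loss
        h = fit_stump(Xs, r)       # base model built from local attributes only
        F = F + nu * h(Xs)
        models.append((s, h))

mse = ((y - F) ** 2).mean()
print("final training MSE:", round(mse, 3))
```

The key property is visible in the loop body: `fit_stump` sees only the local partition `Xs`, while the shared state is the length-`n` residual vector, which is what makes the chaining communication-light.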
We also explore the relationships among different measurements of the base models and their potential use as weights on the base models for a possible accuracy improvement.
Keywords/Search Tags:Data, Stochastic gradient, Gradient boosting, Base models, Distributed, DDM