
A cluster tracking algorithm for distributed data analytics

Posted on: 2013-01-03
Degree: M.S.
Type: Thesis
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Lasluisa, Raul S
Full Text: PDF
GTID: 2458390008974575
Subject: Engineering
Abstract/Summary:
Large-scale data analytics has enabled society to model and inspect its data to the point where useful information can be extracted, conclusions can be drawn, and decision making can be enhanced. The breadth of data being analyzed today has enabled us to make proactive decisions in processes where we otherwise could not. At the same time, the data being analyzed is becoming both larger and more distributed, making it more complex to aggregate the data at a central location and process it in a timely manner in order to make decisions. This can be attributed to the scale of current distributed computational infrastructures used to solve complex problems, which generate an increasing amount of data. This data is created not only by the applications solving problems but also by the systems running those applications, creating a situation where the benefits of centralized data analytics decline relative to those of decentralized approaches.

Data analytics algorithms must therefore meet several new requirements in order to continue to process data in a timely manner. One approach to processing distributed data is to use algorithms that can themselves run in a distributed manner. Using such algorithms benefits a variety of situations where there is a desire to reduce the cost of transporting and subsequently storing data. Examples can be seen in autonomic computing, where the goal is to manage large systems with minimal intervention by administrators, and in scientific visualization, where visualization techniques are performed using a secondary system.

In this work we show that combining online (and distributed) data clustering with cluster tracking can be effectively used to detect meaningful changes in data patterns occurring in multiple streams. In doing so, we provide an alternative to a centralized approach, where data must be centralized before any analytics can be executed. Specifically, we propose a cluster tracking algorithm that takes advantage of a decentralized clustering algorithm to detect changes in data and then make proactive decisions. We demonstrate its accuracy and effectiveness in three different cases: 1) VM provisioning, 2) scheduling of Hadoop resources, and 3) object tracking in scientific applications.
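
The general idea of tracking clusters across successive analysis windows can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `track_clusters`, the greedy nearest-centroid matching, and the `match_radius` and `move_threshold` parameters are all assumptions made for illustration. It takes the centroids tracked in the previous window and those found in the current window, and reports which clusters stayed put, moved, disappeared, or newly appeared.

```python
import math


def track_clusters(prev, curr, match_radius=1.0, move_threshold=0.25):
    """Match current-window centroids against previously tracked clusters.

    prev: dict mapping cluster id -> centroid (tuple of floats)
    curr: list of centroids found in the current window
    Returns (tracked, events), where tracked maps ids to current centroids
    and events is a list of (kind, cluster_id) tuples.

    Hypothetical illustration only; all names and thresholds are assumed.
    """
    # Candidate matches: every (old, new) centroid pair within match_radius,
    # sorted so that the closest pairs are considered first.
    pairs = sorted(
        (math.dist(c_prev, c_new), cid, j)
        for cid, c_prev in prev.items()
        for j, c_new in enumerate(curr)
        if math.dist(c_prev, c_new) <= match_radius
    )

    tracked, events = {}, []
    used_ids, used_idx = set(), set()

    # Greedy one-to-one assignment of old clusters to new centroids.
    for d, cid, j in pairs:
        if cid in used_ids or j in used_idx:
            continue
        used_ids.add(cid)
        used_idx.add(j)
        tracked[cid] = curr[j]
        events.append(("moved" if d > move_threshold else "stable", cid))

    # Old clusters with no nearby centroid in the current window.
    for cid in prev:
        if cid not in used_ids:
            events.append(("disappeared", cid))

    # Unmatched centroids become new clusters. Ids continue from the largest
    # id in prev, so an id retired in an earlier window could be reused.
    next_id = max(prev, default=-1) + 1
    for j, c in enumerate(curr):
        if j not in used_idx:
            tracked[next_id] = c
            events.append(("appeared", next_id))
            next_id += 1

    return tracked, events


previous = {0: (0.0, 0.0), 1: (5.0, 5.0)}
current = [(0.1, 0.0), (9.0, 9.0)]
tracked, events = track_clusters(previous, current)
```

In a setting like those named in the abstract, events such as `appeared` or `moved` would be the signals a decision layer reacts to (for example, a provisioning policy); how the thesis maps such changes to actions is not stated here, so that step is left out of the sketch.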
Keywords/Search Tags:Data, Tracking, Distributed, Algorithm