Active data mining in a distributed setting

Posted on:2001-10-01

Degree:Ph.D

Type:Dissertation

University:The University of Rochester

Candidate:Parthasarathy, Srinivasan

GTID:1468390014952785

Subject:Computer Science

Abstract/Summary:

Data mining is a new interdisciplinary field merging ideas from statistics, machine learning, databases, and high performance computing. The key challenge is the extraction of knowledge and insight from massive databases in a fast and efficient manner. Most current work in data mining assumes that the data is static, and that on a database update or user interaction, information needs to be re-mined from scratch. Since mining in practice is a largely iterative process, re-executing the algorithm from scratch each time can result in an explosion in the computational and I/O resources required. Furthermore, with improvements in Internet technology and the rapid growth of the World Wide Web, many data mining applications are being cast in a client-server mold. In such a distributed environment the problem of providing reasonable response times to an essentially interactive application is exacerbated by the communication latency between client and server.

In this dissertation we make two key contributions to address these problems. First, we outline a general strategy by which data mining algorithms can be made active, i.e., able to maintain valid mined information in the presence of user interaction and database updates. We accomplish this objective by maintaining a mining summary structure across database updates and user interactions. Accesses to the data are replaced with accesses to the summary structure, resulting in huge I/O and computational savings. We then describe and evaluate specific active mining solutions for four key mining tasks: discretization, association mining, sequence mining, and similarity discovery. In particular, for each of these tasks, we identify: (i) the nature of summary information stored, either past or predictive, from which information can be accessed/mined efficiently; and (ii) the kind of data structure that should be used to store the information to facilitate efficient active mining.

Second, we describe a runtime framework that allows efficient caching and sharing of data among clients and servers. Traditional realizations of such interactive distributed applications employ some form of message passing or remote procedure call, and are inefficient for such applications. Our system, called InterAct, has been developed with such applications in mind, and with the goal of providing both ease of programming and efficiency. InterAct supports data sharing among distributed processes efficiently by allowing caching, by communicating only the modified data, and by allowing the coherence requirements to be relaxed on a per-client and per-data-structure basis.
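The first contribution, replacing accesses to the data with accesses to a summary structure, can be illustrated with a minimal Python sketch for association mining. This is not the dissertation's actual structure or algorithm: the class name `ItemsetSummary`, its methods, and the choice to store support counts for all itemsets up to a fixed size are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


class ItemsetSummary:
    """Hypothetical summary structure (an assumption, not the author's design):
    support counts for every itemset of up to max_size items."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self.counts = Counter()
        self.num_transactions = 0

    def _itemsets(self, transaction):
        # Enumerate all itemsets of size 1..max_size in a canonical (sorted) order.
        items = sorted(set(transaction))
        for size in range(1, self.max_size + 1):
            yield from combinations(items, size)

    def add(self, transaction):
        # Database insertion: update the summary only, no rescan of old data.
        self.counts.update(self._itemsets(transaction))
        self.num_transactions += 1

    def remove(self, transaction):
        # Database deletion: decrement the counts this transaction contributed.
        self.counts.subtract(self._itemsets(transaction))
        self.num_transactions -= 1

    def frequent(self, min_support):
        # User interaction: answer a new support threshold from the summary alone.
        if self.num_transactions == 0:
            return {}
        threshold = min_support * self.num_transactions
        return {s: c for s, c in self.counts.items() if c > 0 and c >= threshold}


if __name__ == "__main__":
    summary = ItemsetSummary(max_size=2)
    for t in [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]:
        summary.add(t)
    print(summary.frequent(0.5))   # query at one threshold
    summary.remove(["a", "c"])     # database update
    print(summary.frequent(0.5))   # same query, answered without re-mining
```

In this sketch, both a changed support threshold and a database update are served from the stored counts, which is the sense in which the mined information stays valid without re-mining from scratch. Storing every small itemset is a simplification; it trades memory for simplicity and would not scale to large item vocabularies.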

Keywords/Search Tags:

Data, Mining, Distributed, Active, Structure

Related items

1	Research On Method Of Video Structure Mining Based On Content
2	The Design And Realization Of Active Pushing System Of Personalized Consumption Information Based On Data Mining
3	Research On High Efficient Data Mining Algorithm Under The Distributed Environment
4	Distributed Active Defense System
5	Applications Of Data Mining For The Competitive Intelligence System In The Enterprise
6	Study On The Application For Active Incremental Data Mining In The On-Line Analytical Mining Model
7	The Design And Implementation Of Zmining Data Mining System Data Structure
8	Study Of Distributed Data Mining Architecture Based On Grid
9	Study On Data Mining Based Network Intrusion Detection System
10	Research And Application Of Distributed Textmining Based On Feature Learning