Active data mining in a distributed setting

Posted on: 2001-10-01
Degree: Ph.D.
Type: Dissertation
University: The University of Rochester
Candidate: Parthasarathy, Srinivasan
Full Text: PDF
GTID: 1468390014952785
Subject: Computer Science
Abstract/Summary:
Data Mining is a new interdisciplinary field merging ideas from statistics, machine learning, databases, and high performance computing. The key challenge is the extraction of knowledge and insight from massive databases in a fast and efficient manner. Most current work in data mining assumes that the data is static, and that on a database update or user interaction, information needs to be re-mined from scratch. Since mining in practice is a largely iterative process, re-executing the algorithm from scratch each time can result in an explosion in the computational and I/O resources required. Furthermore, with improvements in Internet technology and the rapid growth of the World Wide Web, many data mining applications are being cast in a client-server mold. In such a distributed environment, the problem of providing reasonable response times to an essentially interactive application is exacerbated by the communication latency between client and server.

In this dissertation we make two key contributions to address these problems. First, we outline a general strategy by which data mining algorithms can be made active, i.e., able to maintain valid mined information in the presence of user interaction and database updates. We accomplish this objective by maintaining a mining summary structure across database updates and user interactions. Accesses to the data are replaced with accesses to the summary structure, resulting in huge I/O and computational savings. We then describe and evaluate specific active mining solutions for four key mining tasks: discretization, association mining, sequence mining, and similarity discovery. In particular, for each of these tasks, we identify: (i) the nature of the summary information stored, either past or predictive, from which information can be accessed or mined efficiently; and (ii) the kind of data structure that should be used to store the information to facilitate efficient active mining.

Second, we describe a runtime framework that allows efficient caching and sharing of data among clients and servers. Traditional realizations of such interactive distributed applications employ some form of message passing or remote procedure call, and are inefficient for such applications. Our system, called InterAct, has been developed with such applications in mind, and with the goal of providing both ease of programming and efficiency. InterAct supports data sharing among distributed processes efficiently by allowing caching, by communicating only the modified data, and by allowing the coherence requirements to be relaxed on a per-client and per-data-structure basis.
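The idea of answering queries from a summary structure rather than the raw data can be illustrated with a minimal Python sketch for association mining. It is not the data structure developed in the dissertation: the class `ItemsetSummary`, its `max_size` parameter, and the sample transactions are hypothetical, chosen only to show how a database update is folded into the summary and how a changed support threshold is answered without rescanning the data.

```python
from collections import Counter
from itertools import combinations


class ItemsetSummary:
    """Hypothetical summary: support counts for all itemsets up to max_size."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self.counts = Counter()      # sorted item tuple -> transactions containing it
        self.num_transactions = 0

    def add(self, transaction):
        """Fold one new transaction into the summary (a database update)."""
        items = sorted(set(transaction))
        for k in range(1, self.max_size + 1):
            for itemset in combinations(items, k):
                self.counts[itemset] += 1
        self.num_transactions += 1

    def frequent(self, min_support):
        """Answer a query from the summary alone, without touching the data."""
        threshold = min_support * self.num_transactions
        return {s: c for s, c in self.counts.items() if c >= threshold}


summary = ItemsetSummary(max_size=2)
for t in [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]:
    summary.add(t)

first = summary.frequent(min_support=0.6)    # initial user query
summary.add(["milk", "butter"])              # database update
second = summary.frequent(min_support=0.5)   # user changes the threshold
```

Both queries read only the stored counts. The cost of this naive summary grows quickly with `max_size`, which is why the choice of what to store and in which data structure matters for each mining task.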
Keywords/Search Tags:Data, Mining, Distributed, Active, Structure