Mining decentralized data repositories

Posted on: 2002-05-06
Degree: Ph.D.
Type: Dissertation
University: University of Michigan
Candidate: Jensen, Viviane Crestana
Full Text: PDF
GTID: 1468390011990464
Subject: Computer Science
Abstract/Summary:
Technology for data mining, i.e., finding useful trends and patterns in large data repositories, has acquired significant importance with the increasing availability of online data. While such technology is typically applied to centrally stored data, real-life database design, management, and performance considerations call for mining decentralized data: data that consists of several tables, perhaps obtained via normalization or partitioning and allocation, stored in several repositories with possibly separate administration and schema. The few prior extensions of mining to such data rely on algorithms developed largely for parallel processing rather than on addressing the specific issues of decentralized data. Most approaches to mining decentralized data require the separate tables to be joined to form a single table.

In contrast, this dissertation presents techniques for mining decentralized data that do not require the join of all tables. The approach exploits foreign key relationships to develop decentralized algorithms that execute concurrently on the separate tables and thereafter merge the results. We develop our techniques using the specific example of association rule discovery. Important issues concerning the merging of partial results, the computation and memory requirements, and the associated costs and trade-offs are examined.

Several different decentralized strategies arise, and an algebra is presented that allows enumeration of the many different decentralized mining strategies, each with different processing costs. Based on this algebra, heuristics are developed that reduce the overall computation, I/O, and communication costs. When cost estimates are available for the basic operations, there is an opportunity to optimize for the best strategy in a manner similar to query processing. As such, our approach may be suitably integrated with available query processing algorithms for large-scale decentralized data mining.

Our decentralized approach is empirically validated, and in cases of interest it performs significantly better than the typical centralized approach. Several decentralized alternatives are implemented, and the heuristic rules are validated, i.e., are shown to choose optimal or nearly optimal plans. The decentralized approach presented in this dissertation may be adapted to different counting strategies, different storage structures, and incremental mining, and may exploit indices and summary data where available; some of these improvements are infeasible in a centralized approach.

This dissertation provides an approach to decentralized mining that establishes its feasibility and importance, and opens numerous new avenues for research in data mining.
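The abstract does not spell out the algorithms, so the following is only an illustrative sketch of the general idea of counting itemsets on separate tables and merging the results through foreign keys. It should not be read as the dissertation's actual method. The two-dimension schema, the sample rows, and all names (`fk_weights`, `local_counts`, `merge_counts`, `joined_counts`) are assumptions made for this example, and item labels are assumed to be distinct across tables. Each dimension row is weighted by the number of fact rows that reference it, which yields the same support the itemset would have in the joined table without materializing the join.

```python
from collections import Counter
from itertools import combinations


def fk_weights(fact, position):
    """Number of fact rows referencing each key in one foreign-key column."""
    return Counter(row[position] for row in fact)


def local_counts(table, weights, max_size=2):
    """Count itemsets inside one table, weighting every row by how many
    fact rows point at it (its support contribution in the joined table)."""
    counts = Counter()
    for key, items in table.items():
        weight = weights.get(key, 0)
        if weight == 0:
            continue  # row never appears in the join
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += weight
    return counts


def frequent(counts, min_support):
    """Itemsets whose count reaches the minimum support."""
    return {itemset for itemset, c in counts.items() if c >= min_support}


def merge_counts(fact, left, right, left_frequent, right_frequent):
    """Merge step: count cross-table itemsets with one pass over the fact
    table, combining only itemsets that were frequent locally."""
    counts = Counter()
    for left_key, right_key in fact:
        xs = [x for x in left_frequent if x <= left[left_key]]
        ys = [y for y in right_frequent if y <= right[right_key]]
        for x in xs:
            for y in ys:
                counts[x | y] += 1
    return counts


def joined_counts(fact, left, right, max_size):
    """Reference only: counts obtained by materializing the full join."""
    counts = Counter()
    for left_key, right_key in fact:
        row = sorted(left[left_key] | right[right_key])
        for size in range(1, max_size + 1):
            for combo in combinations(row, size):
                counts[frozenset(combo)] += 1
    return counts


# Hypothetical data: two dimension tables and a fact table of (cid, pid).
customers = {1: {"age=young", "city=AA"}, 2: {"age=old", "city=AA"}}
products = {10: {"type=book"}, 11: {"type=game"}}
fact = [(1, 10), (1, 11), (2, 10), (1, 10)]
min_support = 2

cust_counts = local_counts(customers, fk_weights(fact, 0))
prod_counts = local_counts(products, fk_weights(fact, 1))
cross_counts = merge_counts(
    fact, customers, products,
    frequent(cust_counts, min_support),
    frequent(prod_counts, min_support),
)
reference = joined_counts(fact, customers, products, max_size=3)
```

In this sketch the two `local_counts` calls are independent of each other and could run concurrently at separate sites. Restricting the merge to locally frequent itemsets is safe because any frequent cross-table itemset must have frequent parts in each table.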
Keywords/Search Tags: Data, Mining, Decentralized, Approach