
Research On Data Mining Technology For Very Large Databases

Posted on: 2004-06-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Q Liu
Full Text: PDF
GTID: 1118360095957000
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, and especially the emergence of network technology, our ability to collect, store, and transfer data has improved dramatically. Despite this explosive growth of data, our need for decision-relevant knowledge remains unsatisfied. Knowledge discovery and data mining technology is an important approach to this problem. To be useful in real-world applications, high-performance mining algorithms and software platforms are badly needed. This dissertation focuses on efficient and scalable mining algorithms and software platforms that support knowledge discovery in distributed, heterogeneous, and very large databases.

This dissertation presents a novel algorithm called OpportuneProject. It differs fundamentally from earlier proposals in that it introduces new methods for building tree-based pseudo projections and array-based unfiltered projections of projected transaction subsets, which makes the algorithm efficient in CPU time and economical in memory. It opportunistically chooses among different structures to represent projected transaction subsets and heuristically decides which projection method to employ. The algorithm grows the frequent itemset tree mainly by depth-first search, using breadth-first search to build the upper portion of the tree when necessary. The empirical results show that the algorithm is not only the most efficient on both sparse and dense databases at all levels of support threshold, but also highly scalable to very large databases.

Because of the inherent complexity of the problem, mining the complete set of frequent patterns can be impractical. The alternatives are to mine the closed set or the maximal set of frequent patterns. A novel compound frequent itemset tree is proposed to enumerate the closed set of frequent patterns; it facilitates fast growth, efficient local pruning, and global subsumption checking of the search space. Fast hashing methods are also developed.
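
To make the terms above concrete, here is a minimal Python sketch of depth-first frequent itemset growth over projected transaction subsets, followed by a brute-force filter for closed and maximal patterns. It is an illustration only, not OpportuneProject or the compound frequent itemset tree, whose data structures are not specified in this abstract. The function names `mine_frequent` and `closed_and_maximal`, the plain-list projections, and the toy transactions are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Itemset = FrozenSet[str]


def mine_frequent(transactions: Sequence[Set[str]], min_support: int) -> Dict[Itemset, int]:
    """Grow frequent itemsets depth first, one projected subset per tree node.

    Hypothetical helper: projections are plain sorted lists, not the
    tree-based or array-based structures described in the abstract.
    """
    result: Dict[Itemset, int] = {}

    def grow(prefix: Itemset, projected: List[List[str]]) -> None:
        # Count each item inside the projected subset; the count equals the
        # support of prefix + {item} in the full database.
        counts: Dict[str, int] = {}
        for t in projected:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_support:
                continue
            itemset = prefix | {item}
            result[itemset] = counts[item]
            # Project: keep transactions containing `item`, and within them
            # only the items ordered after `item`, so no itemset repeats.
            child = [[x for x in t if x > item] for t in projected if item in t]
            grow(itemset, child)

    grow(frozenset(), [sorted(t) for t in transactions])
    return result


def closed_and_maximal(frequent: Dict[Itemset, int]) -> Tuple[Set[Itemset], Set[Itemset]]:
    """Return (closed, maximal) itemsets by direct comparison of all pairs."""
    closed: Set[Itemset] = set()
    maximal: Set[Itemset] = set()
    for s, support in frequent.items():
        supersets = [t for t in frequent if s < t]  # frequent proper supersets
        if not supersets:
            maximal.add(s)  # no frequent proper superset at all
        if all(frequent[t] < support for t in supersets):
            closed.add(s)  # no proper superset with the same support
    return closed, maximal


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
    freq = mine_frequent(db, min_support=2)
    closed, maximal = closed_and_maximal(freq)
    print(len(freq), len(closed), len(maximal))
```

Every maximal pattern is also closed, and the closed set is usually much smaller than the complete set, which is why mining it can be practical when full enumeration is not. The pairwise filter here is quadratic and only suitable for small examples.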
A new algorithm called CROP is designed to mine closed frequent patterns; its performance is maximized by balancing the overheads of tree growth and tree pruning. Building on it, an efficient algorithm called MOP is proposed to discover maximal frequent patterns; it combines closure checking with inclusion checking and employs lookaheads. CROP and MOP are more efficient and scalable than their counterparts.

An information-entropy-based method for partitioning quantitative intervals and qualitative values is presented, and a combined automatic and interactive approach to concept hierarchy formation is proposed. On this basis, multi-dimension, multi-level, multi-data-type association rules can be mined by constrained single-dimension, single-level Boolean algorithms, and a new algorithm, MDML-PP, is designed accordingly. To mine classification rules, a new algorithm called CRM-PP is also proposed; it pushes multiple minimum support thresholds into the frequent pattern discovery stage and generates rules in a single stage. MDML-PP and CRM-PP are one to three orders of magnitude more efficient than algorithms derived from Apriori and FPGrowth.

The second part of this dissertation is devoted to data mining software systems. One such system, SmartMiner, is proposed on the basis of the author's research results in data mining algorithms and expert systems. SmartMiner provides a mining definition language called MDL, a scripting language that describes mining scenarios, and integrates data warehousing functionality. Its mining engine has a degree of intelligence in that it employs heuristics to select algorithms and to adjust environment settings.

Finally, this dissertation presents cooperative mining software platforms for knowledge discovery in distributed, heterogeneous, and very large databases. A formal language for describing the blackboard and knowledge sources is proposed.
A blackboard system model called DBC-MA, based on a production system, is designed and implemented; it is the major component for distributed problem sol...
Keywords/Search Tags: knowledge discovery, data mining, association rules, classification rules, multi-level multi-dimension multi-data type rules, frequent patterns, closed frequent patterns, maximal frequent patterns, blackboard systems, distributed problem solving