Font Size: a A A

Research On Incremental Mining Algorithm And Application For Dynamic Databases

Posted on:2008-10-31Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y H DongFull Text:PDF
GTID:1118360215493960Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Traditional data mining discovers knowledge from stationary database.However, data warehouse is frequently changing. Because incremental data leadsto old knowledge which is mined formerly unavailable, the discovered knowledgeand patterns need to be maintained and updated dynamically. The essentialdifference of mining between dynamic database and stationary database is thatthe incremental transactions are more interested by users. Only mining addeddata based on mined results previously instead of remining whole database, theincremental algorithms are cost saving of knowledge maintenance.Because of massive data in web log, users' access patterns are changed withincremental log records. For web site managers, it is important to discover thetrend of users' access pattern. Novel efficient and effective mining algorithms arestudied for the massive dynamic data from web log.This dissertation aims to research dynamic characteristics of web access anddiscover rapid and efficient incremental mining approaches from them. It focuseson the theory and methods of clustering, classification and association rules basedon web log.In this dissertation, the notions of stationary database, dynamic database andreal-time database are introduced firstly. After reviewing state of art ofincremental mining technology and web usage mining, we pay more attention onfuzzy hierarchical clustering algorithm, neural network models, maximumfrequent patterns mining based on data partition and its incremental methods. Themain contribution of this dissertation is as follows:1. Using fuzzy sets theory, a hierarchical clustering algorithm based on fuzzygraph connectedness(FHC) is proposed, which first partitions the datasets intoseveral sub-clusters using a partitioning method, and then constructs a fuzzygraph of sub-clusters by analyzing the fuzzy-connectedness degree amongsub-clusters. By computing theλcut graph, the connected components ofthe fuzzy graph can be obtained, hence resulting the desired clustering. Thealgorithm can be performed in high-dimensional data sets, finding clusters ofarbitrary shapes such as the spherical, linear, elongated or concave ones.2. Also rendered is the incremental algorithm-IFHC applicable to periodicallyincremental environments and another extended method for asymmetricdataset-PFHC facing massive data. After analyzing affected neighborhoodsets of changed data, IFHC can deal with incremental data efficiently. Aimingto asymmetric dataset or massive data with less memory, PFHC is proposed based on FHC. Not only can FHC, IFHC and PFHC handle data withnumerical attributes, but with categorical attributes as well. The results of ourexperimental study for data sets with arbitrary shape and size are veryencouraging. The investigation demonstrates that the proposed methodgenerates better quality clusters than traditional algorithms, and scales up wellfor large databases.3. By combination of advantages of adaptive resonance theory and thecompetitive neural networks, a network model named SIN model whichovercomes the sensibility to the noise ART exists is proposed. It can eitherhold the former memory, or memorize the new information. Hebb learningrule is used in the implicit layer resulting in fast learning speed, goodcapability in classification and learning on-line.4. In traditional counter propagation network and its learning rules, overmanyneurons in hidden layer lead to "dead neurons" while fewer neurons make thenetwork unstable. In allusion to this problem, an adaptive counter propagationnetwork based on soft competition named ASCPN and its learning rules areintroduced. In ASCPN, soft competitive mechanism is presented instead ofhard competitive mechanism, and the number of competitive neurons isdecided adaptively. Firstly, the neurons in competitive layer are generateddynamically, so each neuron in competitive layer can do its best in training.Then soft competitive mechanism is employed in hidden layer, which makesmore than one neurons in competitive layer have deferent signals. Because theefficiency of neurons in competitive is improved sufficiently, ASCPN canwork well with the least amount of neurons and come true the requiredcapability of network.5. In order to mine association rules more effectively, we proposed anassociation rule mining method based on Clustering Partition-PARUC and itsincremental method-IPARUC. Because FP-tree from massive data cannot beentirely put in memory, a rapid clustering method is used firstly to dividedatabase into n parts, where the data are similar in the same part. A newstructure called AFP-tree, which can hold all items of transactions, isproposed. It is adjusted dynamically in inserting process so that nodes inAFP-tree can be shared to approaching optimization. When a new transactionsinserting, exchange of nodes is achieved by "pruning and laying back" to keepthe frequency order. Then local frequent itemsets mined from each localdataset are merged into global frequent itemsets.6. After bringing forward a framework of incremental analyse system for weblog mining, this dissertation introduces a prototype-Weblog Analyzer which integrates the forenamed core contents. After pretreatment of massive web log,it can deal with association rules, classification and clustering. Anotheradavantage of this prototype is that it can deal with incremental web log andestablish incremental mining model by updating knowledge databasedynamically. It also provides the effective decision-making for the webmasterto devise the individual web site.Finally, this dissertation summarizes the author's works and discusses thefuture work.
Keywords/Search Tags:data mining, dynamic database, Web log, usage minig, incremental mining, clustering, classification, association rules, fuzzy theory, hierarchical clustering, neural network, soft competition, FP-tree
PDF Full Text Request
Related items