Font Size: a A A

Research Of Data Mining Method On Uncertain Data

Posted on:2013-09-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:J DongFull Text:PDF
GTID:1228330392454859Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In recent years, uncertain data has been more widely appreciated. In many areas, suchas the economic, financial, telecommunications, logistics and other areas of theapplication, the prevalence of uncertain data and it often plays a key role. Uncertain datamining also became a very important research topic in data mining. Existing uncertaindata mining algorithms were analyzed, from the frequent pattern mining and clusteringanalysis of two angles proposed new algorithms to improve the corresponding algorithmsefficiency of the implementation of the data mining algorithm.Most of algorithms based on tree structure for mining frequent pattern on uncertaindata streams always store a large number of tree nodes, and record the correspondinginformation of data streams which can cause massive information storages. To solve thisproblem, an algorithm based on compressed tree and bit vector table for mining frequentpatterns on uncertain data streams is proposed. The uncertain data streams are initializedto probability-vector table, in the table, the items are represented by transactions, andunlike other bit vector tables the occurrence probabilities of items are stored in it. We alsopropose compressed tree in which the items with different probabilities are stored in thesame tree nodes, then the items and its probability in the tree node correspond to the bitvector table are converted into binary bit vector. Future more, each leaf node of the tree isconnected to an array which is used to store the combination of all items and theirexpected support in the path, what’s more, the leaf nodes are stored in the LeafList. Finally,we scan the arrays that are linked to the leaf nodes in the LeafList and compare theexpected support that is stored in the array with a minimum support threshold minSup toget all the frequent itemsets.Algorithms based on row enumeration always scan and construct conditionaltransposed tables, which increases the execution time and space cost. To solve thisproblem, we adopt the FPDAG (Frequent patterns Directed Acyclic Graph) to compressthe dataset to save the memory space. In FPDAG, each node is related to a rowid, and eachtwo nodes have a corresponding directed edge which stores the common items of the two rowids. Each row is given an integer according to its coming order and the FPDAGfollows that order. A directed acyclic graph records the relation between rows and items bydoing AND(&) operation with the nodes’ binary code of the edges. In algorithm, we adoptthe BitTable to compress the dataset firstly, and then construct FPDAG according to theBitTable. We increase the same items of the adjacent edges to implement pattern growth,traverse the DAG in reversal way and adopt a close-checking method to generate allfrequent closed itemsets.Most existing vulnerability taxonomy classifies vulnerabilities by their idiosyncrasies,weaknesses, flaws and faults et al. The disadvantage of the taxonomy is that theclassification standard is not unified and there is overlap classification phenomenon invulnerability taxonomy. To solve the problem, an algorithm virtual grid-based clusteringof uncertain data on vulnerability database, is proposed. The algorithm defines a virtualgrid structure, the cells are divided into real cells and virtual cells, but only the real cellswhich contain data objects stored in memory. The probability attribute value similarity isdefined, which compares the number of non-numeric attributes with the same valuebetween tuples to measure the similarity. We provide a secondary partition algorithm toimprove the similarity between the tuples in the same cell, the algorithm merges a tupleinto it’s high-density neighbor cell which has the maximum value of probability attributevalue similarity with it. Then, a novel identify cluster algorithm is provided to cluster thehigh-density real cells. It can identify clusters of arbitrary shapes by traversing real cellstwice.Due to the randomness of the partition of grids, the edge points of clusters might bepartitioned into the sparse grids. These points would become noise information out ofclusters when we cluster data stream by grid-density based algorithm. To solve theproblem, this paper proposes SDGCStream, a data stream clustering algorithm based onspatial directed graph with core, which uses the spatial directed graph and the orthocenterof the sparse grids to handle the edge points of clusters. At first, the algorithm defines astructure SDGC (spatial directed graph with core) to store the summary statistics of datastream, the border grids of clusters can be gotten according to the edge information ofSDGC; then the vertices of SDGC are maintained as the stream arriving; when the clustering quest comes, the edge information is generated, and the first clustering resultsare got through clustering on SDGC, then we judge whether the points of sparse gridswhich are adjacent to the border of a cluster belong to the cluster, according to theorthocenter information and the border vertices of SDGC, at last, a strategy based on thedistance between clusters is presented to adjust the clustering results after handling theborder of clusters.Through the analysis of experimental results show that frequent patterns andclustering in this paper, the uncertain data mining algorithms to improve the accuracy ofthe traditional similar algorithms in frequent pattern mining efficiency and similaritymeasure. An algorithm based on compressed tree and bit vector table, an algorithm basedon frequent patterns Directed Acyclic Graph, fictitious grid-based clustering of uncertaindata, a data stream clustering algorithm based on spatial directed graph with core, theperformance of the algorithm has improved, and better scalability.
Keywords/Search Tags:Data mining, Data streams, Uncertain data, Frequent patterns, Clustering, Grid
PDF Full Text Request
Related items