Font Size: a A A

Study On Space Partitioning-based Optimized Clustering Algorithms And Related Techniques

Posted on:2006-12-10Degree:DoctorType:Dissertation
Country:ChinaCandidate:H L SunFull Text:PDF
GTID:1118360185477791Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data mining is a decision support approach that extracts hidden, unknown, potentially useful knowledge and pattern from huge volume of data. It is a new technique that appeared in 1990s for solving the problem of "rich data but poor information". To meet the requirement of discovering knowledge in very massive dataset efficiently, the algorithms of data mining are required to have excellent scalability, i.e. the runtime of algorithm should be predictable and acceptable.The space partitioning-based approach can deal with massive datasets efficiently, and is mainly used to clustering analysis and outlier detection. However, there are some problems in previous works. Firstly, the number of cells will increase exponentially varied with the size of data dimension, so they are generally used to deal with low-dimensional data. Secondly, for some space partitioning-based methods, such as the cell-based clustering algorithm, the finer the partition granularity is, then the higher clustering accuracy is, meanwhile the more cells are generated and the lower the efficiency of algorithm is. On the contrary, if the partition granularity is coarse, then the algorithm's accuracy should not to be satisfying.To solve the above issues, the thesis proposes a novel index structure named CD-Tree (Cell Dimension Tree for Streams). Without the loss of position relationship among the cells, CD-Tree only stores the non-empty cells to reduce the space complexity, which makes it very convenient to execute the procedures of clustering analysis and outlier detection. In the thesis, a definition of CD-Tree is given, related algorithms, such as tree construction, node deletion and tree merge, are developed, and the time complexity of the algorithms is analyzed and compared with other structures. In theory, for the special issue such as space partitioning, CD-Tree outperforms other structures.CD-Tree is more efficient when there are many empty cells generated under a partitioning. The number of empty cells is relevant to the degree of data skew, and for the more skewed dataset, the more empty cells will be generated. Naturally a measure...
Keywords/Search Tags:knowledge discovery, decision support, data mining, space partitioning, clustering analysis, data streams, data stream mining, sliding windows, evolution analysis, outlier detection
PDF Full Text Request
Related items