
The Research On A Few Key Issues In High Dimensional Data Mining

Posted on: 2004-07-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Z Yang
Full Text: PDF
GTID: 1118360095462830
Subject: Computer software and theory
Abstract/Summary:
Data mining refers to extracting implicit, previously unknown, and useful knowledge from large amounts of data. It is one of the research frontiers in the fields of databases and decision support systems (DSS). High dimensional data are frequently encountered in data mining applications, for example transaction data, term-frequency data, rating data, Web usage data, and multimedia data. The prevalence of such data makes research on high dimensional data mining very important. However, mining high dimensional data is extraordinarily difficult because of the curse of dimensionality, so special techniques must be adopted.

The performance of similarity indexing structures degrades rapidly in high dimensions. In low dimensional spaces, the Lp-norm is often used to measure the proximity between two points, but in many cases this notion of proximity is no longer meaningful in high dimensional space. These issues pose two challenges for high dimensional data mining: the performance of data mining algorithms degrades, and many distance-based and density-based algorithms may no longer be effective. These problems can be addressed by the following methods: 1) Map the data from the high dimensional space to a lower dimensional space by dimensionality reduction, then process them as lower dimensional data. 2) Improve the performance of mining algorithms by designing more effective indexing structures and adopting incremental and parallel algorithms, among other measures. 3) Redefine some concepts in a way that is meaningful for high dimensional domains.

Similarity search, cluster analysis, and outlier detection in high dimensional data mining, as well as collaborative filtering in e-commerce, are studied in this dissertation. We point out the effects of high dimensionality on these domains and present methods to address them.
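The loss of contrast under the Lp-norm described above can be illustrated with a small experiment. The following is a minimal sketch, not taken from the dissertation: it assumes points drawn uniformly from the unit hypercube, and the function name `relative_contrast` and its parameters are hypothetical. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import random


def relative_contrast(dim, n_points=200, p=2, seed=0):
    """Return (d_max - d_min) / d_min for Lp distances from one random
    query point to n_points random points in the unit hypercube.

    Assumption for illustration only: uniformly distributed data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Lp-norm of the difference vector
        dist = sum(abs(a - b) ** p for a, b in zip(query, point)) ** (1.0 / p)
        dists.append(dist)
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the printed contrast shrinks as the dimensionality grows, which is why a "nearest" neighbor becomes hard to distinguish from any other point.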
This research has considerable theoretical and practical significance. The main contributions are summarized here:
(1) A new function, Hsim(), for measuring the proximity of objects in high dimensional spaces is presented, based on an analysis of the characteristics of high dimensional data. The function not only avoids the non-contrasting behavior of distances that the Lp-norm exhibits in high dimensional space, but also applies to both binary and numerical data. Hsim() is also compared with other similarity functions.
(2) Based on the characteristics of quantitative transaction data, a new signature-table-based method for similarity search on such data is presented. Experiments demonstrate that the method has very good pruning efficiency and can therefore greatly speed up similarity search.
(3) A new algorithm for ratings-based collaborative filtering is put forward. Preliminary experiments on a number of real data sets show that the new method can improve both the scalability and the quality of collaborative filtering.
(4) Algorithms for clustering high dimensional data are analyzed, and a framework for similarity-based cluster analysis of high dimensional data is presented.
(5) The effects of high dimensionality on outlier detection algorithms are analyzed, and the concept of projected outlier detection is introduced. The first incremental outlier detection algorithm, IncLOF, is presented and compared with the LOF algorithm. Results on synthetic data sets demonstrate that the runtime of IncLOF is much lower than that of LOF in dynamic, high dimensional environments where high dimensional indexing structures are ineffective.
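For orientation, the LOF baseline that IncLOF is compared against can be sketched as follows. This is the standard batch Local Outlier Factor computation done by brute force, not the IncLOF algorithm, whose details are not given in this abstract. The names `euclidean` and `lof_scores` are hypothetical, and the sketch assumes the data contain no more than k exact duplicates of any point.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lof_scores(points, k):
    """Brute-force batch LOF; every call recomputes all pairwise distances.

    Assumption: no point has more than k exact duplicates, so the
    reachability sums below are never zero.
    """
    n = len(points)
    k_dist = []      # distance from each point to its k-th nearest neighbor
    neighbors = []   # k-distance neighborhood (ties included) as (dist, index)
    for i in range(n):
        dists = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        kd = dists[k - 1][0]
        k_dist.append(kd)
        neighbors.append([(d, j) for d, j in dists if d <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    # A 4x4 grid of inliers plus one distant point.
    data = [(x, y) for x in range(4) for y in range(4)] + [(10, 10)]
    scores = lof_scores(data, k=3)
    print(max(range(len(data)), key=lambda i: scores[i]))
```

Scores near 1 indicate points whose local density matches that of their neighbors, while clearly larger scores indicate outliers. Because this batch version starts from scratch whenever the data change, it shows the cost that an incremental algorithm aims to avoid in a dynamic setting.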
Keywords/Search Tags: data mining, high dimensional data, proximity measurement function, collaborative filtering, cluster analysis, outlier detection