
Research On Reducing And Classifying Massive Data

Posted on: 2002-09-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S R Ye    Full Text: PDF
GTID: 1118360185995621    Subject: Computer software and theory

Abstract/Summary:
Supported by the National Hi-tech Programme projects "CIMS-oriented Data Warehouse and Data Mining" (863-511-946-01) and "Expert System on Fishery Information Analysis" (818-07-03), and by the National Natural Science Foundation project "Multi-strategy Knowledge Discovery in Databases" (69803010), this dissertation introduces machine learning in KDD and statistical learning theory, and then focuses on the problems that are pivotal in reducing and classifying massive data: lattice-based data reduction, dimensionality reduction for high-dimensional data, an architecture for multi-strategy massive-data mining, and a decision-tree drawing algorithm with a visualization method. A domain application is also provided. The main contributions of the dissertation are as follows.

(A) Lattice-based data reduction (LDR): data reduction decreases the size of the data while preserving the information needed for decision-making. The methodology of LDR is discussed in detail, and two lattice-based data reduction algorithms, INREDUCT and INREDUCTCLS, are proposed; the former is used for clustering and the latter for classification. Both produce hypertuples that lie between the minimal E-set and the maximal E-set, and a similar, or even better, decision can be drawn from the hyper-relation consisting of those hypertuples. A hypertuple h is a triple (|h|, {xdsp}, {child_i}), where |h| is the number of simple tuples in h, {xdsp} is the set of feature expressions, and {child_i} is the set of pointers to the tuples belonging to h. In feature space, a hypertuple is a hypercube that moves toward dense areas according to the boundary density in each dimension, so the hyper-relation is highly representative and generalizes well. The algorithms are scalable, quasi-optimal, and have nearly linear complexity.

(B) Efficient similarity computation for high-dimensional data: similarity is a pivotal notion in research on lazy learning, such as case-based reasoning and k-NN (k-nearest neighbors).
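The hypertuple triple (|h|, {xdsp}, {child_i}) described above can be sketched as a simple data structure. This is a hypothetical illustration, assuming numeric features whose expression set is an interval per dimension; the merge rule (smallest enclosing hypercube) is an assumption about how hypertuples grow, not the dissertation's exact algorithm:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Hypertuple:
    """A hypertuple h = (|h|, {xdsp}, {child_i}):
    count    -- |h|, the number of simple tuples covered by h
    xdsp     -- per-feature (low, high) intervals, i.e. the hypercube
    children -- indices of the simple tuples belonging to h
    """
    count: int
    xdsp: List[Tuple[float, float]]
    children: List[int] = field(default_factory=list)

def merge(a: Hypertuple, b: Hypertuple) -> Hypertuple:
    """Merge two hypertuples into the smallest hypercube covering both."""
    xdsp = [(min(al, bl), max(ah, bh))
            for (al, ah), (bl, bh) in zip(a.xdsp, b.xdsp)]
    return Hypertuple(a.count + b.count, xdsp, a.children + b.children)

# Two simple tuples represented as degenerate (point) hypertuples.
t1 = Hypertuple(1, [(0.2, 0.2), (1.0, 1.0)], [0])
t2 = Hypertuple(1, [(0.5, 0.5), (0.4, 0.4)], [1])
h = merge(t1, t2)
print(h.count, h.xdsp)  # 2 [(0.2, 0.5), (0.4, 1.0)]
```

Merging point tuples this way yields exactly the hypercube-in-feature-space view the abstract describes: each dimension's interval expands just enough to cover the member tuples.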
The dissertation focuses on reducing the complexity of similarity computation and introduces two algorithms: a partial-feature-based similarity computation and a projection-based similarity computation. For brevity and clarity, they are described as k-NN procedures: the partial-feature-based k-NN algorithm and the projection-based k-NN algorithm. In the distance-computation step, using only a few features improves efficiency. The improvement is remarkable in our experiments: the former speeds up computation by about 26% to 28%, and the latter by 48% to 83%.
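A minimal sketch of the partial-feature idea, assuming Euclidean distance on a caller-chosen feature subset (the dissertation's actual feature-selection criterion is not specified here, so `feature_idx` is a hypothetical parameter):

```python
import math
from typing import List, Sequence

def knn_partial(query: Sequence[float],
                data: List[Sequence[float]],
                feature_idx: Sequence[int],
                k: int) -> List[int]:
    """k-NN ranking distances on only the features in feature_idx,
    cutting the per-pair cost from O(d) to O(len(feature_idx))."""
    def dist(x: Sequence[float]) -> float:
        return math.sqrt(sum((x[j] - query[j]) ** 2 for j in feature_idx))
    ranked = sorted(range(len(data)), key=lambda i: dist(data[i]))
    return ranked[:k]

points = [(0.0, 0.0, 9.0), (1.0, 1.0, 9.0), (5.0, 5.0, 0.0)]
# Distance measured on the first two features only; the third is ignored.
print(knn_partial((0.9, 0.9, 0.0), points, [0, 1], k=1))  # [1]
```

The efficiency gain comes entirely from the shorter inner sum: fewer coordinates per distance means proportionally less work per candidate neighbor, which is the effect the experiments above quantify.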
Keywords/Search Tags: Data Mining, KDD, Machine Learning, Data Reduction, Classification, Visualization, (Domain) Lattice, Decision Tree, Prediction, Multi-strategy