
Research On Reducing And Classifying Massive Data

Posted on: 2002-09-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S R Ye    Full Text: PDF
GTID: 1118360185995621    Subject: Computer software and theory

Abstract/Summary:
Supported by the National Hi-tech Programme projects "CIMS-oriented Data Warehouse and Data Mining" (863-511-946-01) and "Expert System on Fishery Information Analysis" (818-07-03), and by the National Natural Science Foundation project "Multi-strategy Knowledge Discovery in Databases" (69803010), this dissertation introduces machine learning in KDD and statistical learning theory, and then focuses on the problems that are pivotal in reducing and classifying massive data: lattice-based data reduction, dimensionality reduction for high-dimensional data, an architecture for multi-strategy massive-data mining, and a decision-tree drawing algorithm with a visualization method. A domain application is also provided. The main contributions of the dissertation are as follows.

(A) Lattice-based data reduction (LDR): data reduction decreases the size of the data while preserving the information needed for decision-making. The methodology of LDR is discussed in detail, and two lattice-based data reduction algorithms, INREDUCT and INREDUCTCLS, are proposed; the former is used for clustering and the latter for classification. Both produce hypertuples that lie between the minimal E-set and the maximal E-set, and a similar, or even better, decision can be drawn from the hyper-relation consisting of those hypertuples. A hypertuple h is a triple (|h|, {xdsp}, {child_i}), where |h| is the number of simple tuples in h, {xdsp} is the set of feature expressions, and {child_i} is the set of pointers to the tuples belonging to h. In feature space, a hypertuple is a hypercube that moves toward dense areas according to the boundary density in each dimension, so the hyper-relation is highly representative and generalizes well. The algorithms are scalable, quasi-optimal, and have nearly linear complexity.

(B) Efficient similarity computation for high-dimensional data: similarity is a pivotal notion in research on lazy learning, such as case-based reasoning and k-NN (k-nearest neighbors).
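The hypertuple triple (|h|, {xdsp}, {child_i}) described above can be sketched as a simple data structure. This is a hypothetical illustration, assuming numeric features whose expression set is an interval per dimension; the merge rule (smallest enclosing hypercube) is an assumption about how hypertuples grow, not the dissertation's exact algorithm:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Hypertuple:
    """A hypertuple h = (|h|, {xdsp}, {child_i}):
    count    -- |h|, the number of simple tuples covered by h
    xdsp     -- per-feature (low, high) intervals, i.e. the hypercube
    children -- indices of the simple tuples belonging to h
    """
    count: int
    xdsp: List[Tuple[float, float]]
    children: List[int] = field(default_factory=list)

def merge(a: Hypertuple, b: Hypertuple) -> Hypertuple:
    """Merge two hypertuples into the smallest hypercube covering both."""
    xdsp = [(min(al, bl), max(ah, bh))
            for (al, ah), (bl, bh) in zip(a.xdsp, b.xdsp)]
    return Hypertuple(a.count + b.count, xdsp, a.children + b.children)

# Two simple tuples represented as degenerate (point) hypertuples.
t1 = Hypertuple(1, [(0.2, 0.2), (1.0, 1.0)], [0])
t2 = Hypertuple(1, [(0.5, 0.5), (0.4, 0.4)], [1])
h = merge(t1, t2)
print(h.count, h.xdsp)  # 2 [(0.2, 0.5), (0.4, 1.0)]
```

Merging point tuples this way yields exactly the hypercube-in-feature-space view the abstract describes: each dimension's interval expands just enough to cover the member tuples.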
The dissertation focuses on reducing the complexity of similarity computation and introduces two algorithms: a partial-feature-based similarity computation and a projection-based similarity computation. For brevity and clarity, they are described as k-NN procedures: the partial-feature-based k-NN algorithm and the projection-based k-NN algorithm. In the distance-computation step, using only a few features improves efficiency. The improvement is remarkable in our experiments: the former speeds up computation by about 26% to 28%, and the latter by 48% to 83%.
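A minimal sketch of the partial-feature idea, assuming Euclidean distance on a caller-chosen feature subset (the dissertation's actual feature-selection criterion is not specified here, so `feature_idx` is a hypothetical parameter):

```python
import math
from typing import List, Sequence

def knn_partial(query: Sequence[float],
                data: List[Sequence[float]],
                feature_idx: Sequence[int],
                k: int) -> List[int]:
    """k-NN ranking distances on only the features in feature_idx,
    cutting the per-pair cost from O(d) to O(len(feature_idx))."""
    def dist(x: Sequence[float]) -> float:
        return math.sqrt(sum((x[j] - query[j]) ** 2 for j in feature_idx))
    ranked = sorted(range(len(data)), key=lambda i: dist(data[i]))
    return ranked[:k]

points = [(0.0, 0.0, 9.0), (1.0, 1.0, 9.0), (5.0, 5.0, 0.0)]
# Distance measured on the first two features only; the third is ignored.
print(knn_partial((0.9, 0.9, 0.0), points, [0, 1], k=1))  # [1]
```

The efficiency gain comes entirely from the shorter inner sum: fewer coordinates per distance means proportionally less work per candidate neighbor, which is the effect the experiments above quantify.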
Keywords/Search Tags: Data Mining, KDD, Machine Learning, Data Reduction, Classification, Visualization, (Domain) Lattice, Decision Tree, Prediction, Multi-strategy