
Research On Data Reduction For Massive Data Based On Instances And Characters

Posted on: 2012-07-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Peng
Full Text: PDF
GTID: 1118330368984037
Subject: Computer system architecture
Abstract/Summary:
With the development of computer technology, recording and distributing information has become increasingly convenient, and information-collection technology has become increasingly advanced; the result is massive data, which takes more time to process. Many methods have been proposed to reduce the number of instances and the number of characters (features) per instance, and thereby the processing time and storage, but each has its own limitations and scope of applicability. Researching a universal method for reducing characters and instances, together with a reasonable way to evaluate the result, is therefore of great theoretical and practical significance.

By analyzing the limitations of current instance selection methods and the advantages of measuring the similarity between two data sets, this paper proposes a data reduction method based on local Hausdorff distance. The method uses the Hausdorff distance as the criterion for selecting representative instances. To reduce the cost of computing the Hausdorff distance, it uses a K-NN search to split the original data set into smaller data sets and then uses the Hausdorff distance to select representative instances within each smaller set.

To overcome the shortcomings of the traditional LLE algorithm when reducing the dimensionality of non-uniform data sets, the paper analyzes the impact of the error between an instance and its representation by adjacent instances. To improve the dimensionality reduction result, it introduces a self-adapting algorithm that calculates the parameter K so as to reduce this error, and an LLE-based data reduction scheme that uses a variable K. The paper also proposes a scheme that adjusts K dynamically according to the variation in local uniformity, which reduces the error between an instance and its adjacent representation and meets the needs of dimensionality reduction for non-uniform data sets.
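The local Hausdorff-distance selection described above can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the brute-force neighbour search, the choice of one representative per block, and all function names (`knn_block`, `select_representative`, `reduce_instances`) are assumptions made only for illustration.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def knn_block(data, seed, k):
    """The seed point plus its k nearest neighbours (brute-force search)."""
    return sorted(data, key=lambda p: math.dist(p, seed))[:k + 1]

def select_representative(block):
    """Assumed criterion: keep the instance whose singleton set has the
    smallest Hausdorff distance to the whole block."""
    return min(block, key=lambda p: hausdorff([p], block))

def reduce_instances(data, k):
    """Split the data into K-NN blocks and keep one representative per block."""
    remaining = list(data)
    representatives = []
    while remaining:
        block = knn_block(remaining, remaining[0], k)
        representatives.append(select_representative(block))
        remaining = [p for p in remaining if p not in block]
    return representatives

# Two well-separated clusters of three points each (points are tuples).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reduced = reduce_instances(data, k=2)
```

Splitting first keeps each Hausdorff computation confined to a block of at most k + 1 points, which is the cost saving the abstract attributes to the K-NN step.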
Finally, this paper presents a search algorithm based on a center point to improve the uniformity of an instance's near neighbors and to reduce the [...].

Because changes to the instances in a data set affect both the classification result and the statistical nature of the data set, the paper proposes a scheme that uses classification and spatial statistics to evaluate the result of a data reduction scheme. By analyzing the effect of class radius, the distance between classes, and the number of instances in the data set on classification accuracy, the paper gives a method to compute the classification, which can be used to evaluate the result of the classification scheme. Based on an analysis of the frequency distribution, the quantiles, and the distances between instances, the paper proposes a scheme to evaluate the similarity between data sets. Since spatial autocorrelation reflects the distribution of the instances in a data set, Moran's I is used to measure the autocorrelation and to evaluate how it changes between the original data set and the reduced data sets.

The research on data reduction based on character selection, instance selection, and reduction evaluation has both theoretical and practical value. These achievements also help improve the efficiency of processing massive data and of obtaining valuable information.
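For reference, Moran's I in its standard form can be computed as in the sketch below. The abstract does not say which spatial weights are used, so the weight matrix is left as an input; the function name `morans_i` and the chain-shaped example weights are assumptions for illustration only.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square weight matrix.

    I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean of the values and W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    total_weight = sum(sum(row) for row in weights)
    sum_sq = sum(d * d for d in dev)
    if total_weight == 0 or sum_sq == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / total_weight) * (cross / sum_sq)

# Assumed example weights: four instances on a line, each adjacent to its
# immediate neighbours (binary weights, zero diagonal).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], chain)        # neighbours are similar: positive
alternating = morans_i([1, -1, 1, -1], chain)  # neighbours differ: negative
```

Computing the statistic once on the original data set and once on each reduced data set, with weights built the same way, gives the kind of before-and-after comparison the abstract describes.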
Keywords/Search Tags: Instance, Characters, K-Nearest Neighbor, Uniformity, Hausdorff Distance, Search Algorithm, Reduction Evaluation