
Instance Reduction With Granular Computing Based Data Importance Labeling

Posted on: 2020-10-28    Degree: Master    Type: Thesis
Country: China    Candidate: L Liu    Full Text: PDF
GTID: 2428330596977374    Subject: Control engineering
Abstract/Summary:
Nowadays, the amount of data generated in different fields is growing exponentially. However, the processing performance of instance-based machine learning struggles under this growth, and the large storage cost of big data also needs to be addressed. Instance reduction is therefore one of the hot topics in large-scale data processing. Many existing instance reduction algorithms struggle with the trade-off among computational complexity, reduction rate, and learner performance on the reduced datasets, especially for large-scale datasets. Motivated by this, instance reduction algorithms with granular computing based data importance labeling are studied here. The main research contents are as follows:

(1) Fast data reduction with granulation based instance importance labeling: Drawing on research results of granular computing in the field of feature selection, we propose a fast data reduction algorithm with granulation based instance importance labeling (FDR-GIIL). The original dataset is first mapped into a lower-dimensional space and granulated into K granules by applying K-means; the importance of each instance in every granule is then labeled based on its Hausdorff distance, and instances whose importance values are lower than an experimentally tuned threshold are selected for deletion. Furthermore, the crowding degrees of instances with the same data importance are calculated, and the less crowded instances are retained in the reduced subset, so that well-distributed samples are preserved. The presented algorithm is applied to 18 datasets of different sizes from the UCI Repository, and its strong performance in classification accuracy, size reduction rate, and running time is illustrated by comparison with seven other data reduction methods. The experimental results demonstrate that the proposed algorithm can greatly reduce the computational cost while achieving higher classification accuracy when the reduction size is the same across all compared algorithms.

(2) Improved data reduction combining noise deletion and feature selection: Although FDR-GIIL can quickly reduce instances, the classification accuracies on large-scale datasets still need further improvement. Therefore, noise deletion and feature selection are combined with the FDR-GIIL algorithm to enhance the performance of data reduction (EPDR). First, the edited nearest neighbor (ENN) rule is used to remove noisy instances from the initial dataset, and a granulation mapping based on principal component analysis (PCA) is proposed; the Euclidean distance and the Value Difference Metric (VDM) are then mixed to calculate instance importance. EPDR is applied to the popular datasets and compared with FDR-GIIL as well as a popular data reduction method. The experimental results show that the proposed algorithm can effectively enhance the classification accuracy on the reduced datasets within an acceptable running time.

The proposed fast data reduction with granulation based data importance labeling uses a 'divide and conquer' strategy to label data importance, so it can quickly remove most of the unimportant data from the original dataset; FDR-GIIL has a clear advantage in reducing computational cost. The improved data reduction combining noise deletion and feature selection builds further on FDR-GIIL: the performance of data reduction is enhanced by ENN denoising, PCA dimensionality reduction, and importance labeling based on a mixed distance calculation.
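The granulation based importance labeling described in (1) can be outlined in code. The following is a minimal sketch, not the thesis implementation: it assumes PCA for the lower-dimensional mapping, approximates the Hausdorff-distance importance by each instance's distance to its granule centroid, replaces the experimentally tuned threshold with a simple per-granule quantile, and omits the crowding-degree tie-breaking step. The function and parameter names (granulation_importance_reduction, n_granules, keep_quantile) are illustrative, not taken from the thesis.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


def granulation_importance_reduction(X, n_granules=10, n_components=2,
                                     keep_quantile=0.5, random_state=0):
    """Return indices of retained instances from data matrix X (n_samples x n_features)."""
    # 1. Map the data into a lower-dimensional space (assumed here to be PCA).
    X_low = PCA(n_components=n_components,
                random_state=random_state).fit_transform(X)

    # 2. Granulate the mapped data into K granules with K-means.
    km = KMeans(n_clusters=n_granules, n_init=10,
                random_state=random_state).fit(X_low)
    labels, centers = km.labels_, km.cluster_centers_

    keep = []
    for g in range(n_granules):
        idx = np.where(labels == g)[0]
        # 3. Label importance inside the granule; here the distance to the
        #    granule centre stands in for the Hausdorff-distance labeling.
        importance = np.linalg.norm(X_low[idx] - centers[g], axis=1)
        # 4. Delete low-importance instances; the threshold is a per-granule
        #    quantile rather than the experimentally tuned value in the thesis.
        threshold = np.quantile(importance, 1.0 - keep_quantile)
        keep.extend(idx[importance >= threshold])
    return np.array(sorted(keep))

For an n_samples-by-n_features matrix X, the returned index array selects the reduced subset; under these assumptions, keep_quantile roughly controls the size reduction rate.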
Keywords/Search Tags: instance reduction, granular computing, data importance labeling, K-means, mixed distance calculation