Privacy Preserving Feature Selection In Distributed Environment

Posted on: 2014-01-19   Degree: Master   Type: Thesis
Country: China   Candidate: W Q Wan   Full Text: PDF
GTID: 2248330395483984   Subject: Computer application technology
Abstract/Summary:
With the development of network technology and improvements in computing power and storage capacity, datasets are growing rapidly in size. Data mining is necessary to extract valuable information from such data, and feature selection is one of the most important and frequently used preprocessing techniques for data mining. It reduces the number of features and removes irrelevant, redundant, or noisy data, with immediate benefits for applications: it speeds up the data mining algorithm and improves mining performance, such as predictive accuracy and the comprehensibility of results.

Privacy preservation is very important in data mining and has drawn growing concern as data mining has become widely used. How to select features effectively while preserving privacy is therefore a hot topic. However, most feature selection methods do not address privacy, for example when working with medical and financial records, which may lead to serious information security problems in data mining and pattern recognition. In addition, the data from various applications may be stored at multiple sites, and distributed computing technology has emerged to mine such large, distributed data. The purpose of this work is to develop a privacy-preserving distributed feature selection algorithm that protects the privacy of both the features and the data.

To preserve the privacy of features, a privacy-preserving feature selection algorithm based on PCA (Principal Component Analysis) and SVM-RFE is proposed; it combines the two techniques and optimizes the evaluation criterion using three methods. The simulation results indicate that the algorithm performs well: while selecting the important features, it minimizes the total amount of information contained in the selected feature subset.

To preserve the privacy of data, we present a new privacy-preserving distributed feature selection algorithm under the Map-Reduce framework, which combines three statistics (the Gini index, misclassification error, and entropy) with differential privacy. A theoretical analysis of the privacy guarantee is also presented. Simulation results on datasets from the UCI repository and on a synthetic dataset indicate that, while selecting the important features, the algorithm preserves private information to a certain extent at a lower time cost than its centralized counterpart.
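
The abstract does not spell out how the three statistics are combined with differential privacy, so the following is only a minimal sketch of one plausible arrangement, not the thesis's algorithm. It assumes categorical features, a trusted reducer that adds Laplace noise with scale 1/epsilon to each aggregated (feature, value, class) count, and the Gini index as the only statistic. All function names (`map_counts`, `reduce_counts`, `private_gini_scores`) are hypothetical. Because one record touches one cell in each feature's table, the total privacy cost under these assumptions composes across features (roughly the number of features times epsilon).

```python
import random
from collections import Counter


def map_counts(rows, labels, n_features):
    """Map step (hypothetical): one site counts (feature, value, class) triples."""
    counts = Counter()
    for row, y in zip(rows, labels):
        for j in range(n_features):
            counts[(j, row[j], y)] += 1
    return counts


def reduce_counts(partials):
    """Reduce step (hypothetical): sum the per-site counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def gini_of_split(table):
    """Weighted Gini impurity; table maps value -> {class: count}."""
    n = sum(sum(by_class.values()) for by_class in table.values())
    if n <= 0:
        return 0.0
    score = 0.0
    for by_class in table.values():
        n_v = sum(by_class.values())
        if n_v <= 0:
            continue
        impurity = 1.0 - sum((c / n_v) ** 2 for c in by_class.values())
        score += (n_v / n) * impurity
    return score


def private_gini_scores(total, n_features, values, classes, epsilon, rng):
    """Noise every cell of every feature's table, then score each feature.

    Assumption: the value and class domains are public, so empty cells
    are noised too. Negative noisy counts are clipped to zero.
    """
    scores = {}
    for j in range(n_features):
        table = {}
        for v in values:
            table[v] = {}
            for y in classes:
                noisy = total[(j, v, y)] + laplace(1.0 / epsilon, rng)
                table[v][y] = max(noisy, 0.0)
        scores[j] = gini_of_split(table)
    return scores


if __name__ == "__main__":
    rng = random.Random(0)
    site_a = map_counts([(0, 0), (1, 1)], [0, 1], n_features=2)
    site_b = map_counts([(0, 1), (1, 0)], [0, 1], n_features=2)
    total = reduce_counts([site_a, site_b])
    scores = private_gini_scores(total, 2, [0, 1], [0, 1], epsilon=1.0, rng=rng)
    # Lower Gini impurity means the feature separates the classes better.
    print(sorted(scores, key=scores.get))
```

In this sketch a lower score marks a more informative feature. Entropy or misclassification error could be computed from the same noisy table without spending additional privacy budget, since they are post-processing of the noised counts.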
Keywords/Search Tags: Privacy preserving, Feature selection, Distribution, Differential privacy, Principal component analysis