Research On Efficient Feature Selection And Learning Algorithms For Big Data

Posted on: 2016-11-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J B Zhang
Full Text: PDF
GTID: 1108330485988597
Subject: Computer application technology
Abstract/Summary:
In recent years, the fast development of information and communication technologies, e.g., the Internet, the Internet of Things, cloud computing and tri-network convergence, has driven rapid data growth, which presents both challenges and opportunities to many industries. Our society has entered the era of big data. In the big data environment, feature selection and learning are especially important for mining the knowledge inherent in data for real applications: they not only address the curse of dimensionality, alleviate the problem of "information rich, knowledge poor" and reduce complexity, but also improve the understandability of the data. This thesis studies parallel large-scale feature selection, complex data fusion and efficient learning algorithms, and deep learning-based feature representation models, algorithms and applications. The main research work and innovations are summarized as follows.

Part I: Parallel Large-scale Feature Selection (Chapter 3)

A unified parallel framework for large-scale feature selection is presented, together with three corresponding parallel methods: model parallelism (MP), data parallelism (DP), and model-data parallelism (MDP). Heuristic feature selection, whose core is the calculation of feature significance measures, is chosen as the research object. A unified representation of feature evaluation functions is then presented. Furthermore, divide-and-conquer methods for four representative evaluation functions are shown, and MapReduce-based and Spark-based Parallel Large-scale Attribute Reduction (PLAR) algorithms are designed. Granular computing (GrC) theory is then introduced to accelerate the feature selection process, and by combining it with MDP, the algorithm PLAR-MDP is presented. Finally, experimental evaluation and analysis on UCI datasets and astronomical datasets are given.
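As an illustration of the divide-and-conquer idea behind data parallelism, the following minimal Python sketch computes one well-known rough-set evaluation function, the dependency degree (the fraction of objects in the positive region), from per-split counts that are merged afterwards. The choice of this measure, the function names (`map_partition`, `merge_counts`, `dependency`, `significance`) and the in-memory list of data splits are assumptions made for illustration; the abstract does not name the four evaluation functions, and this is not the PLAR implementation from the thesis.

```python
from collections import Counter, defaultdict
from functools import reduce


def map_partition(rows, features):
    """Map step (hypothetical helper): for one data split, count the
    decisions seen in each equivalence class induced by `features`.

    Each row is a pair (condition_values, decision)."""
    counts = defaultdict(Counter)
    for values, decision in rows:
        key = tuple(values[i] for i in features)
        counts[key][decision] += 1
    return counts


def merge_counts(left, right):
    """Reduce step: add up the per-class decision counts of two partial results."""
    merged = defaultdict(Counter)
    for part in (left, right):
        for key, decisions in part.items():
            merged[key].update(decisions)
    return merged


def dependency(partitions, features):
    """Dependency degree: share of objects whose equivalence class
    contains a single decision value (the positive region)."""
    partials = [map_partition(rows, features) for rows in partitions]
    total = reduce(merge_counts, partials, defaultdict(Counter))
    n = sum(sum(c.values()) for c in total.values())
    if n == 0:
        return 0.0
    consistent = sum(sum(c.values()) for c in total.values() if len(c) == 1)
    return consistent / n


def significance(partitions, selected, candidate):
    """Gain in dependency degree from adding `candidate` to `selected`."""
    extended = list(selected) + [candidate]
    return dependency(partitions, extended) - dependency(partitions, selected)
```

Because the per-class counts are additive, the result does not depend on how the rows are split, which is what allows the map step to run independently on each data block before a single merge.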
All experiments are carried out on big data computing platforms, e.g., Hadoop and Spark, and the results verify the effectiveness of the proposed algorithms. They also show that PLAR-MDP can maximize data processing performance by combining the MP, DP and GrC methods.

Part II: Complex Data Fusion & Efficient Learning Algorithm (Chapter 4)

Composite information systems are defined first. Then an extension of the rough set model, the composite rough set (CRS), is presented. It can process multiple different types of attributes simultaneously and provides a novel approach to complex data fusion. Calculating the rough approximations of a concept is a key step in rough set-based feature selection algorithms. To compute approximations efficiently, a basic vector is introduced. A matrix representation method for CRS approximations is then presented, and a matrix-based batch algorithm for computing CRS approximations is designed together with a parallel algorithm. Furthermore, Single-GPU and Multi-GPU parallel algorithms for computing CRS approximations are proposed. Finally, experiments on UCI and user-defined datasets verify the effectiveness of the proposed algorithms. They also show that the Multi-GPU algorithm achieves a high speedup on the GPU cluster, significantly improving performance.

Part III: Deep Learning-Based Feature Representation (Chapter 5)

A deep learning-based feature representation model, named SUGAR, is presented. It can learn data representations from both labeled and unlabeled examples. SUGAR employs a novel architecture to build deep networks, i.e., "auxiliary networks" + "main networks" + "bridge". The auxiliary networks can be easily embedded into conventional networks, where they provide sparsely-supervised guidance. The two networks are complementary, since the auxiliary network can learn discriminant features and the autoencoder-based main network can learn generative features.
At the same time, a sparsity penalty is employed to make SUGAR more robust and efficient. A mini-batch stochastic gradient descent algorithm is then proposed to train the SUGAR model. By combining SUGAR with DAE and CAE, two extended models, named "SUGAR with DAE" and "SUGAR with CAE" respectively, are presented. Furthermore, multiple SUGAR models are stacked to build a deep learning model, DeepSUGAR. Finally, experiments on the well-known digit classification problem and eight deep learning benchmark datasets verify the effectiveness of the proposed algorithms. They also show that the proposed deep learning model learns good and robust feature representations and consequently improves classification performance.

Part IV: Feature Learning for Astronomical Spectrum Recognition (Chapter 6)

The characteristics of astronomical spectra and the traditional approaches to spectrum recognition are reviewed. Based on the characteristics of stellar spectra, a deep learning-based feature representation model, named LLDL, is presented. It consists of multiple Maxout hidden layers and employs the Dropout technique to regularize the whole network. Furthermore, a stochastic gradient descent algorithm with momentum is proposed to train the LLDL model, and multi-core CPU and GPU implementations are then proposed. Finally, experiments on public astronomical big datasets (e.g., SDSS and LAMOST) verify the effectiveness of the proposed algorithm. Compared with other machine learning models, namely SVM, Logistic Regression and deep ReLU networks, LLDL achieves better classification performance and strong robustness to noise.
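The three standard ingredients named for LLDL (Maxout units, Dropout, and stochastic gradient descent with momentum) can be sketched in plain Python as follows. This is a generic illustration of those techniques, not the LLDL model itself: the function names, the inverted-dropout scaling and the default hyperparameters are assumptions, and no layer sizes or settings from the thesis are implied.

```python
import random


def maxout_layer(x, weights, biases):
    """Maxout forward pass: each unit outputs the maximum of several
    linear pieces. weights[i][j] is the weight vector of piece j of
    unit i, and biases[i][j] is the matching bias."""
    outputs = []
    for unit_w, unit_b in zip(weights, biases):
        pieces = [
            sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
            for w, b in zip(unit_w, unit_b)
        ]
        outputs.append(max(pieces))
    return outputs


def dropout(x, drop_prob, rng, training=True):
    """Inverted dropout (assumed variant), for 0 <= drop_prob < 1:
    zero each value with probability drop_prob and rescale the
    survivors so the expected value is unchanged."""
    if not training or drop_prob == 0.0:
        return list(x)
    keep = 1.0 - drop_prob
    return [x_k / keep if rng.random() < keep else 0.0 for x_k in x]


def momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One classical momentum update: v <- momentum * v - lr * g, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

In a training loop under these assumptions, `dropout` would be applied to a layer's input with `training=True`, `maxout_layer` would produce the hidden activations, and `momentum_step` would be called once per mini-batch with the gradients of the loss.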
Keywords/Search Tags: Feature Selection, Feature Learning, Big Data, Parallel Algorithm, Rough Set, Deep Learning, Autoencoder