Feature Selection Parallelization Based On Cloud Platform

Posted on: 2016-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Lu
Full Text: PDF
GTID: 2308330473464473
Subject: Software engineering
Abstract/Summary:
Machine learning is an active research topic in artificial intelligence and has also attracted great attention in industry. However, with the arrival of big data and its complex characteristics, traditional machine learning algorithms designed for small data sets perform poorly or become infeasible. Research on machine learning algorithms for large-scale data has therefore become a common focus of academia and industry. Feature selection is one of the key problems in machine learning for preprocessing high-dimensional data, so traditional feature selection algorithms need to be improved to meet the requirements of high-dimensional, large-scale data. With the emergence of cloud computing, parallel computing has become one of the most popular ways to handle large-scale data. We therefore combine a cloud platform with feature selection algorithms to deal effectively with high-dimensional, large-scale data. The main work of this thesis is as follows:
First, we designed the parallel feature selection algorithm D-logsf using Google's MapReduce programming model. The parallelization in D-logsf addresses two problems: the calculation of sample similarities and the gradient optimization algorithm. Experiments on real and synthetic data sets show that D-logsf has good reliability and scalability, and that it achieves approximately linear speedup compared with the traditional feature selection algorithm Logsf.
Second, we designed and developed the feature selection system RELIEFSYS, which is based on the browser/server model. The system is data-centered and offers data security, a user-friendly interface, interactivity, extensibility, and parallelization. Its most notable characteristic is that feature selection algorithms can be run in parallel through visual operations. In addition, we adopted a data-driven development model and provided an interface for registering other algorithms, so that the system can readily be extended with new algorithms.
Keywords/Search Tags:Feature Selection, Local Learning, MapReduce, Parallelization, System Development
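The abstract names gradient optimization as one of the two parts of D-logsf that were parallelized with MapReduce, but it does not give the details. The following is only a minimal sketch of the general pattern, not the thesis's algorithm: it assumes a plain logistic loss as a stand-in objective, and the function names (`partial_gradient`, `add_vectors`, `distributed_gradient`, `split`) are hypothetical. Each map call computes the gradient contribution of one data partition, and the reduce step sums those contributions.

```python
import math
from functools import reduce


def partial_gradient(weights, partition):
    """Map step: logistic-loss gradient summed over one data partition.

    Each sample is (features, label) with label in {-1, +1}.
    The logistic loss is an assumed stand-in objective.
    """
    grad = [0.0] * len(weights)
    for features, label in partition:
        margin = label * sum(w * x for w, x in zip(weights, features))
        # d/dw log(1 + exp(-margin)) = -label * x / (1 + exp(margin))
        coeff = -label / (1.0 + math.exp(margin))
        for j, x in enumerate(features):
            grad[j] += coeff * x
    return grad


def add_vectors(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [x + y for x, y in zip(a, b)]


def distributed_gradient(weights, partitions):
    """Full gradient as a map over partitions followed by a reduce."""
    partials = map(lambda part: partial_gradient(weights, part), partitions)
    return reduce(add_vectors, partials)


def split(data, n_parts):
    """Split the data set into n_parts roughly equal partitions."""
    return [data[i::n_parts] for i in range(n_parts)]
```

Because the gradient of a sum of per-sample losses is the sum of per-sample gradients, the result does not depend on how the samples are partitioned (up to floating-point rounding). That property is what lets the map calls run independently on separate workers in a real MapReduce job.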