Research On Spark Accelerated Stable And Streaming Feature Selection Algorithms

Posted on:2019-03-17

Degree:Master

Type:Thesis

Country:China

Candidate:L W Fan

Full Text:PDF

GTID:2428330545971548

Subject:Engineering

Abstract/Summary:

Processing high-dimensional data has long been a difficult problem in data mining. Traditional data mining methods often use every feature value in their computations. This is feasible for data with few features, but it runs into many problems as dimensionality grows, such as the "curse of dimensionality". In today's big data era, data dimensionality keeps increasing, and how to process high-dimensional data efficiently is a research focus for many scholars. The most common approach to high-dimensional data is dimensionality reduction, and a commonly used dimensionality reduction method is feature selection.

Research on feature selection algorithms has produced many results, such as the Relief algorithm. Most feature selection algorithms focus on improving classification performance. Fewer studies address the stability of feature selection, which is an important issue in high-dimensional data mining. A feature selection algorithm is stable if a slight perturbation of the data set does not significantly change the selected feature subset. Research on improving the stability of feature selection has made some progress in recent years. This thesis implements and studies two new stable feature selection algorithms, IW-Relief and FREL, and verifies their stability. However, many methods that improve stability do not reduce the time complexity of the algorithm, and some, such as IW-Relief, need additional time to achieve stability. Many applications impose strict requirements on the time cost of an algorithm, or would benefit from a shorter running time, so a new solution is needed to reduce the time overhead.

Another important topic in feature selection is feature selection for streaming data. Much of the data generated on the Internet is streaming data, such as financial information, messages, and access logs, and most of it requires real-time processing. Most existing streaming feature selection algorithms process data serially, so parallelizing them is also a significant problem.

This thesis therefore studies stable feature selection and streaming feature selection based on Apache Spark, an open-source distributed computing framework. Thanks to its strong computing performance and comprehensive data processing components, Spark has been widely used in data mining, machine learning, and other fields in recent years. Combining the framework with the corresponding algorithms allows them to run in parallel, which can effectively speed them up. Two stable feature selection algorithms (IW-Relief and FREL) and one streaming feature selection algorithm (SAOLA) are implemented on the Spark platform and evaluated experimentally on 14 public data sets. Many factors affect running time on Spark, such as the numbers of workers, partitions, and executors; each was tested in detail and its effect on running time was recorded. The results of each experiment are analyzed to identify the advantages and disadvantages of each algorithm and the causes of the differences, and the two parallelized algorithms are also compared with each other.

The experimental results show that the Spark-based stable feature selection algorithms greatly improve running efficiency: the maximum speedup ratio reaches 8, and the parameter with the greatest effect on Spark running time is the number of partitions. Experiments on Spark-based streaming feature selection show a speedup ratio between 1.4 and 1.6.
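To make the notion of stability used in the abstract concrete, the following is a minimal single-machine sketch, not the thesis's implementation. It assumes the basic Relief weight update for two classes with features scaled to [0, 1], and it measures stability as the mean pairwise Jaccard similarity between the top-k feature subsets selected on random subsamples of the data. The Jaccard measure, the subsampling scheme, the toy data, and all function names are illustrative assumptions; IW-Relief, FREL, and the Spark parallelization are not reproduced here.

```python
import random
from itertools import combinations


def _nearest(X, y, i, same_class):
    """Index of the nearest neighbour of X[i] (Manhattan distance) with the
    same class (nearest hit) or a different class (nearest miss)."""
    best, best_dist = None, float("inf")
    for j in range(len(X)):
        if j == i or (y[j] == y[i]) != same_class:
            continue
        dist = sum(abs(a - b) for a, b in zip(X[i], X[j]))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def relief_weights(X, y, n_iter, rng):
    """Basic Relief: reward features that differ on the nearest miss and
    penalise features that differ on the nearest hit."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hit, miss = _nearest(X, y, i, True), _nearest(X, y, i, False)
        if hit is None or miss is None:
            continue  # no usable neighbour in this (sub)sample
        for j in range(d):
            w[j] += (abs(X[i][j] - X[miss][j]) - abs(X[i][j] - X[hit][j])) / n_iter
    return w


def top_k(w, k):
    """Indices of the k features with the largest weights."""
    return set(sorted(range(len(w)), key=lambda j: -w[j])[:k])


def jaccard(a, b):
    """Jaccard similarity of two feature subsets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def mean_pairwise_jaccard(subsets):
    """Average Jaccard similarity over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def selection_stability(X, y, k, n_runs, sample_frac, rng, n_iter=50):
    """Perturb the data by subsampling, select top-k features on each
    subsample, and report how similar the selected subsets are."""
    size = max(2, int(sample_frac * len(X)))
    subsets = []
    for _ in range(n_runs):
        idx = rng.sample(range(len(X)), size)
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        subsets.append(top_k(relief_weights(Xs, ys, n_iter, rng), k))
    return mean_pairwise_jaccard(subsets)


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.05, 0.5], [0.9, 0.4], [1.0, 0.8], [0.95, 0.1]]
y = [0, 0, 0, 1, 1, 1]

print(relief_weights(X, y, n_iter=20, rng=random.Random(0)))
print(selection_stability(X, y, k=1, n_runs=5, sample_frac=0.8, rng=random.Random(1)))
```

A stability value close to 1 means the same features are selected regardless of which samples are dropped; a value close to 0 means the selection depends heavily on the particular sample. The nearest-neighbour search inside `relief_weights` dominates the cost, which is the kind of step a distributed framework such as Spark could spread across partitions.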

Keywords/Search Tags:

Feature Selection, High Dimensional, Stability, Spark, Streaming

Related items

1	Research On Online Streaming Feature Selection Algorithms
2	Research On Feature Selection And Its Stability For High-dimensional Data
3	Research On The Stability Of Feature Selection For High-dimensional Small Sample Data
4	Mdl-based Feature Selection For High Dimensional Data
5	Research On Feature Selection Algorithms Of High-dimensional Samples Based On Data Characteristics
6	Study On Feature Selection And Ensemble Learning Based On Feature Selection For High-Dimensional Datasets
7	A Two-stage Hybrid Ant Colony Optimization Algorithm For High-dimensional Feature Selection
8	Online Learning Algorithms For Classification Of Streaming Data
9	Research On Feature Selection Methods For High-Dimensional Classification
10	Research On Population Distribution For High-dimensional Optimization And Its Application In Feature Selection