
Research on Parallel Feature Selection Algorithm Based on Spark

Posted on: 2021-03-11    Degree: Master    Type: Thesis
Country: China    Candidate: X J Zhang    Full Text: PDF
GTID: 2438330602498347    Subject: Software engineering
Abstract/Summary:
Today's society has entered the era of big data, and data in every field are growing rapidly. How to mine valuable knowledge from massive, noisy, and strongly interfering data poses an unprecedented challenge. Feature selection is a key step in current machine learning and data analysis: removing irrelevant and redundant features makes subsequent analysis fast and reliable. Among the many feature selection algorithms, attribute reduction based on rough sets is an effective method. After Pawlak proposed classical rough set theory, many scholars developed more practical extensions such as fuzzy rough sets, neighborhood rough sets, and variable precision rough sets. However, obtaining an optimal attribute reduct requires traversing a solution space of 2^n candidate attribute subsets (n being the number of attributes), which is computationally infeasible, so heuristic forward search is used to find suboptimal subsets. In addition, solving a neighborhood rough set on a single machine requires comparing the distance between each sample and every other sample, so the time complexity reaches O(|C||U|^2), where |C| is the number of attributes and |U| is the number of samples; processing large datasets can therefore take a very long time. To address the high time cost of traditional rough set attribute reduction and to find better feature subsets, this thesis focuses on the following two aspects:

(1) Attribute reduction (feature selection) is an important part of data preprocessing, and attribute dependency is most often used as the criterion for filtering attribute subsets. A fast dependency calculation method, FDC, is designed: it computes the dependency by directly locating the objects that belong to the relative positive region, without computing the relative positive region in advance, which gives a clear speed improvement over the traditional method. In addition, an improved whale optimization algorithm (WOA) is applied effectively to rough set attribute reduction. Combining these two methods, a Spark-based distributed rough set attribute reduction algorithm, sp-wofrst, is proposed and compared with another Spark-based rough set attribute reduction algorithm, sp-rst, on two synthetic large datasets. Experimental results show that the proposed sp-wofrst algorithm outperforms sp-rst in both accuracy and speed.
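For context, the dependency criterion referred to above is the classical rough set measure γ_C(D) = |POS_C(D)| / |U|: the fraction of objects whose equivalence class under the condition attributes C falls entirely inside one decision class. The sketch below is a plain single-machine computation of that definition, not the thesis's FDC method (whose internals the abstract does not give); all function and variable names are illustrative.

```python
from collections import defaultdict

def dependency(samples, cond_attrs, decision):
    """Classical rough set dependency: gamma_C(D) = |POS_C(D)| / |U|.

    samples:    list of dicts mapping attribute name -> value
    cond_attrs: condition attribute names (the candidate subset C)
    decision:   decision attribute name (D)
    """
    # Group objects into equivalence classes by their values on C.
    classes = defaultdict(list)
    for s in samples:
        classes[tuple(s[a] for a in cond_attrs)].append(s)

    # An equivalence class lies in the positive region iff all of its
    # objects share the same decision value (consistent w.r.t. D).
    positive = sum(
        len(group)
        for group in classes.values()
        if len({s[decision] for s in group}) == 1
    )
    return positive / len(samples)

# Tiny usage example: both equivalence classes are consistent, so gamma = 1.0
data = [
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "no"},
]
print(dependency(data, ["a1", "a2"], "d"))  # -> 1.0
```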
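The abstract does not describe the improved WOA itself. In metaheuristic attribute reduction generally, a candidate reduct is encoded as a 0/1 mask over the attributes and scored by a fitness that trades dependency against subset size, e.g. fitness(R) = α·γ_R(D) + (1 − α)·(|C| − |R|)/|C|. The snippet below sketches such a fitness, reusing dependency() from the previous sketch; this particular form and the weight α are common conventions, assumed here rather than taken from the thesis.

```python
def reduct_fitness(samples, all_attrs, mask, decision, alpha=0.9):
    """Score a candidate reduct encoded as a 0/1 mask over all_attrs.

    Balances classification ability (dependency) against subset size.
    The form and alpha=0.9 are common conventions, assumed here, not
    taken from the thesis. Requires dependency() defined above.
    """
    chosen = [a for a, bit in zip(all_attrs, mask) if bit]
    if not chosen:                       # empty subset: worst score
        return 0.0
    gamma = dependency(samples, chosen, decision)
    size_term = (len(all_attrs) - len(chosen)) / len(all_attrs)
    return alpha * gamma + (1 - alpha) * size_term
```

A WOA-style search would then maintain a population of masks and iteratively move them toward the best-scoring one, returning the highest-fitness mask as the reduct.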
(2) When solving the positive region of a traditional neighborhood rough set, the distance between each sample and every other sample must be compared, so the neighborhood rough set attribute reduction algorithm has O(|C||U|^2) time complexity. Faced with large datasets, such reduction algorithms cannot obtain a result within an acceptable time, and parallel computation is an effective solution to this kind of problem. This thesis presents a fast parallel attribute reduction algorithm based on neighborhood rough set theory. By building a neighborhood bucket model, the samples in the dataset are partitioned into a series of buckets according to distance, and the symmetry and transitivity of the neighborhood relation are exploited to reduce the computational cost of solving the positive region. Both a stand-alone version, which takes full advantage of multi-core processors, and a cluster version for multi-node clusters, implemented on the popular Spark platform, are provided. Experimental results on UCI datasets and large-scale datasets show that the parallel algorithm improves efficiency significantly compared with traditional algorithms.
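The abstract does not spell out how the buckets are constructed. One standard way to realize "bucketing by distance" is to index each sample by its distance to a fixed pivot: by the triangle inequality |d(x,p) − d(y,p)| ≤ d(x,y), any two δ-neighbors must land in the same or an adjacent bucket, and symmetry lets each pair be tested once. The sketch below illustrates that idea under those assumptions; the pivot choice and names are hypothetical, and this is not necessarily the thesis's exact model.

```python
import math
from collections import defaultdict

def neighborhood_buckets(points, delta):
    """Partition points (tuples of coordinates) into buckets of width
    delta by distance to a pivot. By the triangle inequality, any two
    points within delta of each other fall in the same or an adjacent
    bucket. Pivot-based bucketing is an assumption, not the thesis's
    stated construction; delta must be > 0.
    """
    pivot = points[0]
    buckets = defaultdict(list)
    for x in points:
        buckets[int(math.dist(x, pivot) // delta)].append(x)
    return buckets

def delta_neighbor_pairs(points, delta):
    """Find all delta-neighbor pairs, testing each unordered pair once
    (symmetry) and only within a bucket or across adjacent buckets."""
    buckets = neighborhood_buckets(points, delta)
    pairs = []
    for k, group in buckets.items():
        # Pairs inside the same bucket.
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if math.dist(group[i], group[j]) <= delta:
                    pairs.append((group[i], group[j]))
        # Pairs with the next bucket only; symmetry covers bucket k-1.
        for x in group:
            for y in buckets.get(k + 1, []):
                if math.dist(x, y) <= delta:
                    pairs.append((x, y))
    return pairs
```

In a cluster setting, each bucket together with its adjacent bucket forms a natural unit of data partitioning, which is presumably what makes such a model amenable to a Spark implementation.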
Keywords/Search Tags: parallel computation, rough set, Apache Spark, whale optimization algorithm, feature selection, attribute reduction