
Study On Feature Selection Algorithm Based On Structured Data

Posted on: 2022-09-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X T Cui
Full Text: PDF
GTID: 1488306728482414
Subject: Computer application technology
Abstract/Summary:
Data acquisition technology is developing rapidly in the era of "big data". As data volume and data dimensionality grow, processing and analyzing data has become a challenge. Data in fields such as biomedicine, e-commerce, and computer vision contain large numbers of irrelevant and redundant features; feature selection can pick out valuable features and helps improve the performance of data preprocessing, data classification, and data visualization. This dissertation studies filter, wrapper, and hybrid feature selection algorithms. A filter algorithm selects feature subsets according to intrinsic characteristics of the data and does not rely on the performance of a learning algorithm; its computational cost is low, but its classification performance is uncompetitive. A wrapper algorithm uses the performance of a learning algorithm as its evaluation criterion, so it achieves excellent classification performance but relatively high time complexity. A hybrid algorithm combines the advantages of the filter and wrapper algorithms to balance classification performance and time complexity. However, existing feature selection algorithms still have three shortcomings: (1) they ignore the instance distribution, the instance force coefficient distribution, and feature redundancy; (2) they converge too quickly and easily fall into local optima; (3) their candidate subsets are not rich enough. Targeting these three shortcomings, this dissertation proposes three improved feature selection algorithms, evaluated by criteria such as classification accuracy and the number of selected features. The main contributions are as follows:

(1) A multi-directional Relief algorithm (MRelief) is proposed. First, a multi-directional neighbor search finds all neighbors in different directions within a distance threshold, obtaining regularly distributed neighbor samples; the feature weights output by MRelief are therefore more accurate than those of Relief. Second, MRelief introduces a novel objective function that incorporates an instance force coefficient to reduce the influence of noise, improving the classification accuracy of Relief. Then, combined with Maximum Pearson Maximum Distance (MPMD), MRelief generates a promising candidate subset; this subset generation method helps reduce redundancy between features. Finally, a multi-class extension method handles multi-class data. Extensive experiments on 9 UCI data sets and 11 microarray data sets show that MRelief performs significantly better than eight other algorithms.

(2) A global chaotic bat algorithm (GCBA) is proposed. First, GCBA initializes the population with chaotic mapping so as to cover the entire solution space. In addition, to enhance global search ability, GCBA records the local optimal position and the global optimal position whenever a bat updates its position. Finally, to improve exploitation ability, an improved transfer function maps the continuous search space to a discrete binary search space. To verify its effectiveness, GCBA and six comparison algorithms were tested on 12 UCI data sets and 5 microarray data sets; the results show that GCBA achieves better classification accuracy and faster convergence than the other algorithms.

(3) A hybrid improved dragonfly algorithm (HIDA) is presented. First, to generate a promising subset, features with larger weights are selected into the candidate subset with high probability, while features with smaller weights still have a small probability of being selected. This enhances the diversity of the candidate subset and keeps HIDA from falling into local optima. Second, dynamic swarming factors help balance exploitation and exploration. Last, to enhance exploitation, quantum local and global optima are introduced into the position-updating mechanism. HIDA was evaluated on 8 UCI data sets and 10 microarray data sets; the results show that HIDA is superior to six other algorithms.

In summary, this dissertation proposes three improved feature selection algorithms to address the shortcomings of the filter, wrapper, and hybrid approaches. On structured data, the experimental results show that MRelief, GCBA, and HIDA achieve excellent classification performance, which helps improve data preprocessing, data classification, and data visualization.
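For context on the Relief family that MRelief extends, the following is a minimal sketch of classic binary-class Relief in pure Python. This is the baseline only, not MRelief itself: the multi-directional neighbor search, instance force coefficient, and MPMD-based subset generation described above are not shown, and the function name and data layout are illustrative assumptions.

```python
import random

def relief(X, y, n_iter=None, seed=0):
    """Classic Relief (baseline sketch, two classes): a feature's weight
    rises when it differs more toward the nearest miss (opposite class)
    than toward the nearest hit (same class)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Per-feature value ranges, used to normalize diff() to [0, 1].
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [max(hi[j] - lo[j], 1e-12) for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    w = [0.0] * d
    m = n_iter or n
    for _ in range(m):
        i = rng.randrange(n)
        x, cls = X[i], y[i]
        # Nearest neighbor of the same class (hit) and of the other class (miss).
        hit = min((k for k in range(n) if k != i and y[k] == cls),
                  key=lambda k: dist(x, X[k]))
        miss = min((k for k in range(n) if y[k] != cls),
                   key=lambda k: dist(x, X[k]))
        for j in range(d):
            w[j] += (diff(j, x, X[miss]) - diff(j, x, X[hit])) / m
    return w
```

On a toy data set where only the first feature separates the classes, the returned weight vector ranks that feature above a noise feature, which is the property MRelief's multi-directional neighbor search is designed to sharpen.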
Keywords/Search Tags:Feature selection, Filter algorithm, Wrapper algorithm, Dragonfly algorithm, Bat algorithm, Relief, High-dimensional and small-sample data