| With the continuous development of communication and storage technology, the scale of data keeps growing. While large datasets provide abundant information, they also give rise to the "curse of dimensionality". Dimensionality reduction has therefore received increasing attention. Its core is to transform high-dimensional features into low-dimensional ones, and a variety of dimensionality reduction methods have been proposed and discussed. Among these methods, feature selection algorithms are widely used because they are fast and easy to interpret. However, a single feature selection algorithm used alone cannot achieve both efficiency and accuracy. How to combine the advantages of different feature selection algorithms to improve both efficiency and accuracy has therefore become a research focus for a growing number of researchers, and it is the main focus of this thesis. The thesis was supported by the Natural Science Foundation of Zhejiang Province. Its main research and achievements are summarized as follows: (1) Building on a mutual-information-based filter feature selection algorithm, the improved method uses cosine distance to eliminate redundancy. It is then combined with a wrapper feature selection algorithm to form a two-stage feature selection method. In the training stage, simulated annealing is used to optimize the thresholds of the two feature selection stages according to the accuracy of the different downstream learning algorithms. The improved method combines the speed of filter feature selection with the high accuracy of wrapper feature selection. It not only improves the accuracy of the downstream learning algorithm, but also reduces the dimension of the final feature subset by identifying key features more effectively. (2) To address the feature independence assumption of the Naive Bayes algorithm, this thesis improves the algorithm from the perspective of feature selection. Hierarchical clustering groups highly correlated features, and from these groups the features that contribute most to classification are selected according to the mutual information criterion, which reduces the dependence among the selected features. In the training stage, particle swarm optimization is used to optimize the number of clusters according to the accuracy of the Naive Bayes algorithm, which further improves the accuracy of the model. (3) Based on the PyQt platform and the two improved algorithms above, a graphical user interface (GUI) is designed and developed. The software can be applied to gene microarray data: it identifies the key features, reports the best accuracy and the smallest number of features obtained by the selected algorithm, and improves the user experience for the operator. |
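
The filter stage described in contribution (1) can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes discrete numeric features, and the function names, the toy data, and the two threshold values are hypothetical. The wrapper stage and the simulated annealing search over the thresholds are not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_stage(columns, labels, mi_threshold, cos_threshold):
    """Keep relevant features, then drop those redundant with a kept feature.

    columns maps a feature name to its list of discrete numeric values.
    Both thresholds are hypothetical tuning knobs chosen by hand here.
    """
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    # Relevance: rank the features that pass the MI threshold, best first.
    ranked = sorted(
        (name for name in columns if scores[name] >= mi_threshold),
        key=lambda name: scores[name],
        reverse=True,
    )
    # Redundancy: skip a feature that is too similar to one already kept.
    kept = []
    for name in ranked:
        if all(
            abs(cosine_similarity(columns[name], columns[k])) < cos_threshold
            for k in kept
        ):
            kept.append(name)
    return kept, scores


labels = [0, 0, 1, 1, 0, 1]
columns = {
    "f1": [1, 1, 2, 2, 1, 2],  # determines the label exactly
    "f2": [2, 2, 4, 4, 2, 4],  # f1 scaled by 2, so it is redundant with f1
    "f3": [1, 2, 1, 2, 2, 1],  # only weakly related to the label
}
selected, scores = filter_stage(columns, labels, mi_threshold=0.1, cos_threshold=0.95)
print(selected)
```

On this toy data only `f1` survives: `f3` falls below the relevance threshold, and `f2` is dropped because its cosine similarity to `f1` exceeds the redundancy threshold.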
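
The feature grouping idea in contribution (2) can be sketched in the same spirit. This is an assumption-laden illustration rather than the thesis method: it uses average-linkage agglomerative clustering with distance 1 - |Pearson correlation|, a choice the abstract does not specify, and all names and data are hypothetical. The number of clusters is fixed by hand here, whereas the thesis tunes it with particle swarm optimization.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def abs_correlation(u, v):
    """Absolute Pearson correlation (0.0 if either vector is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return abs(cov / (su * sv)) if su and sv else 0.0


def cluster_features(columns, n_clusters):
    """Average-linkage agglomerative clustering of feature names."""
    clusters = [[name] for name in columns]

    def distance(a, b):
        # Mean of 1 - |r| over all cross-cluster feature pairs.
        pairs = [(x, y) for x in a for y in b]
        total = sum(1 - abs_correlation(columns[x], columns[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def select_representatives(columns, labels, n_clusters):
    """From each cluster keep the feature with the highest MI with the label."""
    return [
        max(cluster, key=lambda name: mutual_information(columns[name], labels))
        for cluster in cluster_features(columns, n_clusters)
    ]


labels = [0, 0, 1, 1, 0, 1]
columns = {
    "g1": [1, 1, 2, 2, 1, 2],
    "g2": [1, 1, 2, 2, 2, 2],  # correlated with g1
    "g3": [1, 2, 1, 2, 2, 1],
    "g4": [1, 2, 1, 2, 1, 1],  # correlated with g3
}
clusters = cluster_features(columns, n_clusters=2)
representatives = select_representatives(columns, labels, n_clusters=2)
print(clusters, representatives)
```

Keeping one representative per cluster of correlated features is one way to make the inputs passed to Naive Bayes less dependent on each other, which is the motivation stated in the abstract.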