
Research On Feature Selection And Sample Imbalance For Gene Expression Data

Posted on: 2021-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2480306032966369
Subject: Statistics
Abstract/Summary:
Genes play an important role in recording and transmitting genetic information. Whether a gene is expressed correctly directly determines the safety and health of life. Gene expression data provide a large amount of information for studying diseases caused by gene mutation. How to use gene expression data effectively for disease diagnosis and related drug research has become an important subject of interdisciplinary research across medicine, bioinformatics, statistics and other fields. Gene expression data usually contain thousands of genes, so reducing the feature dimension is an essential step in data preprocessing. Feature selection can pick out the features that represent the information in the data set, remove the impact of noise, and reduce the workload of subsequent sample classification. In addition, many existing data sets have an imbalanced sample distribution, which makes classification more difficult. Therefore, for gene expression data, this thesis further studies feature selection, sample balancing and sample classification. The main work is as follows:

(1) Metrics are very important for feature selection methods, and different metrics often produce different results. In this thesis, three common metrics are used in the FCBF algorithm, and KNN, discriminant analysis and random forest are used as classifiers to assess the impact of each metric on the FCBF algorithm. The results show that SU and HSU have similar control over the feature subset, but in terms of classification accuracy SU is more stable and fluctuates less. NSU can control the size of the feature subset and achieves high classification accuracy on different data sets.

(2) In general, the size of the feature subset can be controlled by adjusting parameters. Because the FCBF algorithm itself cannot control the size of the feature subset, we introduce a parameter control method that does so, and give the FCBF algorithm with parameters. Experimental results show that feature subsets that are too small or too large are not conducive to sample classification and reduce classification accuracy. On the three data sets BREAST_A, COLON and MULTI_A, the feature subset selected with the original parameters removes redundancy effectively, but it also removes a lot of useful information.

(3) An improved SMOTE algorithm based on the distance to the sample mean is proposed to solve the problem of sample imbalance. In resampling, samples are ordered by their distance from the sample mean, from large to small, and interpolation is carried out at the quantiles, which avoids the problem of a fuzzy class boundary. The results of applying the sampling algorithm before and after dimension reduction are also compared. The experimental results show that sampling under the optimal parameters is not necessarily better than sampling under the original parameters. For data sets with a strongly imbalanced sample distribution, sampling before dimension reduction works best; the remaining data sets can be sampled after dimension reduction.

(4) For sample classification, this thesis proposes an algorithm based on the Euclidean distance between the sample to be classified and the mean of each class of samples in the training set. Experiments show that the sample mean classification algorithm can replace the KNN and discriminant analysis classifiers, and that it achieves better classification on all data sets except BRAIN, MULTI_A and ALL_AML.

Compared with existing feature selection and classification algorithms, the sample balancing and sample mean classification methods proposed in this thesis are simpler and more applicable. The selection criteria can be changed at any time according to the characteristics of the data and the requirements on the results, and the methods show a better classification effect on most data sets. However, feature selection is a process of removing redundancy, while balancing samples is a process of adding samples. How to combine the two processes better, reducing the loss of information while avoiding added redundant information and noise, is a problem that needs further study.
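As an illustration of the kind of metric used in item (1), the sketch below computes symmetrical uncertainty (SU), the standard FCBF measure, in its textbook form SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). It assumes the features have already been discretized, since gene expression values are continuous; the function names are hypothetical, and the HSU and NSU variants studied in the thesis are not reproduced here because the abstract does not define them.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(x, y):
    """Textbook SU between two discrete sequences of equal length.

    Returns a value in [0, 1]: 0 for independent variables,
    1 when each variable fully determines the other.
    """
    h_x = entropy(x)
    h_y = entropy(y)
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X; Y)
    if h_x + h_y == 0:                   # both variables are constant
        return 0.0
    return 2.0 * mutual_info / (h_x + h_y)
```

In an FCBF-style filter, a feature would typically be kept when its SU with the class label exceeds a threshold and no already-selected feature is more strongly correlated with it than the label is.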
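For context on item (3), the following is a minimal sketch of the interpolation step in standard SMOTE, which the thesis takes as its starting point. It is not the improved variant: the ordering by distance from the sample mean and the quantile-based interpolation are not specified in enough detail in the abstract to reproduce. The function name, the parameters `n_new`, `k` and `seed`, and the use of plain Python lists are assumptions made for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by standard SMOTE interpolation.

    minority: list of feature vectors (lists of floats), at least two samples.
    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two minority samples, the new samples stay inside the region spanned by the minority class.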
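The classifier described in item (4) can be sketched roughly as follows, assuming it amounts to assigning a sample to the class whose training mean is closest in Euclidean distance. The function names and the toy data are hypothetical, and any refinements made in the thesis itself are not shown.

```python
import math
from collections import defaultdict


def fit_class_means(samples, labels):
    """Return a dict mapping each class label to the mean vector of its samples."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(sample)
    means = {}
    for label, rows in grouped.items():
        # Average each feature (column) over the rows of this class
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def predict(means, sample):
    """Assign the sample to the class whose mean is nearest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(sample, means[label]))


# Toy training data with two features per sample (made-up values)
train_x = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
train_y = ["normal", "normal", "tumor", "tumor"]
class_means = fit_class_means(train_x, train_y)
```

Calling `predict(class_means, [0.9, 1.1])` on the toy data assigns the sample to the "normal" class, because it lies much closer to that class mean. Unlike KNN, this approach stores only one mean vector per class, so prediction cost does not grow with the number of training samples.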
Keywords/Search Tags: gene expression data, feature selection, parameter discussion, sample balance, sample classification