| Modern medical studies have shown that disease genes are of great significance for the corresponding human diseases. Many common complex diseases arise from the interaction of disease genes with the internal and external environment. The genome-wide association study (GWAS) of age-related macular degeneration published in Science in 2005 opened the way for GWAS of complex diseases. Much has been achieved in susceptibility analyses of individual single nucleotide polymorphism (SNP) loci, but the results of these studies cannot fully explain the genetic mechanisms of complex diseases. Most methods for analyzing the pathogenicity of single-site SNPs focus only on the marginal effect between a locus and the disease, so an SNP site strongly associated with the disease is identified as a pathogenic site. When the marginal effect is weak, however, SNPs with strong pathogenicity are easily overlooked. A growing number of studies have shown that interactions between SNPs play an important role in the genetic variation of complex diseases. Some existing methods do consider the interactions between SNPs. In studies of complex diseases, however, the accurate localization of pathogenic SNPs remains an urgent problem because main effects and interaction effects are superimposed. To address this problem, this paper proposes a method based on feature selection algorithms to identify pathogenic SNPs: by integrating multiple feature selection methods and optimization algorithms, it achieves more accurate identification of pathogenic SNP sites. The method not only takes into account the main effect of a single site and the interaction effects between multiple sites, but also allows the combination of methods to be adjusted as required, which gives it good flexibility and extensibility. Based on the analysis and study of whole-genome SNPs, the research results obtained in this paper are as 
follows. 1. In biology, with the development of biotechnology and the growth of biological data, feature selection has become a prerequisite for constructing models and analyzing data. To address the problem of identifying pathogenic SNP loci, we apply four feature selection methods, namely the chi-square test of independence, ReliefF, random forest, and GA-SVM based on particle swarm optimization (GA-SVM-PSO), to simulated data sets and analyze the experimental results. The results show that the chi-square test of independence performs poorly: although it has some ability to identify individual SNP loci, it cannot accurately find all of the pathogenic SNP loci. ReliefF ranks the pathogenicity of SNP loci according to a feature weight vector and has some ability to identify interacting sites, but it is susceptible to noisy data; its results on the simulated data are similar to those of the chi-square test of independence, and its effect is not significant. The random forest algorithm scores the pathogenicity of each locus by computing the Gini importance value of each SNP site. It can identify loci with strong marginal effects in high-dimensional data and can also effectively identify interactions; the experimental results show that it identifies pathogenic SNPs effectively. GA-SVM-PSO is a wrapper feature selection method that integrates machine learning with optimization algorithms; it can effectively identify pathogenic SNP loci in data sets with interactions and can return a pathogenic SNP subset of a specified size, but it is computationally complex and time-consuming. 2. By comparing and analyzing the experimental results of the four feature selection methods on the simulated data sets, we propose a new method that combines random forest and GA-SVM-PSO. This method uses the random forest algorithm to calculate the Gini importance 
value of each SNP site, and the top-ranked SNP sites form a new SNP subset. The GA-SVM-PSO algorithm is then used to screen this subset. Experiments on simulated data sets and real data show that the method proposed in this paper is superior to random forest, ReliefF, GA-SVM-PSO, and other methods in identifying pathogenic SNP sites, and thus serves as a method for identifying the pathogenic SNP loci of common complex diseases. |
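
The limitation of single-site tests described in this abstract can be illustrated with a small toy example. The sketch below is not the implementation used in the paper: the function name `chi_square_stat`, the binary genotype coding, and the 100-sample data set are all assumptions made for illustration. Disease status is defined as the XOR of two SNPs, so each SNP alone shows no marginal association, while the pair taken together is perfectly associated with the disease.

```python
from collections import Counter
from itertools import product


def chi_square_stat(xs, ys):
    """Pearson chi-square statistic for independence of two categorical sequences."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    row_counts = Counter(xs)
    col_counts = Counter(ys)
    stat = 0.0
    for x, y in product(row_counts, col_counts):
        expected = row_counts[x] * col_counts[y] / n
        observed = joint_counts.get((x, y), 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Synthetic toy data (assumption, not from the paper): two binary-coded SNPs,
# every genotype combination equally frequent, disease status = SNP1 XOR SNP2.
genotypes = [(a, b) for a in (0, 1) for b in (0, 1)] * 25  # 100 samples
snp1 = [g[0] for g in genotypes]
snp2 = [g[1] for g in genotypes]
status = [a ^ b for a, b in genotypes]

single1 = chi_square_stat(snp1, status)        # marginal test on SNP1 alone
single2 = chi_square_stat(snp2, status)        # marginal test on SNP2 alone
pair_stat = chi_square_stat(genotypes, status)  # joint test on the SNP pair

print(single1, single2, pair_stat)
```

In this constructed case both single-site statistics are zero, so a ranking based on marginal effects alone would discard both SNPs, whereas the joint statistic is large. Methods that evaluate SNP subsets rather than single sites, such as the wrapper approach described in the abstract, are intended to catch this kind of signal.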