Differential Evolution Feature Selection Algorithm For Gene Expression Data On Tumor Subtypes Analysis

Posted on: 2019-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: G L Xie
GTID: 2370330548488165
Subject: Bioinformatics
Abstract/Summary:
DNA (deoxyribonucleic acid), the genetic material of life, contains thousands of genes whose expression levels can be monitored simultaneously and measured in a single experiment using chip hybridization and sequencing techniques. Research based on massive gene expression data helps us understand the mysteries of life. In particular, next-generation sequencing (NGS) technology, which is developing rapidly and generating large amounts of genomic data, is gradually replacing Sanger sequencing. In addition, the emerging third- and fourth-generation sequencing technologies have the advantages of long fragments and single-molecule sequencing. The application of these high-throughput techniques overcomes the limitations of traditional experimental methods, and their rapid development is accompanied by the accumulation of genome data from many organisms. Moreover, many large-scale, interdisciplinary and multinational research projects have been launched and developed in succession, and the corresponding integrated biological database platforms have made tremendous contributions to scientific research and precision medicine. Mining patterns in these data and deciphering the laws behind them have therefore become important topics. In particular, various bioinformatics methods have been applied successfully to problems such as classification, clustering and association analysis.

Machine learning algorithms can identify biologically meaningful genes from patterns in gene expression data and thereby support clinical diagnosis and treatment. Among them, various feature selection methods have become routine tools for pattern recognition and gene filtering in high-throughput gene expression datasets. Although these supervised feature selection methods are computationally efficient and fast on large datasets, their efficiency drops rapidly as the number of dimensions grows in high-dimensional data such as gene expression profiles, and the selection task can even become an NP-hard problem. The successful development of optimization search algorithms addresses this problem. The most widely used heuristic optimization search algorithms, such as the various evolutionary algorithms, have developed rapidly. Among the evolutionary algorithms, the genetic algorithm and the differential evolution algorithm are particularly competitive in optimization search, and differential evolution has gradually become a focus of attention because of its robustness and fast convergence.

We therefore use the differential evolution feature selection algorithm, a population-based stochastic optimization algorithm proposed by Ahmed Al-Ani et al., to solve the feature gene selection problem on gene expression datasets. We identify shortcomings of the algorithm and improve it for application to gene expression datasets. In the improved algorithm, the scaling factor that controls the evolution rate is treated as following a skewed distribution. Then, based on the fact that the spatial structure of the chromosome changes, the fixed step of the gene arrangement is revised so that the population evolution fluctuates. We also take into account the class imbalance that is common in gene expression datasets, using training and test datasets that have almost the same class composition ratio. The performance of the classifier model is evaluated by weighted accuracy to reduce the effect of sacrificing the minority class to misclassification. We further account for the size of the gene subset: following the penalty strategy proposed by Dashtban M. et al., the fitness function is composed of the weighted accuracy and a penalty term. Furthermore, we were inspired by the work of Laura Cantini et al. on the microRNA-mRNA interaction network underlying molecular subtypes. In our tumor subtype research, the feature gene subsets optimized by the algorithm are used to study tumor subtypes. A sample relation network was constructed from the feature gene subset and filtered with the planar maximally filtered graph (PMFG), and a topological graph splitting algorithm was used to explore the feasibility of dividing the tumor subtypes.

In this study, the improved differential evolution feature selection algorithm not only simulates macroscopic species evolution but also imitates the evolution of the spatial neighboring positional relationships between microscopic molecules, in order to simulate the activity of the study object more closely. On real data, the algorithm is computationally efficient and gives good results. Moreover, in the tumor subtype exploration we use the feature genes as a medium to build a sample relationship network, in which each gene can be efficiently discriminated. The control group is then used as an independent community that serves as a reference. The PMFG algorithm is used to filter the network, and the topological partitioning method subdivides the tumor subtypes; the corresponding feature genes are worth further study and analysis. Although the biological significance of the topological partitioning has not been validated, our work provides a useful reference in the context of big data oncology.
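
The fitness function mentioned in the abstract, weighted accuracy combined with a penalty on gene subset size, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the formulas, so the sketch assumes that weighted accuracy means the mean of per-class recalls and that the penalty is proportional to the fraction of genes selected. The names `weighted_accuracy`, `fitness` and `penalty_weight` are hypothetical.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally.

    Assumption: the abstract does not define "weighted accuracy";
    this balanced form is one common choice for imbalanced data.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def fitness(y_true, y_pred, n_selected, n_genes, penalty_weight=0.1):
    """Weighted accuracy minus a penalty that grows with subset size.

    Assumption: a linear penalty on the fraction of genes selected;
    the penalty used in the thesis may have a different form.
    """
    if not 0 < n_selected <= n_genes:
        raise ValueError("n_selected must be between 1 and n_genes")
    penalty = penalty_weight * n_selected / n_genes
    return weighted_accuracy(y_true, y_pred) - penalty


# Toy example: 8 majority-class samples and 2 minority-class samples.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10  # a classifier that ignores the minority class

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:   ", plain)
print("weighted accuracy:", weighted_accuracy(y_true, y_pred))
print("fitness:          ", fitness(y_true, y_pred, n_selected=50, n_genes=1000))
```

In this toy case the plain accuracy is 0.8 even though every minority-class sample is misclassified, while the class-weighted accuracy is 0.5. With equal predictions, a smaller gene subset receives a higher fitness under the assumed penalty.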
Keywords/Search Tags: Differential Evolution algorithm, Feature gene selection algorithm, Gene expression data, Samples relationship network, Tumor subtypes