Study Of Gene Expression Data Analysis Based On Pattern Recognition Methods

Posted on: 2013-01-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H J Cheng
GTID: 1228330377459380
Subject: Detection Technology and Automation
Abstract/Summary:
DNA microarray technology makes it possible to analyze the expression levels of thousands of genes simultaneously, and it gives access to large amounts of gene expression profiles. How to exploit these genome-wide data and extract useful biological and medical information from them has become an important research issue in the post-genomic era. Gene clustering assigns genes to common expression categories according to their degree of similarity, which makes it easier to infer the functions of unknown genes from those of known genes. Tumor subtypes that are morphologically similar can be identified accurately and objectively from gene expression profiles. Aiming at the clustering of gene expression profiles and the recognition of tumor subtypes, the following innovative work has been carried out.

(1) The traditional Euclidean distance gives all features equal weight, so the importance of individual features is not evaluated. For this reason, a novel clustering algorithm that optimizes feature weights based on the distribution of each feature is proposed. The algorithm first calculates each feature's intra-cluster condensation ability and inter-cluster distinguishing ability, and then assigns different weights to different features. These weights enter the similarity computation between a gene expression profile and the neurons, so the importance of each feature is reflected in both the similarity calculation and the weight adjustment. Experimental results demonstrate that this similarity computation method effectively improves clustering precision for gene expression data. Using restrictive data to optimize feature weights in gene clustering raises several problems. To improve clustering performance, a semi-supervised clustering algorithm based on the optimization of feature weights is proposed. First, features are optimized according to the specified restrictive data.
Features that effectively distinguish the restrictive data are assigned larger weights. To avoid bias in the optimization, the effect of unrestrictive data and the uneven distribution of restrictive data are also considered. In addition, the restrictive data may contain inconsistencies. Whether inconsistent restrictive data satisfy the must-link relation is analyzed through a probability relationship, and this probability is incorporated into the analysis of feature weights. Experimental results demonstrate that optimizing feature weights effectively improves the performance of gene expression profile clustering.

(2) To solve the problem that the results of traditional clustering cannot be controlled, a gene clustering algorithm based on artificial feedback is proposed. The self-organizing map (SOM) algorithm is used as the benchmark algorithm. To perform feedback, several cluster points are selected from the data collection by the maximum-minimum theory, and the dense region containing each cluster point is determined by local density methods. The neuron structure is formed by connecting different density regions, and the neuron structures are trained iteratively to improve accuracy. While it is running, the algorithm can dynamically adjust the distribution of genes among gene classes through manual analysis. To overcome the immutable topology of the traditional SOM clustering algorithm, a dynamic neuron topology is designed, which can be changed by dynamically inserting and deleting edges. Experimental results demonstrate that accuracy is improved and that the results satisfy the user's demands.

(3) To address the deficiencies of recent classification methods, an ensemble algorithm of particle swarm optimization (PSO) neural networks is proposed to classify tumor subtypes. Genes irrelevant to the classification are eliminated by different correlation functions, and candidate feature subsets are formed.
A BP neural network based on sensitivity analysis is adopted as the base classifier to learn the subsets, and redundant genes are further removed. The parameters and thresholds of the classifiers are optimized by the particle swarm optimization algorithm. Experiments show that the proposed method obtains better recognition rates in tumor subtype identification. Because tumor gene expression profiles have small sample sizes and high dimensionality, an ensemble classifier algorithm is proposed. The candidate subset consists of genes with higher Fisher ratio values. The feature subset that reflects the co-expression behavior and regulatory relationships of genes is established by coefficients and mutual information. Particle swarm optimization, support vector machine and k-nearest neighbor methods are combined to form two different base classifiers, and their results are combined by voting. Experimental results from lung tumor subtype recognition confirm the feasibility and effectiveness of the proposed algorithm.
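
Part (1) replaces the equally weighted Euclidean distance with one in which each feature carries its own weight. The following Python sketch illustrates that idea only. The weighting rule (variance of the cluster means divided by the mean within-cluster variance) is an assumption chosen for illustration, not the formula used in the dissertation, and the names `feature_weights` and `weighted_distance` are hypothetical.

```python
import numpy as np

def feature_weights(X, labels, eps=1e-12):
    """Illustrative rule (an assumption): a feature gets a larger weight when
    cluster means differ strongly along it and clusters are compact along it."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Per-cluster mean and variance of every feature.
    means = np.array([X[labels == c].mean(axis=0) for c in clusters])
    within = np.array([X[labels == c].var(axis=0) for c in clusters]).mean(axis=0)
    between = means.var(axis=0)
    raw = between / (within + eps)
    total = raw.sum()
    if total == 0.0:
        # No feature separates the clusters: fall back to equal weights.
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return raw / total

def weighted_distance(x, y, w):
    """Euclidean distance with a non-negative weight per feature."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))
```

With all weights equal to 1, `weighted_distance` reduces to the ordinary Euclidean distance, which is the case the abstract describes as ignoring feature importance.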
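
Part (2) selects its initial cluster points by the "maximum-minimum theory". One plausible reading, assumed here, is max-min distance (farthest-first) selection: each new point is the one whose distance to its nearest already chosen point is largest. The sketch below shows that reading only. The function name `max_min_points` and the choice of starting index are hypothetical, and the local density step and the SOM training are not shown.

```python
import numpy as np

def max_min_points(X, k, first=0):
    """Pick k row indices of X by max-min distance selection (assumed reading)."""
    X = np.asarray(X, dtype=float)
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of samples")
    chosen = [first]
    # Distance from every sample to its nearest chosen point so far.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))  # farthest from all chosen points
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```

Points chosen this way tend to be spread across the data, which makes them reasonable seeds for the dense regions that the abstract says are located afterwards.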
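
The Fisher ratio used in part (3) to form the candidate gene subset can be sketched as follows. The two-class form assumed here, squared difference of class means divided by the sum of class variances, is one common definition; the abstract does not say which variant the dissertation uses. `fisher_ratio` and `top_genes` are hypothetical names.

```python
import numpy as np

def fisher_ratio(X, y, eps=1e-12):
    """Per-gene two-class Fisher ratio: (mean_a - mean_b)^2 / (var_a + var_b).
    X has one row per sample and one column per gene."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + eps  # eps avoids division by zero
    return num / den

def top_genes(X, y, k):
    """Column indices of the k genes with the highest Fisher ratio."""
    scores = fisher_ratio(X, y)
    return np.argsort(scores)[::-1][:k]
```

Ranking genes this way keeps those whose expression differs most between the two classes relative to their spread within each class, which reduces dimensionality before any classifier is trained.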
Keywords/Search Tags: gene expression data, feature gene, ensemble classifier, restrictive data, semi-supervised clustering