
Research And Application Of Permutation Test In High Dimensional Gene Data

Posted on: 2013-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: C J Luo
GTID: 2218330362966310
Subject: Computer software and theory
Abstract/Summary:
DNA microarrays provide a powerful tool for studying gene function, and the knowledge acquired from such data has numerous practical applications in medical diagnostics, treatment planning, drug development and more. Because experiments are expensive, gene expression data usually contain a large number of genes but a small number of samples. Many traditional methods therefore have difficulty handling such high-dimensional data and, with so few training samples, tend to overfit. The data contain a large number of noisy genes as well as a large number of redundant genes. Noisy and redundant genes not only cause the classifier to overfit but also sharply increase computational complexity. Processing gene data is thus, in essence, a data mining problem with few objects and many attributes, which makes gene selection particularly important.

In this thesis, the RSCTC'2010 Discovery Challenge (Mining DNA Microarray Data for Medical Diagnosis and Treatment) is first introduced. The average recognition rate is 0.7566. For high-dimensional gene data, we then propose a two-step gene feature selection algorithm based on the permutation test, and design a novel method of measuring gene importance based on random sequences. The main contributions are as follows:

(1) To filter noisy and redundant genes, a two-step gene feature selection algorithm is proposed. Two deficiencies of current gene feature selection are first analyzed: 1) the lack of an efficient method for determining how many genes to select; 2) the lack of an efficient method for removing redundant genes. For problem 1, the permutation test is adopted, which lets the proposed algorithm select genes efficiently and process large datasets quickly. For problem 2, the minimum redundancy maximum relevance approach is adopted: noisy genes are removed first, then redundant genes. Twelve datasets from the RSCTC'2010 Discovery Challenge are used with SVM and PAM classifiers to evaluate the proposed algorithm. The experimental results show that the algorithm efficiently selects a small gene subset with high discriminative power and low redundancy.

(2) Many methods rely on the prior knowledge that within-group distances are small and between-group distances are large, and assume that the data follow a specific distribution, so they cannot accurately measure gene data whose distribution is unknown. To address this problem, a novel method of measuring gene importance based on random sequences is designed. The method uses the randomness of the decision sequence to measure gene importance and combines it with the permutation test to identify significant genes. The experimental results show that the method applies to gene data with an unknown distribution and actively selects feature genes, improving the performance of most classifiers.
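The abstract does not give the details of the permutation-test step, so the following Python sketch illustrates only the general idea, under stated assumptions: two classes labeled 0 and 1, the absolute difference of class means as the per-gene score, and a fixed significance level alpha. The function names (permutation_pvalues, select_genes), the score and the parameters are hypothetical choices for illustration, not the thesis's algorithm. The point it shows is that the number of genes kept follows from a significance threshold rather than being fixed in advance.

```python
import numpy as np

def permutation_pvalues(X, y, n_permutations=1000, seed=0):
    """Per-gene permutation p-values.

    X: array of shape (n_samples, n_genes); y: binary labels (0/1).
    The score (absolute difference of class means) is an assumed,
    illustrative statistic; the thesis may use a different one.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    def score(labels):
        # One score per gene: |mean(class 1) - mean(class 0)|
        return np.abs(X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0))

    observed = score(y)
    exceed = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_permutations):
        # Shuffling the labels breaks any real gene/class association,
        # giving a sample from the null distribution of the score.
        exceed += score(rng.permutation(y)) >= observed
    # Add-one correction keeps every p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

def select_genes(X, y, alpha=0.05, **kwargs):
    """Return indices of genes with p-value <= alpha, plus all p-values."""
    p = permutation_pvalues(X, y, **kwargs)
    return np.flatnonzero(p <= alpha), p
```

In this sketch no distribution is assumed for the expression values, which is the usual motivation for a permutation test. With thousands of genes, a multiple-testing correction would normally be applied to the p-values before thresholding; that step is omitted here.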
Keywords/Search Tags: DNA microarray, High Dimension Data, Permutation Test, Feature Selection, Random Sequence