
Research Of Dimensionality Reduction Algorithm Based On Embedded Sparse Feature Selection Strategy

Posted on: 2019-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: X W Ren
Full Text: PDF
GTID: 2370330569978611
Subject: Statistics
Abstract/Summary:
The development of high-throughput biotechnology has produced a large amount of high-dimensional, small-sample biological data. In biomedical "big data", feature selection is one of the effective ways to address the curse of dimensionality, and it has been widely used in gene screening, genetic loci analysis, and other specific problems. This paper first introduces feature selection algorithms and four commonly used classifier models, and then discusses dimensionality reduction algorithms based on an embedded feature selection strategy from two aspects. The first combines resampling with an embedded feature selection algorithm to construct a new feature selection and ranking algorithm, which can be used to select key pathogenic loci from data encoded under the allele additive effect and heterozygous effect models. The second builds a feature selection algorithm by combining a feature ranking algorithm with a fixed, user-defined feature subset algorithm.

For the problem of genetic loci analysis in biogenetics, this paper presents a Logistic regression model based on Lasso penalty estimation and a Lasso penalty regression algorithm based on resampling. In the experimental analysis of genetic loci coding data for a disease, the selected feature sets are compared, along with the classification performance of the four classifiers under 5-fold cross validation. The four classifiers are also applied to the top 30 loci in a 5-fold cross validation in which the number of features is increased gradually. The highest disease classification accuracy, 68.13%, is reached with as few as 27 loci. Finally, the data under the two encodings (additive effect and heterozygous effect) are analyzed, and the biological significance of the selected features is examined in the GWAS research database GWAS Central. The selected loci have been reported to be closely related to a variety of complex genetic diseases such as tumors, hypertension, and obesity, which further supports the reliability of the results.

Feature ranking algorithms applied to gene expression data are prone to selecting redundant features. To address this, a feature selection algorithm is proposed that combines t-test ranking with the SubLasso algorithm. Compared with 3 common feature ranking algorithms using the same number of top-ranked features, the new method performs better in classification on 15 common gene expression data sets. The selected features show excellent classification performance, and the classification results are robust across different classifiers. The new algorithm fixes the features selected by t-test ranking as predefined features, and can also select features that are ranked lower by the ranking algorithm but are significantly related to the response variable.
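The idea of fixing t-test-ranked features and letting a Lasso penalty choose among the rest can be illustrated with a minimal sketch. This is not the SubLasso implementation from the thesis: the function names (`rank_by_t`, `l1_logistic_with_fixed`, `make_demo_data`), the synthetic data, the proximal-gradient solver, and all parameter values are assumptions made for illustration, using only the Python standard library.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_by_t(X, y):
    """Feature indices sorted by decreasing |t| between class 1 and class 0."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(pos, neg)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def l1_logistic_with_fixed(X, y, fixed, lam=0.1, step=0.1, n_iter=300):
    """L1-penalised logistic regression fitted by proximal gradient descent.

    Features whose index is in `fixed` are left unpenalised, so they always
    stay in the model; all other weights are soft-thresholded each iteration.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = sigmoid(z) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        b -= step * grad_b
        for j in range(p):
            updated = w[j] - step * grad_w[j]
            w[j] = updated if j in fixed else soft_threshold(updated, step * lam)
    return w, b


def make_demo_data(seed=0, n=80, p=8):
    """Synthetic data (assumption): features 0 and 1 carry signal, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        row[0] += 3.0 * label
        row[1] += 1.5 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_demo_data()
ranking = rank_by_t(X, y)
fixed = set(ranking[:1])  # fix the top-ranked feature as predefined
weights, intercept = l1_logistic_with_fixed(X, y, fixed)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("t-test ranking:", ranking)
print("features kept by the penalised fit:", selected)
```

Leaving the fixed features unpenalised is one simple way to guarantee they stay in the model while the penalty decides which of the remaining, lower-ranked features still add information; the thesis may handle this differently.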
Keywords/Search Tags:Embedded Feature Selection, Higher-order dimension reduction, Classification, Biological Big Data