Research And Optimization Of Ensemble Feature Selection Algorithm For High-dimensionality And Small Sample Biological Data

Posted on: 2022-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Sheng
GTID: 2480306761959439
Subject: Computer Software and Application of Computer
Abstract/Summary:
In the era of medical big data, cancer remains one of the most serious diseases. It imposes a heavy burden on society, yet its pathological causes are still unknown. Fortunately, a large amount of gene expression data has been accumulated. However, these data are generally high-dimensional with small sample sizes. High dimensionality not only increases computational complexity but also degrades the performance of machine learning algorithms, a problem known as the curse of dimensionality. When the number of samples is too small, models are prone to overfitting and the sample distribution tends to be unbalanced. Studies have shown that only a subset of genes in the genome plays a key role in the overall expression level. Therefore, for gene expression data with high dimensionality and small sample size, effective and well-founded dimensionality reduction is an important way to address these problems.

In response to these problems, this research proposes the FVR algorithm: a Filter algorithm based on Voting and Recursive feature elimination strategy. First, a pan-cancer dataset containing multiple cancer types was constructed, and the performance of the FVR algorithm was tested on it. Compared with related research, the FVR algorithm improved accuracy by 3.53% and the F1 score by 3.91%, showing that the FVR framework can effectively filter out key features. Moreover, the FVR algorithm was parallelized, increasing its running speed by 20%.

In addition, a pan-cancer risk model was constructed based on the FVR algorithm, and the pan-cancer high-risk gene subset containing 39 genes was evaluated in this model. The model was then tested across datasets and on independent datasets to verify its generalization. Afterwards, the genes in the pan-cancer high-incidence gene subset were evaluated and analyzed to verify their biological functions.

Finally, following the implementation framework of the FVR algorithm, an ensemble feature selection toolkit, Hi Feature, was developed. As part of the pan-cancer risk model, Hi Feature includes a variety of feature selection algorithms suitable for high-dimensional, small-sample biological data. Its Hybrid Filter Strategy module and Combined Voting Strategy module support the study of more efficient ensemble feature selection algorithms, and its parallelization module helps improve running speed. Hi Feature extends the functions of the pan-cancer risk module and improves its scalability.

In summary, this research proposes a novel ensemble feature selection algorithm, FVR, suited to high-dimensional, small-sample biological data. Its core idea is to integrate a variety of strategies, and it achieved good results in application. The parallel-optimized PFVR algorithm greatly improves running speed while preserving performance. Experiments show that the pan-cancer risk model built on the FVR algorithm generalizes and is valid to a certain degree, and most of the genes in the pan-cancer high-incidence gene subset have been confirmed to be closely related to the development of cancer. The Hi Feature toolkit, based on the FVR algorithm, helps advance research on ensemble feature selection algorithms and further improves the pan-cancer risk model. This work provides a feasible and novel research framework for integrating ensemble feature selection algorithms with cancer genomics.
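
The abstract names the two ingredients of FVR (voting across filter methods, then recursive feature elimination) but gives no implementation details. The sketch below is only an illustration of that general pattern, not the thesis's actual algorithm. The three filter scores, the function names, the vote threshold, the ridge-based ranking model, and the synthetic data are all assumptions chosen for brevity.

```python
import numpy as np


def filter_scores(X, y):
    """Three simple per-feature filter scores (illustrative stand-ins)."""
    a, b = X[y == 0], X[y == 1]
    gap = np.abs(a.mean(0) - b.mean(0))
    eps = 1e-12
    # Welch-style t statistic magnitude.
    welch_t = gap / np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b) + eps)
    # Signal-to-noise ratio: mean gap over the sum of class standard deviations.
    snr = gap / (a.std(0, ddof=1) + b.std(0, ddof=1) + eps)
    # Rank-based score: gap between the mean within-feature ranks of the classes.
    ranks = X.argsort(0).argsort(0)
    rank_gap = np.abs(ranks[y == 0].mean(0) - ranks[y == 1].mean(0))
    return [welch_t, snr, rank_gap]


def vote_select(X, y, top_k, min_votes=2):
    """Each filter votes for its top_k features; keep those with enough votes."""
    votes = np.zeros(X.shape[1], dtype=int)
    for scores in filter_scores(X, y):
        votes[np.argsort(scores)[-top_k:]] += 1
    return np.flatnonzero(votes >= min_votes)


def recursive_elimination(X, y, candidates, n_final, ridge=1.0):
    """Repeatedly fit a ridge model and drop the feature with the smallest weight."""
    keep = list(candidates)
    target = 2.0 * y - 1.0  # encode the two classes as -1 / +1
    while len(keep) > n_final:
        Z = X[:, keep]
        Z = (Z - Z.mean(0)) / (Z.std(0) + 1e-12)
        w = np.linalg.solve(Z.T @ Z + ridge * np.eye(len(keep)), Z.T @ target)
        keep.pop(int(np.argmin(np.abs(w))))
    return np.array(keep)


# Synthetic "high-dimensional, small-sample" data: 40 samples, 500 features,
# of which only the first 5 carry class signal (an assumption for the demo).
rng = np.random.default_rng(0)
n, p, informative = 40, 500, 5
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :informative] += 2.0 * y[:, None]

candidates = vote_select(X, y, top_k=30)
selected = recursive_elimination(X, y, candidates, n_final=5)
print("candidates after voting:", len(candidates))
print("selected features:", sorted(selected.tolist()))
```

The voting stage is cheap and cuts the feature space before the more expensive recursive stage runs, which is one plausible reason to combine the two. Because the filter scores are computed independently of one another, that stage is also a natural place for parallelization.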
Keywords/Search Tags: High Dimensionality and Small Sample Biological Data, Gene Expression Data, Feature Selection, Ensemble Method, Parallel Computing