
Sparse Autoencoder-based Feature Engineering Algorithm And Its Applications In Bio-OMIC Data

Posted on: 2022-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhang
GTID: 2480306329974459
Subject: Computer application technology
Abstract/Summary:
Cancer is among the most complex diseases: its types are diverse, and each type has distinct molecular characteristics. Better diagnosis and treatment require a deeper understanding of the disease, and advances in science and technology have made it possible for researchers to obtain cancer genome information. The TCGA (The Cancer Genome Atlas) database lets researchers use genomic analysis techniques to study the alterations that cancer causes at the genetic level, and then experiment on, analyze, and interpret those alterations in order to diagnose and treat cancer more effectively. TCGA records the bio-omics data of cancer patients from multiple angles. Among these data types, DNA methylation sequencing is particularly important, because the methylation information it measures carries a great deal of gene-level information. Studying it can help control gene expression and prevent and control disease.

As with most bio-omics data, the DNA methylation dataset in TCGA contains more than 480,000 features per patient. Because the sample size is small relative to the number of features, a classification model cannot be applied directly to the original dataset. Traditional machine learning mainly relies on feature selection algorithms, which screen the original features and pick the best feature subset for experiments. To study the internal relationships between features, this research proposes a feature engineering algorithm based on a sparse autoencoder to construct features from DNA methylation datasets, and conjectures that the constructed features will have better predictive performance than the original ones.

A sparse autoencoder is an unsupervised machine learning model. It is trained iteratively by computing the error between the model's input and its output, and the intermediate (hidden-layer) variables of the trained model are then used as constructed variables for information compression and other tasks. Sparse autoencoders have been applied to image recognition, speech recognition, fault diagnosis, and recommendation systems, where they have performed well at extracting features from data.

A comprehensive evaluation was carried out on 3494 methylation samples from six cancer types in TCGA. First, constructed features were obtained from the sparse autoencoder. These features were then ranked with feature selection algorithms and finally classified using strategies such as cross-validation and incremental feature selection. The classification results were analyzed and compared with those obtained from the original features. The study also analyzed the intergroup variability of the constructed features and compared the classification performance of constructed features of the same rank, or even of lower rank, with that of the original features.

In most of the modeling experiments conducted in this study, the results support the effectiveness of the sparse autoencoder-based feature engineering algorithm from several angles: the constructed features outperform the original features. Because the sparse autoencoder compresses the features well, the constructed features can reflect information hidden in the data. Similar improvements were obtained when the model was applied to the early diagnosis of thyroid cancer and to the classification of engineering data from non-bioinformatics fields.
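The feature-construction step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's architecture, sparsity penalty, or hyperparameters, so everything concrete here is an assumption: a single sigmoid hidden layer, a linear decoder, an L1 penalty on the hidden activations (a KL-divergence penalty is another common choice), plain gradient descent, and a small synthetic matrix with values in (0, 1) standing in for methylation data. The names `train_sparse_autoencoder`, `encode`, and `X_demo` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_sparse_autoencoder(X, n_hidden=4, l1=1e-3, lr=0.1, epochs=300, seed=0):
    """Train a one-hidden-layer sparse autoencoder with full-batch gradient descent.

    Loss (averaged over samples): 0.5 * ||X_hat - X||^2 + l1 * sum(hidden activations).
    Returns an `encode` function and the list of per-epoch losses.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []

    for _ in range(epochs):
        # Forward pass: sigmoid encoder, linear decoder.
        H = sigmoid(X @ W1 + b1)
        X_hat = H @ W2 + b2
        err = X_hat - X

        # Sigmoid outputs are positive, so the L1 penalty is just their sum.
        loss = 0.5 * np.sum(err ** 2) / n + l1 * np.sum(H) / n
        losses.append(loss)

        # Backward pass.
        d_out = err / n
        dW2 = H.T @ d_out
        db2 = d_out.sum(axis=0)
        dH = d_out @ W2.T + l1 / n
        dZ = dH * H * (1.0 - H)
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)

        # Gradient descent update (in place, so `encode` sees the final weights).
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    def encode(X_new):
        """Map samples to the hidden layer: these are the constructed features."""
        return sigmoid(X_new @ W1 + b1)

    return encode, losses


# Synthetic stand-in data (NOT TCGA): 60 samples, 20 features driven by 3 latent factors.
_rng = np.random.default_rng(0)
X_demo = sigmoid(_rng.normal(size=(60, 3)) @ _rng.normal(size=(3, 20)))

encode, losses = train_sparse_autoencoder(X_demo)
codes = encode(X_demo)  # constructed features, shape (60, 4)
```

In a pipeline like the one the abstract outlines, `codes` would take the place of the original feature matrix in the later ranking, incremental feature selection, and cross-validated classification steps. Those steps are not shown here.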
Keywords/Search Tags: sparse autoencoder, feature engineering, DNA methylation, feature construction, feature selection