A Feature Selection Algorithm For Biological Data Based On Dynamic Iterative Spectral Clustering

Posted on: 2022-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: T F Ma
Full Text: PDF
GTID: 2480306332452514
Subject: Software engineering
Abstract/Summary:
With the growth of medical big data and the rapid progress of gene sequencing technologies in bioinformatics, large-scale gene expression data can now be acquired automatically, and the variety and volume of medical data are growing at an unprecedented rate. However, disease-related gene expression profiles remain hard to collect: patient data are difficult to record and gather, and their quality is uneven, so the number of samples available for a given disease is usually small. At the same time, the human body has more than 39,000 genes; gene expression profiles often carry important information about the causes of disease, but they also contain a large number of redundant features. For these two reasons, the gene expression profile data used in this thesis have a very large number of features and a sample size that is often much smaller than the number of features. Feature selection is therefore the most important step in analyzing such large-p, small-n biological data.

Feature selection chooses the most informative features from the original feature set to form a feature subset. We use it to screen out the optimal feature subset that is highly relevant to the classification task, so as to improve classification accuracy on gene expression data. In gene expression profiles, disease-associated genes differ markedly from the genes of normal samples, so machine learning classification can be used to detect and predict disease genes. Bioinformatics holds that functionally similar genes tend to work together and can be treated as a whole. Biomarkers are themselves correlated, and genes jointly constitute multiple functional subsystems that together have an important influence on the state of the organism, with biomarkers playing a crucial role. These genes are also phenotypically related in the gene expression profile, and such similar genes are biomarkers. If disease-related biomarkers can be mined from human disease genes, medical science can better understand, study, and treat the disease, to the benefit of patients. Selecting a subset of characteristic genes with high classification ability is thus the key to processing biomedical data.

To address these problems, and building on bioinformatics, this thesis proposes BioDynClu, a feature selection algorithm based on dynamic iterative clustering and unsupervised learning, to mine biomarkers, with the aim of improving prediction accuracy and reducing the loss of useful information in the genetic features. Spectral clustering is used because it handles sparse data more effectively. After the first round of clustering, the result for each cluster is obtained and the centroid of each cluster is computed. The CH (Calinski-Harabasz) clustering index is used to evaluate the clustering results, and the optimal clustering is re-screened until performance stabilizes. The best-performing feature subset is then selected.

Experiments on 16 gene expression data sets show that, compared with other algorithms, the proposed algorithm achieves better classification and prediction performance on most of the data sets, with fewer selected features and better stability. BioDynClu is able to select a superior subset of genetic features. It also performed well in a separate test on colon cancer data, completing feature selection on the gene expression profile data satisfactorily. Finally, the proposed algorithm for biological genetic data can be applied to and improved on other biological data sets in the future to advance biomedical classification.
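
The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the BioDynClu implementation, whose details the abstract does not give. It assumes scikit-learn is available, that "CH index" means the Calinski-Harabasz score, that gene similarity is measured by a rescaled Pearson correlation, and that the gene nearest each cluster centroid is kept as that cluster's representative. The function name `select_representative_genes` and the parameter `k_candidates` are hypothetical. The "dynamic iterative" re-screening is simplified to a scan over candidate cluster counts.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import calinski_harabasz_score


def select_representative_genes(X, k_candidates=(2, 3, 4, 5), seed=0):
    """Pick one representative gene per spectral cluster.

    X has shape (n_samples, n_genes). Returns (selected gene indices,
    chosen number of clusters, Calinski-Harabasz score of that clustering).
    """
    genes = X.T  # cluster genes, so each row is one gene's expression profile

    # Assumption: affinity is the Pearson correlation rescaled to [0, 1].
    affinity = (np.corrcoef(genes) + 1.0) / 2.0

    best = None  # (score, k, labels) of the best clustering seen so far
    for k in k_candidates:
        labels = SpectralClustering(
            n_clusters=k, affinity="precomputed", random_state=seed
        ).fit_predict(affinity)
        if len(np.unique(labels)) < 2:
            continue  # the CH score needs at least two clusters
        score = calinski_harabasz_score(genes, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)

    if best is None:
        raise ValueError("no candidate k produced a valid clustering")
    score, _, labels = best

    # Keep the gene closest (Euclidean) to each cluster centroid.
    selected = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centroid = genes[members].mean(axis=0)
        dists = np.linalg.norm(genes[members] - centroid, axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected), len(selected), float(score)


if __name__ == "__main__":
    # Synthetic data: 40 samples, 3 latent factors, 10 noisy genes per factor.
    rng = np.random.default_rng(0)
    factors = rng.normal(size=(40, 3))
    X = np.hstack(
        [factors[:, [j]] + 0.1 * rng.normal(size=(40, 10)) for j in range(3)]
    )
    chosen, k, ch = select_representative_genes(X)
    print("clusters:", k, "selected genes:", chosen, "CH score:", round(ch, 2))
```

Keeping a single gene per cluster reflects the abstract's premise that functionally similar genes are redundant with one another. A fuller version would feed the selected subset to a classifier and repeat the clustering until classification performance stabilizes.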
Keywords/Search Tags: gene expression profile, biomarker, feature selection, spectral clustering, dynamic clustering