
Application Of Diversity Increment Feature Selection Technique

Posted on: 2019-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: S S Hu
GTID: 2370330563997684
Subject: Statistics
Abstract/Summary:
Using machine learning algorithms to classify and identify target sequences or target sites in genomes and proteins is one of the main research topics in bioinformatics. As research problems grow more complex, such tasks often involve small sample sizes and high-dimensional features. In classification, high-dimensional features tend to over-fit the samples, which reduces generalization ability and produces anomalous accuracy values. Feature selection techniques have therefore attracted increasing attention in data analysis and feature optimization, because they can extract the essential characteristics of the research objects and improve the recognition accuracy of models. The core of feature selection is to find an optimal subset of features from the entire feature set while keeping the loss of recognition accuracy to a minimum. The features in the selected subset should have two basic properties: first, high correlation between each feature and the category; second, low redundancy among the features. In recent years, feature selection has become one of the most active research topics in machine learning.

While studying the recognition of protein flexible sites, we proposed a new feature selection technique based on the increment of diversity (FSID). To further test and improve the FSID method, we applied it to two active problems, one at the genome level and one at the protein level: identification of nucleosome positioning sequences in the genome and identification of protein phosphorylation sites. The main conclusions are as follows.

Firstly, using nucleosome positioning sequences in the yeast genome as samples and the 6-mer composition of the DNA sequences as features, we applied FSID to select eight 6-mers as classification features. With the support vector machine algorithm, the overall accuracy in 10-fold cross-validation is 98.2%. The results show that the specific distribution of k-mer components in nucleosomal and linker sequences is the main factor affecting nucleosome positioning in yeast. The FSID method greatly reduces the number of features needed to classify yeast nucleosomal and linker sequences, and greatly improves the robustness and generalization ability of the model.

Secondly, using protein phosphorylation sites as samples, we present a kinase-independent phosphorylation site identification model called FSID_PhSite. The model uses as features the composition of k-spaced amino acid pairs and the positional conservation of residues surrounding the phosphorylation sites. FSID is applied to select features, and the selected features are input into the support vector machine algorithm for recognition. With a 1:1 ratio of positive to negative samples, the identification accuracies on an independent test dataset for serine, threonine and tyrosine sites are 84.34%, 82.32% and 68.89%, respectively. These results are superior to those of existing kinase-independent phosphorylation site identification models.

Thirdly, based on the Heterogeneity Index (HI), we studied the periodicity of nucleosome positioning sequences in the yeast genome. According to previous periodicity studies, most of which were based on Fourier analysis, nucleosome positioning sequences are known to carry a 10-bp periodic signal of the dinucleotides AA/TT/TA. In this work, we calculate the HI value of the sequences after the bases are reduced to the W-S alphabet. The results show that the 10-bp dinucleotide periodic signal of nucleosome positioning sequences in the yeast genome is weak, whereas the 3-bp periodicity is very strong. Further analysis shows that about 70% of yeast genomic nucleosomes are located in coding sequences, which explains the strong 3-bp periodicity of the yeast nucleosome positioning sequences.
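The abstract names the 6-mer composition and the increment of diversity but does not say how either is computed. The sketch below is one possible reading, assuming the commonly used definitions D(X) = N ln N - sum(n_i ln n_i) for the diversity of a count vector and ID(X, Y) = D(X + Y) - D(X) - D(Y) for the increment of diversity between two sources. The function names are hypothetical, and this is not the FSID selection procedure itself, which the abstract does not describe.

```python
from itertools import product
from math import log


def kmer_counts(seq, k=6):
    """Count overlapping k-mers of a DNA sequence over the alphabet ACGT.

    Returns a dict with one entry per possible k-mer (4**k entries), so that
    two sequences always yield count vectors of the same length and order.
    Windows containing other symbols (e.g. N) are skipped.
    """
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts


def diversity(counts):
    """Diversity measure D(X) = N ln N - sum(n_i ln n_i) (assumed definition)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * log(total) - sum(c * log(c) for c in counts if c > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two equal-length count vectors.

    The value is 0 when the two count distributions are proportional and
    grows as the distributions become more dissimilar.
    """
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not data from the thesis.
    a = list(kmer_counts("ACGTACGTACGT", k=2).values())
    b = list(kmer_counts("TTTTTTAAAAAA", k=2).values())
    print(increment_of_diversity(a, a))  # similar sources give a small value
    print(increment_of_diversity(a, b))  # dissimilar sources give a larger value
```

In a feature selection setting, a measure like this could be used to score how differently a feature is distributed in the positive and negative classes, but how FSID ranks and combines features is left to the full text.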
Keywords/Search Tags: Nucleosome positioning sequence, Protein phosphorylation site, Feature selection, Increment of diversity, Support vector machine