Predicting DNA-Binding Sites Based on Comprehensive Feature Representation

Posted on: 2022-05-02, Degree: Master, Type: Thesis
Country: China, Candidate: J W Ni, Full Text: PDF
GTID: 2480306728460634, Subject: Applied Statistics
Abstract/Summary:
Protein-DNA interactions play an important role in the life processes of organisms. DNA-binding sites are the amino acid residues through which a protein binds specifically to DNA, and they are key to understanding the interaction between the two. Computational methods that accurately identify DNA-binding sites from protein sequence are an active research direction. The main content of this thesis is as follows:

(1) A novel comprehensive feature representation of DNA-binding sites is proposed. For any protein sequence, 1431-dimensional features are extracted using sliding windows and statistical indices to distinguish DNA-binding sites from non-DNA-binding sites. The features include the position-specific scoring matrix, hidden Markov model profile, charge and polarity, relative solvent-accessible area, disorder score, secondary structure, and physicochemical properties.

(2) A two-layer feature selection method is designed to obtain the optimal feature subset. The first layer uses the Chi-Square test, Information Gain, and the Minimal Redundancy Maximal Relevance algorithm to remove irrelevant features. In the second layer, the random forest algorithm is used to compute feature importance and select the optimal feature subset for identifying DNA-binding sites.

(3) A prediction model for DNA-binding sites is established. To reduce the class imbalance in the data, a new dataset is built by random undersampling. A support vector machine is then used to improve prediction accuracy.

(4) Empirical analysis. The prediction model is built on the benchmark dataset YK17-DNA (#Training), and the features affecting protein-DNA interaction are analyzed. The model's prediction performance is compared with six other methods (BindN+, DBS-PSSM, DRNApred, COACH-D, SVMnuc, and NucBind) on the benchmark dataset YK17-DNA (#Test) and the independent dataset MW15-DNA. The AUC of our model is 0.010-0.130 and 0.007-0.124 higher than that of these methods on YK17-DNA (#Test) and MW15-DNA, respectively. To further demonstrate the effectiveness of our method, we compare its prediction performance with three other recent methods whose web servers are available. On the independent dataset MW15-DNA, our model improves sensitivity and F1 by 0.1235-0.2974 and 0.013-0.195, respectively.

In summary, the proposed computational method can accurately identify DNA-binding sites.
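Two of the steps described in the abstract, sliding-window feature extraction and random undersampling, can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the zero padding at the sequence ends, the window half-width, the 1:1 class ratio, and the toy two-column profile are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random
from typing import List, Sequence, Tuple


def window_features(per_residue: Sequence[Sequence[float]],
                    half_width: int) -> List[List[float]]:
    """Concatenate per-residue feature rows over a window centred on each residue.

    Positions that fall outside the sequence are zero-padded (an assumption).
    """
    n_feat = len(per_residue[0])
    pad = [0.0] * n_feat
    out = []
    for i in range(len(per_residue)):
        row: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        out.append(row)
    return out


def random_undersample(X: Sequence[Sequence[float]],
                       y: Sequence[int],
                       seed: int = 0) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep every label-1 sample and an equally sized random subset of label-0 samples.

    Assumes binding sites (label 1) are the minority class.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep_neg = rng.sample(neg, min(len(pos), len(neg)))
    keep = sorted(pos + keep_neg)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy example: 5 residues, 2 hypothetical per-residue features each.
profile = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0], [0.4, 0.0], [0.5, 0.0]]
labels = [0, 1, 0, 0, 0]  # 1 marks a (hypothetical) DNA-binding residue

X = window_features(profile, half_width=1)   # 3 positions x 2 features = 6 columns
X_bal, y_bal = random_undersample(X, labels, seed=0)
```

With a window of three residues and two features per residue, each row of `X` has six columns. The balanced set produced by `random_undersample` could then be passed to any classifier, such as a support vector machine.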
Keywords/Search Tags:DNA-binding sites, Comprehensive features, Two-layer, Random undersampling, Support vector machine