Research On SSBs And DSBs In Feature Extraction And Classification

Posted on: 2015-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Wan
GTID: 1310330467482988
Subject: Computer software and theory

Abstract/Summary:
With the rapid growth of biological data, mining knowledge from vast amounts of research data has become a challenging and interesting task. The field that addresses it, bioinformatics, integrates mathematics, computer science and biology. Proteins play important roles in genetic processes by interacting with DNA molecules, for example in DNA replication, transcription, translation, the cold shock response, repair and recombination. However, the mechanisms of protein-DNA recognition remain unclear. With the progress of large-scale sequencing and structural genomics projects, more and more DNA-binding protein structures and sequences are now available, which makes it possible to investigate the functions of DNA-binding proteins. Computational methods can also help to refine the annotations of DNA-binding proteins automatically, which helps to narrow the gap between the growing volume of data and the still unclear protein functions and improves understanding of the protein-DNA binding mechanism.
DNA exists in two forms, single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). Accordingly, DNA-binding proteins are usually divided into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs bind to ssDNA with high affinity and low specificity and are mainly involved in DNA replication, recombination and repair. DSBs, in contrast, bind to particular dsDNA sequences to modulate transcription, cleave DNA molecules, or take part in chromosome packaging and transcription in the cell nucleus. Although some studies have focused on SSBs or DSBs separately, little attention has been paid to what gives SSBs and DSBs such different binding specificities.
Several recent studies have discussed the biological mechanisms of SSBs and DSBs from the viewpoint of molecular biology, covering structure, evolution and biophysical characterization. However, to the best of our knowledge, no previous work has used bioinformatics methods to analyse the discriminative characteristics, binding specificity and binding mechanisms of SSBs and DSBs. In this thesis, mathematical models are constructed by integrating computational geometry and data mining methods; the models cover feature extraction and selection and the optimization of the classification algorithm. The study is divided into four key steps. First, the datasets are constructed: starting from theoretical evidence, the data are collected, analysed and cleaned, yielding representative datasets with biological and statistical significance. Second, features are extracted from the complex three-dimensional structures and sequences, that is, the biological information is transformed into numerical signals; this is the key point of the work. Third, classification algorithms are designed to suit the features so that the objects can be classified. Finally, an evaluation system, including test methods, performance measures and the selection of evaluation indices, is vitally important for improving classification quality. The work is outlined as follows.
Based on global structural features, we investigated the differences between DSBs and SSBs in surface tunnels and in OB-fold domain information.
We detected the largest clefts on the protein surfaces and derived several features for distinguishing the potential interfaces of SSBs and DSBs. We also compared each protein structure with six OB-fold protein templates and used the maximal alignment score (TM-score) as the OB-fold feature of the protein. On this basis, we constructed a support vector machine (SVM) classification model to distinguish the two kinds of proteins automatically, and obtained satisfactory results.
Based on local structural features, we performed a feature-based analysis and constructed classification models for the binding residues in the interfaces of SSBs and DSBs. In these models, residue charge distribution, secondary structure, spatial distance, residue spatial structure and interface environment were used as candidate features, and the discrete wavelet transform (DWT) was then employed to extract detail information, which was verified to characterize the interface surfaces of protein-ssDNA and protein-dsDNA complexes effectively. We employed the SVM and an improved random forest to construct the discriminant function, and the features were verified by leave-one-out validation with the different classifiers on the datasets. We believe that these features may reveal the underlying differences between SSBs and DSBs, facilitate future discovery of the biological mechanisms, and deepen our understanding of the specificity of proteins that bind to ssDNA or dsDNA.
Based on the analysis of sequence data, we obtained four features: overall amino acid composition (OAAC), dipeptide composition (DPC), physicochemical properties and PSSM profiles. A feature transformation method (split amino acid, SAA) was adopted to obtain numeric feature vectors. In ten-fold cross-validation on the dataset, the proposed approach obtained satisfactory results.
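As a minimal illustration of the two simplest sequence features named above, OAAC and DPC, the following Python sketch computes them under commonly used definitions: the fraction of each of the 20 standard amino acids, and the fraction of each of the 400 ordered residue pairs. These definitions, the function names `oaac` and `dpc`, and the toy sequence are assumptions made for illustration only; the exact feature definitions and the SAA transformation used in the thesis are not given in this abstract.

```python
from itertools import product

# The 20 standard amino acids in a fixed order (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in the same fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def _clean(sequence):
    """Upper-case the sequence and drop any non-standard residue symbols."""
    return "".join(aa for aa in sequence.upper() if aa in AMINO_ACIDS)


def oaac(sequence):
    """Overall amino acid composition: a 20-dimensional frequency vector."""
    seq = _clean(sequence)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: a 400-dimensional frequency vector.

    Note: non-standard symbols are removed first, so residues on either
    side of a removed symbol are treated as adjacent in this sketch.
    """
    seq = _clean(sequence)
    total = len(seq) - 1  # number of overlapping residue pairs
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[pair] / total for pair in DIPEPTIDES]


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, not from the thesis
    feature_vector = oaac(toy_sequence) + dpc(toy_sequence)
    print(len(feature_vector))  # 20 + 400 = 420
```

Concatenating the two vectors gives a fixed-length numeric representation of a sequence of any length, which is the form of input that classifiers such as an SVM or a random forest expect.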
The features also helped us to understand the binding specificity better.
In summary, we applied data mining techniques to DNA-binding proteins, investigated several related issues and proposed new methods to address them. The experimental results show that the proposed methods are effective for these problems. The research may therefore also benefit the prediction of DNA-binding proteins and the study of their functions.
Keywords/Search Tags: Single-stranded DNA-binding proteins (SSBs), Double-stranded DNA-binding proteins (DSBs), Structure feature, Sequence feature, Interface feature