Research On The Models And Methods Of Protein Secondary Structure Prediction

Posted on: 2005-05-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y X Wang
GTID: 1100360152957215
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the post-genome era, proteomics is becoming an important research domain in the life sciences. Protein secondary structure prediction (PSSP), a long-standing task, remains a challenge in proteomics research. Any breakthrough in this area will help us better understand the folding mechanism and function of proteins, and will also support related industries such as biomedical engineering and agricultural biotechnology. To address this need, we study PSSP using bioinformatics methods. The main contributions of this thesis are summarized as follows.

(1) An artificial neural network (ANN) is in essence a "black-box" model: the classification or prediction process is embedded in both the network structure and the weights assigned to the links between nodes, so domain knowledge is difficult to extract from it. A fuzzy system, on the other hand, works from a set of inference rules that are more comprehensible; however, the difficulty of training limits its range of applications. We combine the neural network with a fuzzy inference system and introduce an adaptive fuzzy-neural-network (AFNN) hybrid system. Tested on the PSSP problem, the new model provides highly comprehensible fuzzy inference rules, which can readily be used to mine further domain knowledge about protein structure.

(2) In modelling PSSP with AFNN, as our main contribution, we develop solutions for reducing the dimension of the input variables and of the rules, respectively. Specifically, a strategy combining the fuzzy c-clustering (FCC) method and principal component analysis (PCA) is employed in the input selection subtask, and a heuristic strategy together with a genetic algorithm (GA) is used in the rule reduction subtask.
These strategies decrease the complexity of the AFNN model and reduce the number of rules to a reasonable level.

(3) A careful analysis reveals a basic limitation of the standard hidden Markov model (HMM) in the PSSP problem: the conformation and function of a fragment in a protein sequence may depend strongly on events located both upstream and downstream, rather than on one side only as in the standard HMM. As another main contribution of the dissertation, we propose a new model that combines features of the bidirectional HMM and the recurrent neural network. Experimental results indicate that the new model achieves a Q3 accuracy of 78.1% on the three-state PSSP problem, 2 to 3 percent higher than other existing methods.

(4) When modelling PSSP, dividing a very large categorical data set into disjoint, homogeneous subsets is a fundamental operation and a preprocessing step for some specific tasks. We present a parallel method for clustering categorical data in distributed-memory environments. Tested on amino acid data sets on up to 8 nodes, the proposed algorithm demonstrates very good relative speedup and scaleup with respect to data set size.

(5) By analyzing the redundancy in protein structure databases, we find that the secondary structure of a protein is not uniquely and completely determined by its sequence; rather, some variability in secondary structure class is observed at certain sites in the sequence. We also study further evidence concerning the bias and frequency of amino acids at sites with high variability. On the basis of these facts, we infer that variability in secondary structure may be one of the main reasons why the accuracy of PSSP has been so hard to improve.

(6) Which metrics should be used to assess the performance of a PSSP algorithm is an important problem.
We provide a unified overview of the widely used assessment metrics and briefly discuss the advantages and disadvantages of each. Using a probabilistic neural network, we analyze the effect of these metrics on modelling and indicate some applicable areas and working principles for them...
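
As an illustration of the Q3 measure cited in contribution (3), the sketch below computes three-state accuracy as the fraction of residues whose predicted state (H = helix, E = strand, C = coil) matches the observed state. The function names and the toy sequences are hypothetical and chosen only for illustration; they are not taken from the dissertation, and this is not necessarily how the author computed the reported figures.

```python
from collections import Counter

STATES = ("H", "E", "C")  # helix, strand, coil (the usual three-state alphabet)


def q3_accuracy(observed: str, predicted: str) -> float:
    """Fraction of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def per_state_accuracy(observed: str, predicted: str) -> dict:
    """Accuracy computed separately for each observed state that occurs."""
    totals = Counter(observed)
    hits = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {s: hits[s] / totals[s] for s in STATES if totals[s]}


# Toy data (hypothetical): 10 residues, 8 predicted correctly.
observed = "HHHHEEEECC"
predicted = "HHHCEEECCC"

print(q3_accuracy(observed, predicted))         # 0.8
print(per_state_accuracy(observed, predicted))  # {'H': 0.75, 'E': 0.75, 'C': 1.0}
```

Reporting per-state accuracy alongside Q3 shows one reason a single metric can mislead: an overall score may hide weak performance on one class, which is the kind of trade-off contribution (6) examines.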
Keywords/Search Tags: protein structure prediction, bioinformatics, adaptive fuzzy-neural-network, bidirectional hidden Markov model, performance measure