
Construction Of Decoy Sequence Database Based On Deep Learning

Posted on: 2020-06-19; Degree: Master; Type: Thesis
Country: China; Candidate: X L Zeng; Full Text: PDF
GTID: 2370330590471695; Subject: Computer Science and Technology
Abstract/Summary:
Identifying protein sequences by computational means is one of the most fundamental and important tasks in proteomics research, and tandem mass spectrometry combined with database search algorithms is currently the mainstream high-throughput method for protein identification. However, the reliability of the identifications is limited when mass spectrometry data are searched directly against the protein sequence database. Quality control methods based on the target-decoy sequence search strategy can estimate the false positive rate of the identifications, which effectively compensates for the limitations of the theoretical database search algorithm. In these quality control methods, the quality of the decoy sequence database is important for the reliability of the results. There are generally three methods for constructing a decoy sequence database: the reversed, randomized, and shuffled models. All of them have limitations: the reversed model cannot simulate the randomness of actual data, and the number of peptides produced by the randomized and shuffled models is not completely consistent with that of the target sequence database. At the same time, with the continuing development of the Human Proteome Project, a huge number of proteomics datasets have accumulated, which makes it possible to build a high-quality decoy database using data-driven methods such as deep learning.

Against this background, and in order to improve the performance of protein sequence identification, this study introduces sequence modeling methods from deep learning into the decoy database construction process. Constructing a decoy sequence means following certain rules of the target sequences to generate sequences that do not exist in the real biological world; in essence, it generates one sequence from another. The encoding-decoding strategy in deep learning is therefore applied to this sequence-to-sequence generation problem, and a new decoy database construction method is developed. First, the protein amino acid residue sequences are embedded with a Word2Vec model and taken as input to the neural network. Then, a Bidirectional Long Short-Term Memory (Bi-LSTM) neural network is used in encoding to handle long-term dependencies, and an LSTM neural network is used in decoding. Within the encoding-decoding framework, a local attention mechanism is introduced to reduce the computational complexity of the network model and to focus on key information.

To validate the effectiveness of the proposed method, two published mass spectrometry datasets, generated from adult liver tissue and from mouse cochlear sensory neuron epithelial tissue, were used to construct the test datasets. With the new method, decoy protein sequence databases for human and mouse were constructed and used for MS/MS data identification and quality control, and the results were compared with those of the reversed and randomized models. The experimental results show that the composition characteristics of the decoy sequence database generated by this model are similar to those of the target database, and that the method is superior to the reversed and randomized models in the sensitivity of spectrum, peptide, and protein identification on the different experimental datasets.

With the rapid development of high-throughput proteomic sequencing technology, a large number of mass spectrometry datasets have accumulated. This not only poses new challenges to data processing methods, but also provides opportunities to introduce data-driven methods such as deep learning. We believe that, as this method improves, it will be able to meet this challenge effectively.
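
For orientation, the conventional decoy constructions and the target-decoy estimate that the abstract uses as baselines can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's implementation: the function names, the toy sequence, and the scores are hypothetical, and the ratio of decoy hits to target hits above a score threshold is just one common convention for the false positive estimate, assumed here because the abstract does not specify a formula.

```python
import random


def reversed_decoy(protein: str) -> str:
    """Reversed model: read the target sequence backwards."""
    return protein[::-1]


def shuffled_decoy(protein: str, rng: random.Random) -> str:
    """Shuffled model: permute the residues, keeping the composition."""
    residues = list(protein)
    rng.shuffle(residues)
    return "".join(residues)


def estimate_false_positive_rate(matches: list, threshold: float) -> float:
    """Estimate the false positive rate among matches scoring >= threshold.

    `matches` is a list of (score, is_decoy) pairs from a search against
    the combined target and decoy databases. Assumed convention:
    decoy hits / target hits.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical toy data, for illustration only.
target = "MKTAYIAKQR"
rng = random.Random(0)
decoy_rev = reversed_decoy(target)
decoy_shuf = shuffled_decoy(target, rng)

matches = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
rate = estimate_false_positive_rate(matches, threshold=0.6)
```

Both baseline models preserve the amino acid composition of the target protein. Note that the peptides obtained after in-silico digestion of a shuffled sequence generally differ in number from those of the target, which is the inconsistency the abstract points out.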
Keywords/Search Tags: protein identification, target-decoy sequence database, deep learning, Bi-LSTM, attention mechanism