
Computational tools for MS-based proteomics

Posted on: 2008-08-19
Degree: Ph.D
Type: Dissertation
University: Medical University of South Carolina
Candidate: Karpievitch, Yuliya V
Full Text: PDF
GTID: 1448390005966541
Subject: Biology
Abstract/Summary:
A typical high-throughput proteomics experiment creates a large volume of data. Analysis of that data must also be performed in a high-throughput fashion, which places a heavy burden on the CPU, random access memory, and nonvolatile storage. Distributed or parallel computing addresses these issues by utilizing the computational power and data storage of remote machines. Some distributed or parallel computing falls into the category of grid computing, which we explored for bioinformatics applications in the mGrid project. mGrid allows automatic distribution of user data and user code. Many current bioinformatics and proteomics applications can benefit greatly from a distributed or parallel implementation, and that distribution can be made entirely transparent to the user.

Apart from the motivation of easing CPU load through distributed computing, application distribution also plays a critical role today, when even small local proteomic data analyses require substantial programming knowledge not typically mastered by experimental domain experts. Biomarker identification using protein profiling with mass spectrometry (MS), in particular, is a promising field that has attracted considerable interest in recent years and is the applied focus of this work. The reliability and reproducibility of biomarker identification depend extensively on the preprocessing of data obtained from MS instruments. Thus, in the second study, we developed a simple-to-use graphical tool, PrepMS, to enable researchers to easily visualize, inspect, and prepare Time-of-Flight (TOF) MS data for analysis.

Next, we introduced a new implementation of, and modification (addition) to, an existing learning/discriminant algorithm to allow it to handle data that is clustered or subject to possible block effects. For MS data it is important to use a learning algorithm that makes as few assumptions about the data as possible; decision trees are a good example of such classifiers. Moreover, combining a number of decision trees into a forest yields an even better classifier, since overfitting the data becomes less likely. We developed a new Random Forest-based algorithm in C++, RF++, with novel modifications to accommodate the clustered data commonly seen in MS and other biological experiments, as well as a graphical user interface.
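One common way to accommodate clustered data within a bagging framework is to bootstrap whole subjects (clusters) rather than individual observations, so that replicate spectra from one subject never end up split between the in-bag and out-of-bag sets. The C++ sketch below illustrates that idea only; it is not code from RF++, and the Observation struct, subject identifiers, and RNG setup are assumptions made for the example.

```cpp
// Minimal sketch of cluster-level (subject-level) bootstrap sampling,
// one way to respect clustered/block-structured data when bagging.
// Illustration only, not the RF++ implementation; the Observation type,
// subject IDs, and RNG seeding are assumptions for this example.
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

struct Observation {
    std::string subject_id;       // cluster label, e.g. a patient with replicate spectra
    std::vector<double> features; // peak intensities or other predictors
    int label;                    // class label (e.g. case vs. control)
};

// Draw a bootstrap sample by resampling whole subjects with replacement,
// so replicate observations from the same subject always stay together.
std::vector<Observation> clusterBootstrap(const std::vector<Observation>& data,
                                          std::mt19937& rng) {
    // Group observation indices by subject.
    std::map<std::string, std::vector<std::size_t>> bySubject;
    for (std::size_t i = 0; i < data.size(); ++i) {
        bySubject[data[i].subject_id].push_back(i);
    }

    std::vector<std::string> subjects;
    for (const auto& kv : bySubject) subjects.push_back(kv.first);

    std::uniform_int_distribution<std::size_t> pick(0, subjects.size() - 1);

    // Sample as many subjects as the original data contains, with replacement.
    std::vector<Observation> sample;
    for (std::size_t s = 0; s < subjects.size(); ++s) {
        const auto& indices = bySubject[subjects[pick(rng)]];
        for (std::size_t idx : indices) sample.push_back(data[idx]);
    }
    return sample;
}

int main() {
    // Tiny synthetic data set: two subjects with two replicate spectra each.
    std::vector<Observation> data = {
        {"subjA", {1.0, 2.0}, 0}, {"subjA", {1.1, 2.1}, 0},
        {"subjB", {3.0, 4.0}, 1}, {"subjB", {3.2, 4.1}, 1},
    };

    std::mt19937 rng(42); // fixed seed so the example is reproducible
    std::vector<Observation> sample = clusterBootstrap(data, rng);

    // Each tree in the forest would be grown on one such sample.
    std::cout << "bootstrap sample size: " << sample.size() << "\n";
    return 0;
}
```

In this scheme each tree of the forest is grown on one subject-level sample, which keeps within-subject correlation from leaking between training and out-of-bag data and so keeps error estimates honest for clustered designs.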
Keywords/Search Tags:Data