
Computational tools for MS-based proteomics

Posted on: 2008-08-19
Degree: Ph.D
Type: Dissertation
University: Medical University of South Carolina
Candidate: Karpievitch, Yuliya V
Full Text: PDF
GTID: 1448390005966541
Subject: Biology
Abstract/Summary:
A typical high-throughput proteomics experiment creates a large volume of data. Analysis of that data must also be performed in a high-throughput fashion, which places a heavy burden on the CPU, random access memory, and nonvolatile storage. Distributed or parallel computing addresses these issues by utilizing the computational power and data storage of remote machines. Some distributed or parallel computing falls into the category of grid computing, which we explored for bioinformatics applications in the mGrid project. mGrid allows automatic distribution of user data and user code. Many current bioinformatics and proteomics applications can benefit greatly from a distributed or parallel implementation, and that distribution can be made entirely transparent to the user.

Apart from the motivation of easing CPU load through distributed computing, application distribution also plays a critical role today, when even small local proteomic data analyses require substantial programming knowledge not typically mastered by experimental domain experts. Biomarker identification using protein profiling with mass spectrometry (MS), in particular, is a promising field that has attracted considerable interest in recent years and is the applied focus of this work. The reliability and reproducibility of biomarker identification depend extensively on the preprocessing of data obtained from MS instruments. Thus, in the second study, we developed a simple-to-use graphical tool, PrepMS, to enable researchers to easily visualize, inspect, and prepare Time-of-Flight (TOF) MS data for analysis.

Next, we introduced a new implementation of, and modification (addition) to, an existing learning/discriminant algorithm to allow it to handle data that is clustered or subject to possible block effects. For MS data it is important to use a learning algorithm that makes as few assumptions about the data as possible; decision trees are a good example of such classifiers. Moreover, combining a number of decision trees into a forest yields an even better classifier, since overfitting the data becomes less likely. We developed a new Random Forest-based algorithm in C++, RF++, with novel modifications to accommodate the clustered data commonly seen in MS and other biological experiments, as well as a graphical user interface.
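One common way to accommodate clustered data within a bagging framework is to bootstrap whole subjects (clusters) rather than individual observations, so that replicate spectra from one subject never end up split between the in-bag and out-of-bag sets. The C++ sketch below illustrates that idea only; it is not code from RF++, and the Observation struct, subject identifiers, and RNG setup are assumptions made for the example.

```cpp
// Minimal sketch of cluster-level (subject-level) bootstrap sampling,
// one way to respect clustered/block-structured data when bagging.
// Illustration only, not the RF++ implementation; the Observation type,
// subject IDs, and RNG seeding are assumptions for this example.
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

struct Observation {
    std::string subject_id;       // cluster label, e.g. a patient with replicate spectra
    std::vector<double> features; // peak intensities or other predictors
    int label;                    // class label (e.g. case vs. control)
};

// Draw a bootstrap sample by resampling whole subjects with replacement,
// so replicate observations from the same subject always stay together.
std::vector<Observation> clusterBootstrap(const std::vector<Observation>& data,
                                          std::mt19937& rng) {
    // Group observation indices by subject.
    std::map<std::string, std::vector<std::size_t>> bySubject;
    for (std::size_t i = 0; i < data.size(); ++i) {
        bySubject[data[i].subject_id].push_back(i);
    }

    std::vector<std::string> subjects;
    for (const auto& kv : bySubject) subjects.push_back(kv.first);

    std::uniform_int_distribution<std::size_t> pick(0, subjects.size() - 1);

    // Sample as many subjects as the original data contains, with replacement.
    std::vector<Observation> sample;
    for (std::size_t s = 0; s < subjects.size(); ++s) {
        const auto& indices = bySubject[subjects[pick(rng)]];
        for (std::size_t idx : indices) sample.push_back(data[idx]);
    }
    return sample;
}

int main() {
    // Tiny synthetic data set: two subjects with two replicate spectra each.
    std::vector<Observation> data = {
        {"subjA", {1.0, 2.0}, 0}, {"subjA", {1.1, 2.1}, 0},
        {"subjB", {3.0, 4.0}, 1}, {"subjB", {3.2, 4.1}, 1},
    };

    std::mt19937 rng(42); // fixed seed so the example is reproducible
    std::vector<Observation> sample = clusterBootstrap(data, rng);

    // Each tree in the forest would be grown on one such sample.
    std::cout << "bootstrap sample size: " << sample.size() << "\n";
    return 0;
}
```

In this scheme each tree of the forest is grown on one subject-level sample, which keeps within-subject correlation from leaking between training and out-of-bag data and so keeps error estimates honest for clustered designs.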
Keywords/Search Tags:Data