Bioinformatic Approach For Mass Spectrometry-based Glycomics

Posted on: 2014-05-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Xu
Full Text: PDF
GTID: 1220330425973291
Subject: Bio-IT
Abstract/Summary:
Mass spectrometry (MS) is a powerful technique for determining glycan structures and can provide both qualitative and quantitative information. With the development of glycomics, a large number of glycan structures have been determined by interpreting glycan profiles derived from mass spectrometry. Recent developments in computational methods offer an opportunity to use glycan structure databases and de novo algorithms to extract valuable information from MS or MS/MS data. The Consortium for Functional Glycomics (CFG) provides a web-based resource that makes it easier to acquire glycan profiling datasets and glycan structure information. However, most glycan profiling data generated by MS still have to be annotated manually, which is time-consuming and error-prone. Furthermore, detecting low-intensity peaks buried in noisy data sets remains a challenge, so an algorithm for accurate prediction and annotation of glycan structures from MS data is highly desirable.

The present study describes a novel algorithm for glycan structure prediction by matching glycan isotope abundance (mGIA), which takes isotope masses, abundances, and spacing into account. We constructed a comprehensive composition library containing 808 glycan compositions and their corresponding calculated glycan isotope abundances. We used samples from the CFG database to construct candidate composition datasets in conjunction with an effective data preprocessing procedure, which included baseline subtraction, smoothing, peak centroiding, and a library-based composition matching method for extracting detected glycan isotope profiles from MS data. Unlike most previously reported methods, we took into account not only the m/z values of the peaks but also the logarithmic Euclidean distance between the calculated and detected isotope vectors. We further predicted the overlap regions from the matching process.
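The comparison of calculated and detected isotope vectors can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so this assumes one plausible reading: both vectors are normalized to unit total abundance, log-transformed, and compared by Euclidean distance. The function names, the `eps` guard against zero abundances, and the two library entries are hypothetical illustrations, not part of mGIA or GlycoMaid.

```python
import math


def log_euclidean_distance(calculated, detected, eps=1e-6):
    """Euclidean distance between log-transformed, normalized isotope vectors.

    Assumption: normalization to unit sum and the eps offset are choices
    made for this sketch; the thesis abstract does not specify them.
    """
    if len(calculated) != len(detected):
        raise ValueError("isotope vectors must have the same length")

    def normalize(vec):
        total = sum(vec)
        if total <= 0:
            raise ValueError("isotope vector must have positive total abundance")
        return [v / total for v in vec]

    calc = normalize(calculated)
    det = normalize(detected)
    return math.sqrt(
        sum((math.log(c + eps) - math.log(d + eps)) ** 2 for c, d in zip(calc, det))
    )


def best_match(detected, library):
    """Return (name, distance) for the library entry closest to `detected`."""
    scored = ((name, log_euclidean_distance(vec, detected)) for name, vec in library.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical library entries with made-up relative isotope abundances.
library = {
    "composition_A": [0.40, 0.33, 0.18, 0.09],
    "composition_B": [0.25, 0.32, 0.25, 0.18],
}
detected = [410.0, 320.0, 185.0, 85.0]  # raw intensities of one detected cluster
name, distance = best_match(detected, library)
```

Because both vectors are normalized first, the score depends on the shape of the isotope pattern rather than on absolute intensity, which is one way a weak peak could still be matched.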
If the m/z difference between two detected ions was close to an integer in the range of 1 to 4, and each peak had a matched theoretical composition, we considered the m/z range from the lower m/z to the higher m/z + 5 as a potential overlap region. We found more than 20 potential overlap regions in every sample. Notably, we improved the mGIA algorithm by constructing an optimization model so that it can deconvolute the glycan isotopic clusters in each potential overlap region.

Evaluation against a linear classifier, obtained by training candidate composition datasets from three different human tissue samples in the CFG profiling database with a Support Vector Machine (SVM), was proposed to improve the accuracy of automatic glycan structure annotation. The algorithm was validated by analyzing the mouse kidney MS data from CFG, which identified 6 more glycan compositions than the previous annotation and significantly improved the detection of weaker peaks compared with the previously reported algorithm. Because the imbalance of the training datasets from 7 CHO samples in the CFG profiling database influenced annotation results, we combined the Support Vector Machine (SVM) algorithm with different sampling techniques, such as the Synthetic Minority Over-sampling Technique (SMOTE), to classify all potential candidate compositions. Averaged over all samples, the SMOTE-SVM algorithm increased annotation sensitivity by 26.8%. Based on the algorithm, we developed a tool named GlycoMaid that helps users automatically annotate N-glycan MS data with glycan compositions and lists the annotation confidence as well as links to possible structures in the CFG database. The package and source code can be obtained from http://code.google.com/p/glycomaid/.

In order to expand the candidate structures for each annotated glycan composition, we simulated the biosynthetic process in the ER and Golgi complex with possible enzyme reaction rules.
The results indicated the presence of many false-positive structures for compositions with higher m/z values. We also attempted to use tissue information to filter the isomers in the CFG structure database. Unfortunately, the performance was poor because the available biological information is limited.

In this thesis, we developed the mGIA algorithm for automatic interpretation of MS data and accurate annotation of glycan compositions and structures. This algorithm is especially useful for analyzing low-abundance peaks and handling overlapping glycan isotopic clusters.
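The rule for flagging potential overlap regions stated in the abstract (two matched peaks whose m/z difference is close to an integer from 1 to 4, giving a region from the lower m/z to the higher m/z + 5) can be sketched as follows. This is an illustrative reading only: the abstract does not quantify "close to an integer", so the `tolerance` value is an assumption, and the function name and example peak list are hypothetical.

```python
def find_overlap_regions(matched_mz, tolerance=0.05, max_gap=4, tail=5.0):
    """Return (start, end) m/z ranges that may contain overlapping clusters.

    `matched_mz` holds the m/z values of detected peaks that already have a
    matched theoretical composition. `tolerance` (how near to an integer the
    difference must be) is an assumed value, not taken from the thesis.
    """
    peaks = sorted(matched_mz)
    regions = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            diff = high - low
            nearest = round(diff)
            if nearest > max_gap:
                break  # peaks are sorted, so later ones are even farther away
            if 1 <= nearest <= max_gap and abs(diff - nearest) <= tolerance:
                # Region spans from the lower peak to the higher peak + 5.
                regions.append((low, high + tail))
    return regions


# Hypothetical matched peaks: only the first two are about 2 m/z units apart.
regions = find_overlap_regions([1000.50, 1002.52, 1010.00, 1500.30])
```

Each returned range would then be handed to a deconvolution step such as the optimization model mentioned above, which this sketch does not attempt to reproduce.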
Keywords/Search Tags: Bioinformatics, Glycan, Isotope, Mass spectrometry, Support Vector Machine (SVM), SMOTE, Imbalanced dataset, CFG database