SELDI-TOF Protein Mass Spectrometry Data Analysis With Semi-supervised Learning

Posted on: 2015-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: X L You
GTID: 2268330428463975
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Cancer is one of the most serious public health problems in the world. Cancer incidence in China is rising gradually, yet one-third of cancers can be prevented. The key to treatment is improving early diagnosis. The development of proteomics and related technologies brings hope for cancer prevention and treatment, and the rapid development of protein chip technology offers new prospects for early tumor diagnosis and disease tracking. Studies have shown that protein levels already change in the early stages of cancer, even when patients show no abnormal signs, and that spectra derived from protein expression data can reveal the differences between patients and healthy people. Predicting the label of a sample, however, also requires bioinformatics. How to improve the accuracy and reliability of cancer prediction with these new diagnostic technologies is an active research question. Advances in machine learning theory have promoted progress in early diagnosis, but mass spectrometry data suffer from the curse of dimensionality, which overwhelms the classifier, and classification results obtained from small samples have been questioned.

Currently, both supervised and unsupervised learning are used to extract and classify features. Supervised learning learns from samples whose labels are known; unsupervised learning learns from samples whose labels are unknown. In practice, labeled samples are difficult to obtain, and supervised learning wastes the many unlabeled samples, while unsupervised learning wastes the labeled ones. Semi-supervised learning, which needs only a small number of labeled samples together with a large number of unlabeled samples, emerged in response. In cancer diagnosis, unlabeled samples are easy to obtain, whereas labeled samples are costly.
Because semi-supervised learning needs only a few labeled samples and can still learn mass spectrometry data well, it has good application prospects for improving cancer diagnosis.

Among semi-supervised methods, graph-based approaches are favored. Learning with local and global consistency (LLGC) is one such algorithm; it makes full use of both the unlabeled and the labeled samples. However, its classification performance depends on internal parameters, which is inconvenient when analyzing mass spectrometry data. A barebones LLGC (BB-LLGC) algorithm has therefore been proposed to avoid this parameter dependence, but it too fails on data that suffer from the curse of dimensionality. For high-dimensional, redundant mass spectrometry data, we propose a multi-step feature extraction algorithm based on semi-supervised learning and a classification method based on sparse representation and semi-supervised learning. The multi-step feature extraction algorithm first reduces the dimension, removes redundant information, and selects features with low correlation and high predictive capability, and then uses BB-LLGC for the analysis. The main ideas are as follows. First, preprocessing was applied to reduce noise and improve the signal-to-noise ratio; this removes most of the high-frequency noise and improves the comparability of the protein mass spectrometry data. Second, a t-test was used to screen the spectrometry data, which preliminarily reduced the feature dimension, although the retained data still had high redundancy and strong correlation. Third, detail features were extracted by multi-resolution wavelet decomposition and screened by entropy ranking. Then the principal components were extracted by principal component analysis.
Finally, to take advantage of both labeled and unlabeled samples, the labels were predicted with the semi-supervised BB-LLGC algorithm.

The algorithm was tested on three data sets: the public ovarian cancer data set OC-WCX2b, the public prostate cancer data set PC-H4, and the clinical breast cancer data set BC-WCX2a from Zhejiang Cancer Hospital. The classification rates were 99.13%, 96.81% and 92.78% respectively, and the sensitivities were 99.01%, 96.81% and 100% respectively. Comparative tests were designed with and without the t-test, the DWT with relative entropy ranking, PCA, and the multi-step dimensionality reduction method. The results showed that each step significantly improves the classification performance of the algorithm. In addition, we used PCA and KPCA to reduce the dimension, and a Gaussian-kernel SVM and LDA to classify, on the three data sets. The results showed no significant differences in classification rate on OC-WCX2b, and significant differences on PC-H4 and BC-WCX2a. Overall, the proposed algorithm had the higher classification rate. To further test its performance, we also designed a comparison between classifiers: the proposed method was used to reduce the dimension, and Naive Bayes, SVM and KNN were then used to classify. The results showed that the proposed method had the highest and most stable classification rate on BC-WCX2a. The tests also showed that multi-step dimensionality reduction is effective for semi-supervised classification.
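
For orientation, the label-propagation step that LLGC performs can be sketched as below. This is the standard closed form of LLGC, F* = (I - alpha*S)^(-1) Y with S the symmetrically normalized affinity matrix; it is not the BB-LLGC variant, whose details the abstract does not give. The function name `llgc_predict`, the Gaussian affinity, and the defaults `alpha=0.99` and `sigma=1.0` are illustrative assumptions. Note that `alpha` and `sigma` are exactly the kind of internal parameters the abstract says standard LLGC depends on.

```python
import numpy as np

def llgc_predict(X, y, alpha=0.99, sigma=1.0):
    """Standard LLGC closed form (illustrative sketch, not BB-LLGC).

    X: (n, d) feature matrix.
    y: length-n integer array, class index for labeled samples, -1 for unlabeled.
    alpha, sigma: assumed defaults, chosen only for illustration.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    classes = np.unique(y[y >= 0])

    # Gaussian affinity matrix with a zero diagonal.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # One-hot initial label matrix; rows of unlabeled samples stay zero.
    Y = np.zeros((n, classes.size))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0

    # Closed-form solution of the propagation: F = (I - alpha*S)^(-1) Y.
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return classes[np.argmax(F, axis=1)]
```

With one labeled sample per class and the remaining samples marked `-1`, the function returns a predicted class for every sample, labeled and unlabeled alike.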
The classification method based on sparse representation and semi-supervised learning first uses KPCA to reduce the dimension, then constructs an adjacency matrix, and finally applies semi-supervised learning.

The main idea is as follows. First, KPCA extracts the principal components of the proteomic mass spectrometry data so that the number of feature dimensions is smaller than the number of samples. Next, a sparse vector is constructed for each sample by solving a convex optimization problem, and the sparse vectors are taken as graph weights to construct an adjacency matrix. Finally, to take advantage of both labeled and unlabeled samples, semi-supervised learning is used to predict the labels. The three data sets were also tested with this algorithm: the classification rates were 99.66%, 97.35% and 92.02% respectively, and the sensitivities were 99.97%, 97.61% and 98.05% respectively. In addition, we used PCA and KPCA to reduce the dimension, and a Gaussian-kernel SVM and LDA to classify, on the three data sets. The results showed no significant differences in classification rate on OC-WCX2b and BC-WCX2a, and significant differences on PC-H4. The proposed algorithm had the best results. To further test its performance, we also designed a comparison between classifiers: KPCA with a PolyPlus kernel was used to reduce the dimension, and SVM, LDA and SRC were used to classify. The results showed that the proposed method had better classification performance. We also studied the relationship between classification performance and the number of labeled samples. The results show that the classification rate increases with the number of labeled samples and stabilizes once that number reaches a certain threshold.
In summary, the proposed algorithm has better classification performance.
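
The sparse-representation graph described above can be illustrated with a minimal sketch. The abstract says only that each sample's sparse vector comes from "solving the convex optimization problem"; the sketch assumes an L1-regularized least-squares (lasso) formulation solved by ISTA, which is one common choice and may differ from the thesis. The names `sparse_code` and `sparse_graph`, the penalty `lam`, the iteration count, and the symmetrization step are all hypothetical.

```python
import numpy as np

def sparse_code(D, x, lam=0.1, n_iter=500):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 with ISTA (assumed formulation)."""
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    if L == 0.0:
        return a
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        # Soft-thresholding is the proximal step for the L1 penalty.
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def sparse_graph(X, lam=0.1):
    """Build an adjacency matrix from each sample's sparse code over the others."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        D = X[others].T                      # columns are the other samples
        coeffs = sparse_code(D, X[i], lam)
        W[i, others] = np.abs(coeffs)        # coefficient magnitudes as weights
    # Symmetrize so the graph is undirected (an assumption of this sketch).
    return (W + W.T) / 2.0
```

The resulting matrix has a zero diagonal and nonnegative entries, so it can be passed to a graph-based semi-supervised classifier in place of a distance-based affinity matrix.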
Keywords/Search Tags: Proteomic mass spectrometry, Sparse representation, Multi-step dimensionality reduction, Semi-supervised learning, Feature extraction