
Pattern Analysis And Recognition Of Image-based Protein Subcellular Location

Posted on: 2016-07-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Yang
GTID: 1360330590490803
Subject: Control Science and Engineering
Abstract/Summary:
Proteins are not only the material foundation of life but also the main carriers of life activities. In normal human tissues, for example, a protein must appear at the right subcellular location at the right time to function normally, e.g., to find its correct molecular interaction partners. Protein mislocalization can result in disease, including cancer. Therefore, the accurate prediction of protein subcellular locations plays a critical role in understanding the specific functions of mammalian proteins. Previously, the raw data for protein subcellular location annotation were obtained with traditional molecular biology methods and then sent to pathologists for visual examination before the annotations were completed. This process is time-consuming and expensive. Automated protein subcellular localization classification systems with high accuracy and repeatability are therefore highly desirable in the era of big data (since 2012).

Currently, most prediction models of protein subcellular localization are based on amino acid sequences. However, sequence-based analysis by itself is not sensitive enough to detect protein translocation, because translocation can be strongly affected by mutations outside the target sequence. For example, mutations in nucleoporin complexes can have dramatic effects on the nuclear localization of multiple other proteins. In practice, sensitively detecting translocated or mislocalized proteins in human cancer tissues is more relevant. Recent research has shown that some translocated or mislocalized proteins cannot find their correct molecular interaction partners, which ultimately affects the entire molecular biological network. We refer to this type of protein as a potential cancer biomarker. Focusing on the translocation or mislocalization of these potential cancer biomarkers can effectively improve the accuracy of cancer early
warning and can provide a valuable scientific basis for molecular targeted therapy and prognosis.

Motivated by the above, a growing number of researchers and institutions are shifting their data source for protein subcellular location prediction from amino acid sequences to more intuitive image data and are devoting their efforts to developing bioimage-based classification systems. In recent years, with the rapid development of high-resolution imaging technology, high-resolution protein images have become easier to obtain, so protein subcellular location patterns in normal and cancerous human tissues can be observed more directly. This progress provides high-quality data for constructing data-driven automated protein subcellular location prediction models. Image-based prediction models can not only predict human protein subcellular localization in normal and cancer tissues accurately and effectively, but are also sensitive in capturing translocated or mislocalized proteins in human cancer tissues, which helps screen potential cancer biomarkers. They are also crucially important for clinical diagnosis and pharmaceutical engineering.

In this study, we develop several image-based single-label and multi-label human protein subcellular localization predictors by employing high-performance local descriptors, efficient feature selection strategies, and advanced machine learning frameworks. Moreover, to go beyond traditional supervised learning, we propose an incremental semi-supervised learning framework, which is groundbreaking in the field of multi-label protein subcellular localization. The main contents and contributions of the dissertation are summarized as follows:

1. Image-based human reproductive tissue protein subcellular localization prediction by ensemble learning of global and local features

In recent years, people have started to pay more attention to their reproductive health because of the
increased incidence of tumors in reproductive tissue. In parallel with this growing social concern, the design of data-driven, automated systems for predicting protein subcellular location in human reproductive tissue has become a hot topic in both reproductive medicine and bioinformatics. However, in the current bio-image informatics field, no prediction system has specifically targeted protein subcellular localization in human reproductive tissue. Motivated by this, we developed an image-based human reproductive tissue protein subcellular location prediction model based on the Human Protein Atlas (HPA), a rich source of location proteomics data from Sweden. The model employs local binary patterns (LBP) for the first time in the bio-image informatics field, and the results demonstrate that the LBP feature improves its performance. To refine the model, we carried out many experiments covering different protein channel separation approaches (linear separation, LIN, and non-negative matrix factorization, NMF), multi-view feature-level fusion, and the corresponding decision-level fusion. The experimental results also show that the ensemble strategy improves prediction accuracy over the independent predictors, and that the results improve further when the LIN and NMF color separation approaches are combined. The most accurately predicted subcellular location in the final ensemble model is the mitochondrion, with an accuracy of 95.8%; the accuracy for the cytoskeleton is 92%. The final prediction system achieves an overall accuracy of 85%, which rises to 99% when only confident classifications are considered. In short, the proposed image-based prediction model can not only help verify the correctness of annotated reproductive proteins, but can also help biologists speed up the annotation of unknown proteins and
lock in candidate subcellular locations of target proteins in advance.

2. Image-based multi-label human protein subcellular localization prediction by employing high-performance local feature descriptors

Our earlier study demonstrated that the local binary pattern (LBP) feature improves the performance of the prediction model. On this basis, higher-performance local feature descriptors should benefit the model further and improve its overall performance. We therefore focus on introducing high-performance local feature descriptors to analyze and mine the micro-patterns of immunohistochemistry (IHC) images, and we carried out many experiments to validate this expectation. The benchmark dataset is the first large-scale multi-label dataset from HPA; we assembled it by combining the validation score and the reliability score, and it contains 348 proteins with 4,364 IHC images. Multi-label proteins account for 25.86% (90/348) of the dataset, which has been approved by several laboratories worldwide. In our experiments, two high-performance local feature descriptors, local tetra patterns (LTrP) and completed local binary patterns (CLBP), are applied for the first time to describe the local details of multi-label datasets and are compared with the traditional LBP descriptor. The experimental results show that CLBP and LTrP outperform LBP when used to describe IHC images from multi-label datasets. Moreover, the effectiveness of the LTrP and CLBP descriptors is demonstrated not only at the feature level by SDA but also by the final classification results. All results for the combinations of local pattern features and global features are improved by the proposed guarantee strategy, which builds on the original threshold strategy. The evaluation metrics of the BR model fed with more discriminative local
features, such as LTrP and CLBP, improve consistently on both the single-label and multi-label datasets.

3. Image-based multi-label human protein subcellular localization prediction model design using an incremental semi-supervised learning framework

Although automated prediction models have many advantages, such as high accuracy and reproducibility, most of the relevant research is based on IHC images with a high staining level. However, high-staining-level IHC images account for only 13% of the Human Protein Atlas (version 11), whereas medium-staining-level IHC images account for 31%. Medium-staining-level images are thus more than twice as numerous as high-staining-level ones, and the diversity of the training set can be greatly enhanced by taking them into account. The bottleneck can be attributed to the machine learning model: most prediction models are built on a supervised learning framework, which cannot take full advantage of medium-staining-level IHC images. Motivated by this, a new semi-supervised protocol that exploits medium-staining-level IHC images in the model construction phase through an iterative, incremental training strategy is proposed for the first time and applied to multi-label human protein subcellular localization prediction. In the decision phase, unlike in our earlier study, a new dynamic threshold criterion, named the D-criterion, is designed for the multi-label benchmark datasets in place of the top-criterion, and its effectiveness is demonstrated by the experimental results. The final experimental results show that the proposed method performs better than existing approaches such as LDS and CS4VM on the multi-label dataset used in this work.

4. Applications of the image-based protein subcellular localization prediction model

We first predict IHC images from cancer tissues using the models proposed in our earlier studies. We then compare the output
data from normal and cancer tissues to capture the differences between the two tissue conditions using three decision-making methods: the maximum likelihood approach, the plurality voting approach, and the majority voting approach. Experimental results show that 2 proteins from male reproductive tissue and 2 proteins from female reproductive tissue were screened out. The application of the image-based protein subcellular localization prediction model can thus not only help identify target proteins that are translocated or mislocalized in cancer tissues, but can also provide a reference for the treatment of human disease and supplementary experimental guidance for the early stages of drug development.
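
The plurality and majority voting rules named in contribution 4 differ only in how many votes a label needs to win. The following is a minimal Python sketch of that difference, not the dissertation's implementation. It assumes that each IHC image of a protein yields one predicted location label; the function names, the tie handling, and the example labels are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence

Vote = Callable[[Sequence[Hashable]], Optional[Hashable]]


def plurality_vote(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most frequent label, or None if the input is empty
    or two labels tie for first place (assumed tie handling)."""
    top_two = Counter(labels).most_common(2)
    if not top_two:
        return None
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None
    return top_two[0][0]


def majority_vote(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label holding strictly more than half of the votes, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def location_differs(normal: Sequence[Hashable],
                     cancer: Sequence[Hashable],
                     vote: Vote) -> bool:
    """Flag a protein when both tissue conditions give a decisive vote
    and the two winning labels disagree."""
    normal_label = vote(normal)
    cancer_label = vote(cancer)
    if normal_label is None or cancer_label is None:
        return False
    return normal_label != cancer_label


if __name__ == "__main__":
    # Hypothetical per-image predictions for one protein.
    normal_images = ["nucleus", "nucleus", "nucleus", "cytoplasm"]
    cancer_images = ["cytoplasm", "cytoplasm", "nucleus", "mitochondria"]
    print(location_differs(normal_images, cancer_images, plurality_vote))
    print(location_differs(normal_images, cancer_images, majority_vote))
```

In the hypothetical example, the cancer images give "cytoplasm" only 2 of 4 votes, which is enough for a plurality but not for a majority, so the stricter rule does not flag the protein.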
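
The local binary pattern (LBP) descriptor used in contribution 1 encodes each pixel by comparing it with its neighbors. The abstract does not state which LBP variant (radius, number of neighbors, uniform or rotation-invariant coding) the dissertation uses, so the sketch below shows only the basic 8-neighbor form on a grayscale image stored as nested lists. The function names and the neighbor ordering are assumptions made for illustration.

```python
from typing import List, Sequence

# Offsets of the 8 neighbors, clockwise from the top-left pixel.
# The starting point and direction are arbitrary choices made for this sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: Sequence[Sequence[int]], row: int, col: int) -> int:
    """Basic 8-neighbor LBP code of an interior pixel.

    A neighbor that is at least as bright as the center contributes a 1 bit.
    """
    center = image[row][col]
    code = 0
    for bit, (d_row, d_col) in enumerate(NEIGHBOR_OFFSETS):
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image: Sequence[Sequence[int]]) -> List[int]:
    """256-bin histogram of LBP codes over all interior pixels.

    Border pixels are skipped because they lack a full neighborhood.
    """
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram


if __name__ == "__main__":
    patch = [
        [9, 1, 1],
        [1, 5, 1],
        [1, 1, 1],
    ]
    # Only the top-left neighbor (bit 0) is >= the center value 5.
    print(lbp_code(patch, 1, 1))
```

A histogram of this kind is a fixed-length texture feature vector that can be concatenated with global features before classification, which is the general role the abstract assigns to local descriptors.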
Keywords/Search Tags:bio-image informatics, human protein atlas, subcellular localization, local feature, feature selection, multi-label learning, ensemble learning, t-test, supervised learning, semi-supervised learning, potential cancer biomarker