Study On Identification Of Saliva-secretory Proteins Based On Machine Learning

Posted on: 2016-04-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Sun
Full Text: PDF
GTID: 1220330467995431
Subject: Computer application technology
Abstract/Summary:
Machine learning is a discipline that studies how computers can simulate or realize human learning activities in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. It has proved valuable in many areas of computer science and is becoming an important supporting technology for interdisciplinary research. Machine learning is therefore a very important research field in computer science and artificial intelligence.

Classification and clustering are two important and commonly used methods in machine learning. Clustering is an unsupervised learning method because the cluster labels are unknown in advance, and it is very widely used. Traditional clustering algorithms have successfully solved the clustering problem for low-dimensional data. However, because data in practical applications are complex, the existing algorithms often fail, especially on high-dimensional and large data sets. Support vector clustering is a clustering method that has emerged in this century. It has a solid theoretical foundation and can generate cluster boundaries of arbitrary shape, analyze noisy data, and separate overlapping clusters, which other methods could not handle. However, the method has two bottlenecks: the calculation of the Lagrange multipliers and the calculation of the adjacency matrix, of which the latter requires more computing time. Therefore, we propose improved support vector clustering algorithms. First, the distribution of the samples is obtained by mapping them to the high-dimensional feature space. Then, a subset of samples is extracted to form condensation nuclei, and these nuclei are clustered using a minimum spanning tree pruning strategy or hierarchical clustering. Finally, the k-means clustering algorithm or a discriminant analysis method is used to classify the remaining samples.
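The three steps just described can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions, not the dissertation's implementation: the condensation nuclei are taken as already selected (the abstract does not say how they are extracted), distances are Euclidean in the input space rather than in the kernel feature space, the number of clusters is supplied by the caller, and the remaining samples receive a single nearest-centroid assignment in the spirit of k-means. The function names `mst_edges`, `cluster_nuclei`, and `assign_rest` are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (dist, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cluster_nuclei(nuclei, n_clusters):
    """Prune the (n_clusters - 1) longest MST edges and label the components."""
    kept = sorted(mst_edges(nuclei))[: len(nuclei) - n_clusters]
    root = list(range(len(nuclei)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(nuclei))]

def assign_rest(nuclei, labels, rest):
    """Assign each remaining sample to the nearest nucleus-cluster centroid."""
    k = max(labels) + 1
    dim = len(nuclei[0])
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(nuclei, labels) if lab == c]
        centroids.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
    return [min(range(k), key=lambda c: math.dist(x, centroids[c])) for x in rest]

# Toy data (made up for illustration): two well-separated groups of nuclei.
nuclei = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
nucleus_labels = cluster_nuclei(nuclei, n_clusters=2)
rest_labels = assign_rest(nuclei, nucleus_labels, [(0.5, 0.5), (9, 9)])
```

Clustering only the nuclei keeps the quadratic pairwise-distance work on a small subset, which is the motivation the abstract gives for avoiding the full adjacency matrix.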
Experimental results show that the new methods outperform the original in running time, data-processing capability, and robustness.

For a learning algorithm, informative features are the key to training the model. Feature selection is an important means of improving the performance of learning algorithms. It selects, from the original feature set, the most informative features related to the specific problem, in order to reduce the dimension of the dataset and obtain better performance. Feature selection is a key data-preprocessing step in pattern recognition. So far, most algorithms use all the samples to evaluate the effectiveness of features, which ignores the effect of abnormal samples and of the samples’ distribution. This paper presents a new method that improves the effectiveness of filters through sample localization: for each test sample, feature selection is performed according to the distribution of its k nearest samples. The method is applied to datasets for acute leukemia, prostate cancer, colon cancer, breast cancer, diffuse large B-cell lymphoma, and lung cancer. The experimental results show that the t-test, permutation t-test, and MRMR filters based on sample localization each perform significantly better than their original counterparts.

Secretory proteins are proteins that are formed inside cells but function outside them, including cytokines, chemokines, digestive enzymes, hormones, antibodies, extracellular proteases, and toxins. They play an important role in immune defense, blood coagulation, cell communication, and other physiological processes, and they are closely related to angiogenesis, differentiation, invasion, and metastasis in malignant tumors. Because these proteins can be secreted in autocrine or paracrine form into blood, urine, saliva, and other body fluids, they can easily be obtained noninvasively in clinical practice.
Therefore, the proteins secreted into body fluids are an important source of disease biomarkers. With the continuous progress of proteomic science and technology, salivary diagnosis has become a hot research topic that has attracted wide attention from researchers. Compared with serum sampling, saliva sampling is simple, yields sufficient material, is noninvasive, and carries no risk of blood-borne disease transmission; compared with urine samples, saliva samples have the advantage of real-time sampling. Saliva is therefore suitable for large-scale health surveys, especially for detection in regions with limited medical resources and for infant diseases.

So far, a series of computational methods have successfully identified proteins that are secreted into blood, excreted into urine, or enter saliva from the blood circulation. This paper presents a computational method to identify saliva-secretory proteins in human saliva. First, a collection of saliva-secretory proteins is constructed from public databases and published papers, and a second collection of non-saliva-secretory proteins is constructed according to protein family information. Second, the various kinds of feature information for the proteins are organized and summarized, converted into numerical form using software or online tools, and stored. The informative features relevant to saliva-secretory proteins are then chosen by a feature selection method. Finally, a classification model is built on the selected features. The proteins identified by this model are strong candidate salivary biomarkers for the diagnosis of human disease, which will promote the further development of saliva diagnosis. Furthermore, the improved clustering algorithm and feature selection method are used to optimize the model by strategically picking the non-saliva-secretory proteins in the training set and improving the feature selection procedure.
Experimental results show that the accuracy of the new model increases significantly.

In summary, the improved methods are integrated into the saliva-secretory protein identification process to optimize both training-set selection and feature selection, and experimental results show that the model’s accuracy increases significantly.
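The sample-localized filter described in the abstract (feature selection for each test sample based on its k nearest samples) can be illustrated with a short Python sketch. The details here are assumptions rather than the dissertation's method: Euclidean distance defines the neighbourhood, Welch's t statistic stands in for the t-test filter, class labels are coded 0 and 1, and the names `welch_t` and `local_feature_ranking` are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both groups are constant: perfectly separated or identical.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def local_feature_ranking(X, y, x_test, k, n_features):
    """Rank features by |t| using only the k training samples nearest to x_test."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_test))
    near = order[:k]
    scores = []
    for f in range(len(x_test)):
        pos = [X[i][f] for i in near if y[i] == 1]
        neg = [X[i][f] for i in near if y[i] == 0]
        # A t statistic needs at least two samples per class in the neighbourhood.
        t = welch_t(pos, neg) if len(pos) > 1 and len(neg) > 1 else 0.0
        scores.append(abs(t))
    top = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return top[:n_features]

# Toy data (made up): feature 0 separates the classes, feature 1 does not,
# and the last row is an abnormal sample far from the test point.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0],
     [3.0, 4.5], [3.1, 3.2], [2.9, 4.1],
     [50.0, 50.0]]
y = [1, 1, 1, 0, 0, 0, 0]
selected = local_feature_ranking(X, y, x_test=[2.0, 4.0], k=6, n_features=1)
```

With `k=6` the abnormal sample falls outside the neighbourhood, so it no longer influences the feature scores; this is the effect the abstract attributes to sample localization.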
Keywords/Search Tags:Machine Learning, Clustering, Feature Selection, Saliva-Secretory Protein Identification