
Research On New Methods Of Unbalanced Learning In Bioinformatics

Posted on: 2018-10-04 | Degree: Master | Type: Thesis
Country: China | Candidate: L Y Shen
GTID: 2350330512478775 | Subject: Computer application technology
Abstract/Summary:
Bioinformatics is an interdisciplinary field that draws mainly on life science and computational science. It applies computational and statistical techniques to real-world problems arising from the analysis of biological data, with the aim of improving our understanding of biological processes.

Class imbalance is common in both machine learning and bioinformatics, and it seriously degrades the performance of standard classifiers: when traditional machine learning methods are applied directly to imbalanced data, their predictions tend to be biased toward the majority class. Protein-ATP (adenosine-5'-triphosphate) binding residue prediction is a typical imbalanced learning problem. ATP interacts with proteins in a wide variety of biological processes, so accurately identifying binding residues from protein sequences alone is of considerable importance.

A common way to improve prediction performance on imbalanced problems is to balance the class sizes by changing the number and distribution of samples within them. Oversampling, a popular method of this kind, balances the classes by generating additional samples for the minority class. In this study, we propose a new oversampling algorithm that synthesizes new minority-class samples with a Gaussian mixture model. The Gaussian mixture model generates the additional samples, and a data cleaning technique, Tomek links, is then used to remove the borderline sample pairs that result from the oversampling process. Compared with several state-of-the-art related methods, experimental results on UCI datasets demonstrate that the proposed algorithm can relieve the severity of class imbalance and help to improve classification performance. We also apply the proposed algorithm to the protein-ATP binding site prediction problem to evaluate its effectiveness. In addition, a sparse representation technique is introduced to select the generated samples that embody more explicit semantic information. Experimental results on several protein-ATP interaction benchmark datasets demonstrate the effectiveness of the proposed oversampling algorithm.
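
The two core steps described in the abstract, Gaussian-mixture oversampling followed by Tomek-link cleaning, can be illustrated with a minimal sketch. This is not the thesis implementation: the diagonal covariances, the number of components `k`, the choice to oversample until the classes are equal in size, the removal of both members of each Tomek link, and the toy 2-D data are all assumptions made here for illustration, and the sparse-representation selection step is omitted. All function names are hypothetical.

```python
import math
import random


def fit_diag_gmm(points, k, iters=50, seed=0):
    """Fit a k-component Gaussian mixture with diagonal covariances by EM."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            logs = []
            for j in range(k):
                lp = math.log(weights[j])
                for t in range(d):
                    v = variances[j][t]
                    lp -= 0.5 * (math.log(2 * math.pi * v) + (p[t] - means[j][t]) ** 2 / v)
                logs.append(lp)
            top = max(logs)
            ex = [math.exp(lp - top) for lp in logs]
            total = sum(ex)
            resp.append([e / total for e in ex])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            for t in range(d):
                mu = sum(r[j] * p[t] for r, p in zip(resp, points)) / nj
                var = sum(r[j] * (p[t] - mu) ** 2 for r, p in zip(resp, points)) / nj
                means[j][t] = mu
                variances[j][t] = max(var, 1e-6)  # floor keeps variances positive
    return weights, means, variances


def sample_gmm(model, n, seed=0):
    """Draw n synthetic points from a fitted diagonal-covariance mixture."""
    weights, means, variances = model
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        out.append([rng.gauss(m, math.sqrt(v)) for m, v in zip(means[j], variances[j])])
    return out


def tomek_links(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    nn = [min((j for j in range(len(X)) if j != i), key=lambda j: sqdist(X[i], X[j]))
          for i in range(len(X))]
    return [(i, nn[i]) for i in range(len(X))
            if i < nn[i] and nn[nn[i]] == i and y[i] != y[nn[i]]]


def gmm_oversample(X, y, minority=1, k=2, seed=0):
    """Oversample the minority class with a GMM, then drop Tomek-link pairs."""
    minority_pts = [x for x, c in zip(X, y) if c == minority]
    n_new = max(0, len(X) - 2 * len(minority_pts))  # assumption: balance to equal size
    model = fit_diag_gmm(minority_pts, k, seed=seed)
    X_all = X + sample_gmm(model, n_new, seed=seed)
    y_all = y + [minority] * n_new
    drop = {i for pair in tomek_links(X_all, y_all) for i in pair}
    keep = [i for i in range(len(X_all)) if i not in drop]
    return [X_all[i] for i in keep], [y_all[i] for i in keep]


# Toy imbalanced data (hypothetical): 40 majority points, 8 minority points.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_res, y_res = gmm_oversample(X, y)
print("before:", y.count(0), "majority /", y.count(1), "minority")
print("after: ", y_res.count(0), "majority /", y_res.count(1), "minority")
```

A single cleaning pass is used here; removing a pair can expose new Tomek links, so a fuller treatment might repeat the pass or remove only one member of each pair. Those are design choices the abstract does not specify.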
Keywords/Search Tags:imbalanced learning, oversample, Gaussian mixture model, sparse representation, data filtering, protein-ATP binding, binding residues prediction