Research Of Imbalanced Dataset And Application In Prediction Of Protein-Protein Interaction Sites

Posted on: 2012-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: L N Zhang
Full Text: PDF
GTID: 2218330338470609
Subject: Computer application technology
Abstract/Summary:
The imbalanced dataset problem arises frequently in practice, in applications such as credit card fraud detection, information retrieval, network intrusion detection, medical diagnosis, text classification, and biological information detection. Traditional classification algorithms, whose evaluation criteria are based on accuracy, generally perform well on balanced datasets but poorly on imbalanced ones: minority class samples are often misclassified as majority class samples, which defeats the purpose of classification. Yet on an imbalanced dataset, the recognition rate of the small minority class usually matters most. The minority class samples are also sparsely distributed and often surrounded by a large number of majority class samples, which is one of the major challenges in learning the minority class. New evaluation criteria and new classification methods are therefore urgently needed in research on imbalanced classification.
Because imbalanced datasets occur so often in practical applications, they pose a great challenge to traditional classification methods, and how to handle them effectively has attracted wide attention. Classification of imbalanced datasets has become a new hotspot in machine learning and data mining, and it has also drawn the interest of pattern recognition researchers. In recent years, ACM, IEEE, and other conferences on machine learning, pattern recognition, and data mining have discussed a range of topics related to imbalanced datasets.
To address the shortcomings of the under-sampling method, we propose a modified algorithm based on clustering.
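As an illustration of why accuracy alone is a misleading criterion on imbalanced data, the following minimal sketch scores a trivial classifier that always predicts the majority class. The data, the function name `imbalance_metrics`, and the use of the G-mean (the geometric mean of the per-class recalls) are assumptions chosen for illustration; the abstract does not state which evaluation criteria the thesis adopts.

```python
from math import sqrt


def imbalance_metrics(y_true, y_pred, minority=1):
    """Return (accuracy, minority recall, majority recall, G-mean)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    accuracy = (tp + tn) / len(pairs)
    minority_recall = tp / (tp + fn) if tp + fn else 0.0
    majority_recall = tn / (tn + fp) if tn + fp else 0.0
    # G-mean is zero whenever either class is never recognized.
    g_mean = sqrt(minority_recall * majority_recall)
    return accuracy, minority_recall, majority_recall, g_mean


# Hypothetical data: 95 majority samples (label 0), 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc, min_rec, maj_rec, g = imbalance_metrics(y_true, y_pred)
print(acc, min_rec, maj_rec, g)  # 0.95 0.0 1.0 0.0
```

The classifier reaches 95% accuracy while recognizing no minority sample at all, which is the failure mode the abstract describes.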
To preserve overall classification performance, improve the accuracy of minority class prediction, and avoid discarding majority class samples that carry important information, we combine selective sampling with random sampling when drawing from the majority class and propose an under-sampling method based on K-means clustering. Experiments on UCI datasets validate its effectiveness. Applied to the prediction of protein-protein interaction sites, the method effectively addresses the class imbalance in that problem and raises the recognition rate of interaction sites.
Overall, the main contents of this thesis are as follows:
1. We introduce the background and significance of research on imbalanced datasets and ensemble learning, covering the problems faced in classifying imbalanced datasets and the strategies for solving them, as well as the research methods and practical applications of ensemble learning.
2. To improve the accuracy on minority class samples while keeping overall classification performance, and to avoid as far as possible the loss of majority class samples that carry important information, we introduce an unsupervised learning method and propose an under-sampling method based on the K-means clustering algorithm. Experimental results on UCI datasets show that this method effectively improves the recognition rate of minority class samples while maintaining overall classification performance. The method can also be applied to imbalanced classification problems in real-world settings.
3. We describe the background and significance of research on protein-protein interaction sites. To further improve prediction accuracy, we present an ensemble method based on a constructive neural network. The protein sequence profile and residue accessible surface area serve as the feature vector, with a window of 11 residues. Compared with the traditional SVM and the covering algorithm, the method achieves better overall predictive performance, indicating that the covering-based ensemble learning algorithm is correct and effective for protein-protein interaction site prediction.
4. We analyze how the class imbalance present in the protein interaction site dataset affects prediction. To further improve the recognition rate of interface residues, we apply the K-means-based under-sampling method to obtain balanced protein datasets. Experimental results show that the method effectively addresses the class imbalance in protein interaction site prediction and improves the recognition rate of interaction sites.
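The idea of cluster-based under-sampling can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not give the selection rule, so here the majority class is clustered with a plain K-means and samples are then drawn at random from each cluster in proportion to its size. The function names `kmeans` and `kmeans_undersample`, the parameter `k=3`, and the synthetic 2-D data are all hypothetical.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, rng=None):
    """Plain K-means; returns the cluster index assigned to each point."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: _dist2(p, centers[c]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign


def kmeans_undersample(majority, n_keep, k=3, seed=0):
    """Keep n_keep majority samples, drawn per cluster in proportion to size."""
    if n_keep > len(majority):
        raise ValueError("n_keep exceeds the number of majority samples")
    rng = random.Random(seed)
    assign = kmeans(majority, k, rng=rng)
    kept = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        # Assumed rule: each cluster contributes in proportion to its size.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    # Rounding can overshoot or undershoot n_keep; trim or top up at random.
    rng.shuffle(kept)
    kept = kept[:n_keep]
    chosen = set(kept)
    leftover = [i for i in range(len(majority)) if i not in chosen]
    kept.extend(rng.sample(leftover, n_keep - len(kept)))
    return [majority[i] for i in kept]


# Hypothetical data: 90 majority points in three blobs, 10 minority points.
gen = random.Random(1)
majority = [(gen.gauss(cx, 0.5), gen.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
minority = [(gen.gauss(2.5, 0.5), gen.gauss(2.5, 0.5)) for _ in range(10)]

balanced_majority = kmeans_undersample(majority, n_keep=len(minority))
print(len(balanced_majority), len(minority))  # 10 10
```

Sampling within clusters rather than from the whole majority class is one way to keep every region of the majority distribution represented after under-sampling, which matches the stated goal of not losing informative majority samples.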
Keywords/Search Tags: Unbalanced dataset, Under-sampling, Cluster, Protein interaction sites, Ensemble learning, Covering algorithm