Algorithms And Applications Of Imbalanced Data Classification Based On Semisupervised Learning

Posted on: 2015-02-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Q Li
Full Text: PDF
GTID: 1228330467485979
Subject: Computer software and theory
Abstract/Summary:
Semi-supervised learning is an important machine learning approach that combines a small amount of labeled data with a large amount of unlabeled data to mine latent information. Most classical semi-supervised learning approaches assume that all classes have the same number of samples. In practice, however, class sizes often differ, and the class with fewer samples may be the more important one to recognize. For example, malicious users in a social network are far fewer than trusted users, yet recognizing the malicious users matters more. It is therefore both necessary and significant to solve the classification problem when different classes have different numbers of samples.

In this dissertation, we refer to such data as imbalanced data. Most existing studies classify imbalanced data with supervised learning approaches. We propose to classify imbalanced data with semi-supervised learning from two aspects: the algorithm level and the data level. At the algorithm level, existing classification algorithms are improved so that they can be applied to imbalanced data. At the data level, existing resampling methods are improved to change the sample distribution, so that balanced data can be constructed and passed to existing classification algorithms. We first propose a new semi-supervised classification approach at the algorithm level. We then propose a new resampling method that constructs balanced data for existing semi-supervised classification approaches. Finally, the proposed approaches are applied to two real-world problems: the detection of forest fires and privacy protection in social networks.
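The data-level idea above can be illustrated with a minimal sketch. It is not the dissertation's resampling method, whose details the abstract does not give. It assumes Euclidean distance and a simple stopping rule (grow every class to the size of the largest one), and the function name `balance_by_nearest_unlabeled` is hypothetical.

```python
import numpy as np

def balance_by_nearest_unlabeled(X_lab, y_lab, X_unl):
    """Grow smaller classes by pseudo-labeling nearby unlabeled points.

    Illustrative assumption only: each minority class repeatedly absorbs
    the unlabeled point closest to any of its current members until it is
    as large as the largest class (or no unlabeled points remain).
    """
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unl = np.asarray(X_unl, dtype=float)

    available = np.ones(len(X_unl), dtype=bool)   # unlabeled points not yet used
    classes, counts = np.unique(y_lab, return_counts=True)
    target = counts.max()
    new_X, new_y = [], []

    for c, n in zip(classes, counts):
        members = X_lab[y_lab == c]
        for _ in range(target - n):
            if not available.any():
                break
            idx = np.flatnonzero(available)
            # Distance from each available point to its nearest class member.
            diffs = X_unl[idx, None, :] - members[None, :, :]
            d = np.linalg.norm(diffs, axis=2).min(axis=1)
            pick = idx[np.argmin(d)]
            available[pick] = False
            new_X.append(X_unl[pick])
            new_y.append(c)
            members = np.vstack([members, X_unl[pick]])

    if new_X:
        X_lab = np.vstack([X_lab, np.array(new_X)])
        y_lab = np.concatenate([y_lab, np.array(new_y)])
    return X_lab, y_lab
```

The balanced output can then be handed to any existing semi-supervised classifier. Pseudo-labels chosen purely by distance can be wrong near class boundaries, so a practical method would need a more careful selection criterion than this sketch uses.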
The main contributions of this dissertation are summarized as follows:
(1) In traditional graph-based semi-supervised classification approaches, the classes contribute unequal amounts of label information during label propagation, which leads to imbalanced classification. We therefore propose an approach called LMN, which uses a balance factor to construct a normalized label matrix. This gives each class the same quantity of label information and keeps the classification balanced.
(2) With so few labeled samples, traditional resampling methods fail to define the classification boundary they need for constructing artificial samples. We therefore propose a sampling approach called INNO, which iteratively selects a certain number of unlabeled samples close to the labeled samples and adds them to the labeled set. This yields a balanced labeled dataset for semi-supervised learning methods.
(3) In active learning based on boundary sampling, the samples closest to the decision boundary are selected and then labeled by domain experts. However, a selected sample may lie close to samples that are already labeled, in which case it offers little help to the classifier. We therefore propose a classification approach for imbalanced data that uses a similarity detection algorithm to avoid selecting samples near regions where known samples are located. This helps the classifier handle imbalanced data better.
(4) Traditional forest fire detection algorithms are limited by energy efficiency, processing power and memory size. Moreover, balanced data is difficult to obtain because forest fires seldom occur. We therefore propose a forest fire detection approach based on semi-supervised learning. First, temperature variation curves are summarized into four patterns in an offline phase. Then, the temperature sequence monitored by the sensors is divided into subsequences of equal length. Finally, the previously proposed INNO is used to classify unknown temperature sequences in order to detect forest fires.
(5) Social networking is one of the most successful services on the mobile Internet, and how much one can trust others has become one of users' greatest concerns. To prevent private information from being exposed by malicious users, we propose a two-way trust inference method to calculate local trust values. Taking the transferability of friend relationships and the imbalance of the data into account, we also propose a modified approach based on the previously proposed LMN. The proposed approach removes the restriction that a path must exist between any two users for trust inference, and it achieves higher accuracy.
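To make the idea behind contribution (1) concrete, the following is a minimal sketch of graph-based label propagation with a class-normalized label matrix. It is not the LMN algorithm itself: the abstract does not define the balance factor, so the sketch assumes it is the reciprocal of each class's labeled-sample count, and it uses the standard closed-form propagation F = (I - alpha*S)^(-1) Y with an RBF affinity graph. The function name and the parameters `alpha`, `sigma` and `normalize_classes` are hypothetical.

```python
import numpy as np

def propagate_labels(X, y, alpha=0.9, sigma=1.0, normalize_classes=True):
    """Label propagation on an RBF graph; y uses -1 for unlabeled points.

    Assumed balance factor: each column of the label matrix is divided by
    the number of labeled samples in that class, so every class injects
    the same total amount of label information.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    classes = np.unique(y[y >= 0])

    # RBF affinity matrix with a zero diagonal.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))

    # One-hot label matrix: rows are samples, columns are classes.
    Y = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        Y[y == c, j] = 1.0
    if normalize_classes:
        Y /= Y.sum(axis=0, keepdims=True)   # each column now sums to 1

    # Closed-form solution of the propagation iteration.
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return classes[F.argmax(axis=1)]
```

Without the column normalization, a class with many labeled samples contributes more total mass to `F` and can dominate the scores of unlabeled points; dividing by the class count is one simple way to equalize that contribution.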
Keywords/Search Tags: Semi-Supervised Learning, Imbalanced Data, Active Learning, Wireless Sensor Networks, Social Networks