| Labeling instances is an expensive and time-consuming task in machine learning. As an important branch of machine learning, active learning aims to choose the most informative unlabeled instances, by means of a selection strategy, for human experts to label, thereby minimizing the number of labeled instances required to obtain an accurate predictive model. The instance selection strategy is therefore crucial to active learning. Recently, data streams have attracted wide attention as a form of data. Data streams pose serious challenges to active learning because their characteristics differ markedly from those of traditional data models: they are huge in scale, arrive quickly, and their data distribution may change at any time. Many instance selection strategies have been studied for traditional data, but few exist for the data stream setting. In this context, this paper studies a clustering-based instance selection method for active learning. It first proposes a clustering algorithm that can find clusters of arbitrary shape and varying density to partition the instances; it then quantifies the homogeneity of the predicted class distribution in each cluster and proposes an instance selection strategy that combines representativeness and uncertainty to select the best instances for active learning. The work comprises the following two parts. Firstly, to better reflect the data distribution in a data stream, we study clustering algorithms. Because most clustering algorithms either fail to find clusters of arbitrary shape and varying density or have high computational complexity, we propose a two-stage clustering algorithm. 
First, we partition the dataset with a fast clustering algorithm. On this basis, we use the Distance-Relatedness dynamic model, which reflects a cluster’s density through neighborhood distances, to merge neighboring clusters of approximately equal density, which speeds up the discovery of clusters of arbitrary shape and varying density. Experiments show that the algorithm can find clusters of arbitrary shape and varying density and that its time efficiency is significantly better than that of algorithms of the same kind. Secondly, because concept drift may occur anywhere in the instance space of a data stream, we propose an active learning approach for data streams based on the clustering algorithm of this paper. We consider a batch-incremental setting: we cluster the instances of each new batch, give priority to the cluster in which the classifier's predictions disagree most, and then select the most informative instances in that cluster using a measure that combines uncertainty and representativeness. We select instances from different clusters to track potential concept drift. The experimental results show that the instance selection algorithm yields better classifier accuracy and stability on data streams than the algorithms it was compared with. |
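
The batch-level selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions made only for illustration: cluster labels are passed in as `cluster_ids` (standing in for the output of the two-stage clustering algorithm), disagreement is measured as the entropy of the predicted class labels in a cluster, uncertainty is one minus the top class probability, representativeness is closeness to the cluster centroid, the two are mixed by a hypothetical weight `alpha`, and clusters are visited in round-robin order so that the selected instances come from different clusters.

```python
import math
from collections import Counter


def prediction_entropy(predicted):
    """Entropy of predicted class labels; 0 means the cluster is homogeneous."""
    total = len(predicted)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(predicted).values())


def select_instances(X, proba, cluster_ids, budget, alpha=0.5):
    """Return indices of up to `budget` instances to send to the human expert.

    X           : list of feature vectors for the current batch
    proba       : per-instance class probabilities from the current classifier
    cluster_ids : cluster label of each instance (assumed to be given)
    alpha       : hypothetical weight of uncertainty vs. representativeness
    """
    # Group instance indices by cluster.
    members = {}
    for i, c in enumerate(cluster_ids):
        members.setdefault(c, []).append(i)

    # Rank clusters from most to least disagreement in predicted classes.
    def disagreement(c):
        predicted = [max(range(len(proba[i])), key=proba[i].__getitem__)
                     for i in members[c]]
        return prediction_entropy(predicted)

    order = sorted(members, key=disagreement, reverse=True)

    # Inside each cluster, rank instances by the combined score.
    ranked = {}
    for c in order:
        idx = members[c]
        dim = len(X[idx[0]])
        centroid = [sum(X[i][d] for i in idx) / len(idx) for d in range(dim)]
        dist = {i: math.dist(X[i], centroid) for i in idx}
        far = max(dist.values()) or 1.0  # avoid division by zero

        def score(i):
            uncertainty = 1.0 - max(proba[i])
            representativeness = 1.0 - dist[i] / far
            return alpha * uncertainty + (1 - alpha) * representativeness

        ranked[c] = sorted(idx, key=score, reverse=True)

    # Round-robin over clusters, most disagreeing cluster first.
    chosen = []
    while len(chosen) < budget and any(ranked.values()):
        for c in order:
            if ranked[c] and len(chosen) < budget:
                chosen.append(ranked[c].pop(0))
    return chosen
```

In a batch-incremental loop, one would call `select_instances` on each new batch, query labels for the returned indices, and update the classifier before the next batch arrives. Spreading the budget across clusters is one simple way to keep sampling from every region of the instance space, where concept drift might appear.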