Font Size: a A A

Semi-Supervised Resource Classification Based On Improved Label Propagation Algorithm

Posted on:2009-01-25Degree:MasterType:Thesis
Country:ChinaCandidate:X J RenFull Text:PDF
GTID:2178360242980625Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the increasing capacity of the network information, searching for specific information and automatically discovering useful information become increasingly important, but also more and more difficult. Therefore, there is a need for a good automatic classification algorithm which can achieve good classification results at minimum manpower.In Network Educational Resources Management System (NERMS), a large number of educational resources can be searched from the network through the search engine. The resources need to be classified, so that users can conveniently search them and administrators can effectively manage them. Some resources may not belong to the classification theme in the system, we should find these resources and in accordance with their content add a new classification theme to the system. At the same time large number of extraneous information will be provided by the search engines, so we need to identify this noise resources and delete them. Therefore an effective classification method which can classify resources effectively, but can also identify noise resources or resources not having classification theme is needed. Traditional machine learning divided into supervised learning and unsupervised learning. Supervised learning methods use labeled datasets to train the classifier. However, as human, financial or some other reasons, getting labeled data is often very difficult, because it requires manual labeling process. Unlabeled data collection is relatively easy. Unsupervised learning is unlabeled datasets clustering, but it search in solution space with a certain blindness, the accuracy of clustering result so to a certain extent, is not high. In addition, in high-dimensional space, selecting an appropriate distance metrics is rather difficult.Semi-supervised learning using labeled and unlabeled data has become an increasingly popular field of study in recent years and takes the researcher's more attention and favor. It attempt to make use of a large number of unlabeled data to study the internal structure or the law, and on this basis use a small amount of labeled data for the study guide. Present work includes semi-supervised classification, semi-supervised clustering, and so on. Comparing with unsupervised learning, the learning method which use of a small portion of labeled data to guide learning, can improve the precision of the study. On the other hand, semi-supervised learning uses the bulk of unlabeled data, which lower the demands for labeled data as supervised learning. Semi-supervised learning is a hot study domain in theory because the natural learning systems, including our own, are learning on the labeled and unlabeled data. Besides it use a small amount of manpower at the same time can maintain the quality of the learning, so it also is concerned about at practice.Label propagation algorithm is one of graph-based semi-supervised learning method. We formulate the problem as a form of propagation on a graph, the nodes in the graph represent labeled and unlabeled data, and edges(may be weighted)reflect the similarity of data. A node's label propagates to neighboring nodes according to their proximity, the greater similarity between the nodes the easier the labels be propagated. In implement, for each node we define a probability distribution table on category to represent the probability it belongs to each category, we take this probability distribution table as the node's label. In propagation, each node updates its probability distribution according to its neighbor's probability distribution. And the neighbors which are more similar to the node impact its probability distribution more significantly, thus similar nodes will have similar probability distribution in the end. The propagation implements iteratively until the probability distribution of nodes converge to a fixed value.Label propagation algorithm can effectively label the data, but it is helpless when the noise data is exists, as the algorithm will label every data finally and doesn't consider whether the data is noise or not. Therefore we improved the label propagation algorithm. The core content of the improvement is calculating the similarity with the category for each point in propagation process. If a node's similarity with each category is small than a threshold value, then we consider this point is noise, that is it don't belong to any category. In the propagation process, we prohibit the noise from propagate its label to any of its neighbors.In this paper, we experiment the improved label propagation algorithm on the UCI standard data sets, and compare it with the original algorithm. The result shows that the improved algorithm maintains the classification correct rate, at the same time, it can find the noise data, avoiding the noise data to be classified to norm category. In addition, we experiment the density of the graph used in label propagation algorithm, discuss the value selection for k in the kNN graph. Through comparison of the value of k on three different data sets, we found that the result of fully connected graph is worse than sparse graph in label propagation. K values between 4 to 20 is a good choice, its value should be decided according to the actual data model. Experiment also compared the results on the different size of data sets。It can be seen that relatively larger data sets will result higher classification correct rate. This fully shows that the study of the distribution and structure of unlabeled data can effectively help classification and improve the classification correct rate. Applying the improved propagation algorithm in NERMS system toclassify the resources obtains good result. It can effectively classify the resources with a small number of labeled data, making a great convenience for user's browse and administrator's management. In addition identifying the noise data can expand the system's classification theme, make it increasingly perfect. At the same time some useless or not related resources can be deleted, improve the quality of the system resources. We briefly introduce the classification theme level tree and resources classification of the NERMS system, based on this we describe how to apply the improved label propagation algorithm to the resources classification, including the steps of data acquisition, initialization, pre-classification, label propagation, results displaying.In this paper, the algorithm implement on the basis of the label propagation algorithm, which inherit the advantage of semi-supervised learning. At the same time the noise identify function is added for avoiding wrong classification of noise data. There are still problems in this thesis: time complication is high, it is not improved compared with the original algorithm. In addition because of using matrix to storage the similarity of the nodes, so large storage space is needed, this make large data set classification can't achieve. These problems still need to be solved.
Keywords/Search Tags:Semi-Supervised
PDF Full Text Request
Related items