
A Study On Learning From Positive And Unlabeled Examples

Posted on: 2010-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Z Zhang
Full Text: PDF
GTID: 1118360272497280
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the World Wide Web, information on the Web has grown explosively, and finding what one needs is like looking for a needle in a haystack. This is the so-called "information explosion" problem: information is abundant while knowledge is relatively scarce. Search engines were born to address it; they have become one of the most important Internet information-service tools and an exciting area for research. However, because of the Web's free structure and constant dynamic change, users sometimes cannot reach the important information they need, and focused crawling is the basic approach to solving this.

The main line of focused crawling descends from the system S. Chakrabarti built in 1999, which collected topic-specific information driven by example web pages. The technical difficulty of focused crawling lies in accurately predicting whether a page is relevant to the topic before actually downloading it, which can only be achieved through machine learning techniques. Funded by the National Natural Science Foundation project "Based on the Incremental and Mobile Characteristics of the Focused Crawling Technology" (Grant No. 60373099), this thesis studies the key technology of focused crawling, text classification. It casts the prediction of topic relevance as relevance between documents, thereby turning the judgment of topics into a problem of learning from positive and unlabeled examples (LPU), and summarizes the attempts and explorations made.

The biggest obstacle in LPU is the lack of usable negative examples, so traditional supervised and semi-supervised learning algorithms cannot be applied directly. This thesis first traces the LPU paradigm and makes a comprehensive survey and in-depth study of related work.
Up-to-date machine-learning-based text mining techniques are introduced into the LPU problem and extended. The thesis puts forward several new solutions and attains fruitful research results. Its main innovations fall into two aspects.

The first aspect builds on the two-stage strategy, the most studied approach: in stage 1, a set of reliable negatives (RN) is extracted from the unlabeled set U to serve as initial negative examples; in stage 2, a classifier is built by iteratively applying a classification algorithm to U minus RN, expanding the set of reliable negatives until the final classifier is obtained. Since the Support Vector Machine (SVM) has been confirmed as one of the best classifiers for text classification, an iterative SVM is usually used in stage 2. The research here concentrates on stage 1, extracting reliable negative examples, for which the thesis puts forward three novel algorithms.

(1) A novel reliable-negative extraction algorithm based on classical k-Means clustering. The clustering assumption, a basic assumption of semi-supervised learning, is widely used in that field. Clustering-aided text classification falls into three major areas of study: feature selection based on clustering, clustering in semi-supervised classification, and clustering in large-scale classification problems. The proposed method first applies k-Means to the training set (the positive set plus the unlabeled set), then labels as reliable negatives the members of any cluster whose proportion of positive examples falls below a threshold.
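The clustering-based extraction in (1) can be illustrated with a minimal sketch in Python with NumPy. The plain k-Means loop, the `extract_rn` helper, the toy 2-D data, and the threshold value are all assumptions for illustration, not the thesis's actual implementation:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-Means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):                 # keep old center if cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def extract_rn(P, U, k=2, threshold=0.1):
    """Cluster P ∪ U; clusters whose share of positives falls below
    `threshold` contribute their unlabeled members as reliable negatives."""
    X = np.vstack([P, U])
    labels = kmeans(X, k)
    pos_labels, unl_labels = labels[:len(P)], labels[len(P):]
    rn_idx = []
    for j in range(k):
        size = (labels == j).sum()
        pos_share = (pos_labels == j).sum() / size if size else 0.0
        if pos_share < threshold:
            rn_idx.extend(np.flatnonzero(unl_labels == j))
    return np.array(rn_idx, dtype=int)

# Toy data: positives cluster near (5, 5); U mixes points near (0, 0) and (5, 5).
P = np.array([[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
U = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [5.1, 5.0]])
rn = extract_rn(P, U, k=2, threshold=0.1)
print(sorted(rn.tolist()))   # indices into U flagged as reliable negatives
```

On this toy data the cluster near the origin contains no positives, so its three unlabeled members become the reliable negatives, while the unlabeled point near the positives is left in U for stage 2.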
Experiments show that the number of reliable negatives obtained this way is moderate, balances the ratio of initial positive and negative examples well, and has strong discriminative power, so good experimental results are obtained.

(2) Another novel reliable-negative extraction algorithm, based on constraint-based k-Means clustering. Using the small amount of available supervision to boost the performance of classical clustering gives rise to semi-supervised clustering, of which constraint-based clustering is one of the most important branches. To exploit the existing positive labels during clustering, the constrained k-Means algorithm, a relatively new semi-supervised clustering algorithm, is introduced. During clustering, the positive set is first used to initialize a centroid, and the positive labels act as must-link constraints; then some proportion of the examples near each centroid is labeled. Notably, this method also expands the set of positive examples, something still rare in current LPU work and previously used only in the PNLH algorithm; it is an innovative attempt of this thesis.

(3) A third novel reliable-negative extraction algorithm, based on ranking learning with the kNN algorithm. Ranking learning originates in information retrieval and is a machine learning method between classification and regression. The kNN algorithm is used to rank the unlabeled examples by computing each one's similarity to its k nearest positive neighbors.
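The constraint-based extraction in (2) can be sketched as follows. This is a simplified two-centroid variant: the must-link constraint is enforced by always averaging all positives into the positive centroid, and `expand_frac` controls how many unlabeled examples near that centroid are relabeled positive. The function name, the toy data, and the two-cluster simplification are assumptions for illustration:

```python
import numpy as np

def constrained_rn(P, U, expand_frac=0.2, iters=20):
    """Two centroids: one anchored to the positives (the must-link keeps P
    together), one for the rest. After convergence, the closest `expand_frac`
    of U to the positive centroid are relabeled positive; the U examples in
    the other cluster become reliable negatives."""
    pos_c = P.mean(axis=0)
    neg_c = U.mean(axis=0)
    for _ in range(iters):
        d_pos = np.linalg.norm(U - pos_c, axis=1)
        d_neg = np.linalg.norm(U - neg_c, axis=1)
        in_neg = d_neg < d_pos
        if in_neg.any():
            neg_c = U[in_neg].mean(axis=0)
        # must-link constraint: the positive centroid always includes all of P
        members = np.vstack([P, U[~in_neg]]) if (~in_neg).any() else P
        pos_c = members.mean(axis=0)
    d_pos = np.linalg.norm(U - pos_c, axis=1)
    n_expand = max(1, int(expand_frac * len(U)))
    expanded = np.argsort(d_pos)[:n_expand]   # new positives near the centroid
    rn = np.flatnonzero(in_neg)               # reliable negatives
    return expanded, rn

P = np.array([[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
U = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [5.1, 5.0]])
expanded, rn = constrained_rn(P, U, expand_frac=0.25)
```

Here the unlabeled point near the positive centroid is expanded into the positive set, which is exactly the positive-expansion behavior the thesis highlights as rare in LPU work.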
Instead of the usual practice of labeling the highest-ranked examples, the examples whose scores fall below a threshold are labeled as reliable negatives, and threshold selection is examined through experiments.

As the second aspect of this work, two novel LPU methods are proposed based on the Co-training paradigm, the most important semi-supervised learning method. The standard Co-training paradigm assumes the attribute set has two sufficient and redundant views, i.e., two attribute subsets satisfying the conditional independence assumptions: (1) each subset is sufficient to describe the problem, that is, sufficient to train a strong learner given enough training examples; and (2) given the label, each subset is conditionally independent of the other. A separate classifier can then be trained for each view on the labeled examples; during Co-training, each classifier selects and labels a number of high-confidence examples and updates the other classifier with the newly labeled examples. The algorithm iterates until a stopping condition is met. As a typical semi-supervised learning method, Co-training searches the feature space of unlabeled examples with complementary classifiers, which can be obtained either through features or through complementary models. The intuition is that results one classifier is uncertain about may be confidently identified by its complement, and the final result combines the classifiers' outputs. With appropriate classifiers, Co-training outperforms self-training. This thesis puts forward two new algorithms based on the two most important Co-training variants, Co-training itself and Tri-training:

(1) LPU based on the Co-EM SVM algorithm.
Co-EM SVM improves on standard Co-training within the EM framework, replacing the original embedded Naive Bayes classifier with an SVM, which is commonly used in text classification and has been shown to be among the best classifiers. The proposed method first splits the views with the 1-DNF algorithm into a positive feature set and a negative feature set, then runs a reliable-negative extraction algorithm within each single view to obtain that view's reliable negatives, ensuring the initial classifier can be trained; Co-EM SVM can then be applied to LPU, with better results confirmed experimentally.

(2) LPU based on the Tri-training algorithm. Tri-training further relaxes the constraints of Co-training: it requires neither sufficient and redundant views nor different types of classifiers. Its notable feature is the use of three classifiers, which makes confidence estimation easy, solves the prediction problem for unlabeled examples, and allows ensemble learning to improve generalization. The proposed method initializes the three SVM classifiers with the reliable negatives produced by the three extraction algorithms above, instead of the bootstrap sampling of standard Tri-training. The labeling strategy adds an example to the third classifier's training set when the other two classifiers agree that it is positive or negative; the termination condition is that no classifier has further examples to label; the final output is a combination of the three classifiers.
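The agreement-based labeling rule of Tri-training can be sketched as follows. The three threshold classifiers and the scalar data are illustrative stand-ins for the SVM classifiers the thesis actually uses; `tri_train_label` performs one labeling pass only, not the full iterative algorithm:

```python
import numpy as np

def tri_train_label(classifiers, U):
    """One Tri-training-style labeling pass: an unlabeled example is added
    to classifier i's training pool when the other two classifiers agree
    on its label."""
    preds = np.array([clf(U) for clf in classifiers])   # shape (3, len(U))
    pools = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        agree = np.flatnonzero(preds[j] == preds[k])
        pools.append([(U[n], int(preds[j][n])) for n in agree])
    return pools

# Three hypothetical threshold classifiers over a scalar feature.
clfs = [lambda X: (X > 0.4).astype(int),
        lambda X: (X > 0.5).astype(int),
        lambda X: (X > 0.6).astype(int)]
U = np.array([0.1, 0.45, 0.55, 0.9])
pools = tri_train_label(clfs, U)
```

Each classifier's pool receives only the examples its two peers agree on, so disagreement acts as the confidence filter; the final prediction would then combine the three retrained classifiers, e.g. by majority vote.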
The method is proved effective through experiments. The proposed methods are compared experimentally on the popular Reuters-21578 text classification data set under standard text classification evaluation measures, and the results show that they outperform related work; the research results have been published. Although some achievements have been attained in LPU, there is room to improve and go further. For instance, this thesis does not consider the manifold assumption of semi-supervised learning; how to learn the distance function for semi-supervised clustering, as well as kernel-based clustering, remains open; and the effectiveness of the Co-training algorithm still lacks formal verification and extension: although a view-splitting algorithm based on positive examples is put forward, it is only an empirical conclusion without formal proof. Since little research currently addresses these issues, they are worthy of in-depth study and extension. In short, the research and application of LPU is in its infancy; there remains a great deal for profound research, which will lead the author into further exploration.
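The standard text classification evaluation measures mentioned above are typically precision, recall, and F1. As a reminder of how they are computed from raw counts (the counts below are toy values, not results from the thesis):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf(tp=8, fp=2, fn=2)   # toy counts for illustration
```

F1 is the harmonic mean of precision and recall, which is why it is the usual single-number summary for imbalanced text classification tasks such as LPU.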
Keywords/Search Tags: Learning from Positive and Unlabeled examples, Semi-Supervised Learning, Text Classification, Text Clustering, Support Vector Machine, Ranking Learning, Co-Training, Tri-Training