
Research On Positive Unlabeled Learning Algorithms For Text And Time Series Data

Posted on: 2016-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: D K Zhang
Full Text: PDF
GTID: 2308330461466596
Subject: Computer software and theory
Abstract/Summary:
As a subtask of data mining, classification analysis has a wide range of applications in people's production and daily life. Traditional classification algorithms require users to provide labeled training samples. However, because of the cost in human labor, money, and time, users usually label only a small number of samples from the categories they are interested in, treating them as positive samples. How to build a classifier from these few positive samples and a large number of unlabeled samples is the subject of Positive Unlabeled Learning (PU Learning). Because of its great practical importance, PU Learning has attracted wide attention from researchers. This thesis proposes new PU Learning algorithms for text data and for time series data.

(1) The proposed PU Learning algorithm for text data targets the scenario in which the training data contain only a small number of positive samples. Drawing on the rich knowledge in Wikipedia and the idea of the Neighborhood Kernel, it devises the Wikipedia Knowledge based Neighborhood Kernel (WKNK). With the threshold set to 0.25, experiments on Reuters-21578 show that the averaged F1 value of the One-class Support Vector Machine (One-class SVM) with WKNK improves by 10.1% over that of the One-class SVM with a linear kernel; on 20-Newsgroups, the corresponding improvement is 54.8%. These results indicate that WKNK effectively overcomes the difficulty caused by the lack of training data in one-class text classification and improves the classification performance of the One-class SVM.

(2) The proposed PU Learning algorithm for time series data devises the Positive Unlabeled Markov (PU Markov) time series classifier based on the Markov property and the "selected completely at random" assumption. Experiments on 14 UCR time series datasets that satisfy the Markov property show that, with labeled rates of positive samples of 0.3, 0.4, and 0.5, the averaged F1 value of the 2nd-order PU Markov classifier improves over that of the Euclidean and Dynamic Time Warping (DTW) distance based Positive 1-Nearest Neighbor (Positive 1-NN) classifiers by 12.5% and 5.2%, 16.1% and 9.4%, and 18.0% and 11.1%, respectively. Moreover, compared with the DTW distance based Positive 1-NN classifier, which outperforms the Euclidean distance based one, the 2nd-order PU Markov classifier takes less time to train and test, with a much shorter testing time in particular.
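The abstract does not give the WKNK formula, so the following is only a minimal sketch of the general pattern it describes: a One-class SVM driven by a precomputed, neighborhood-smoothed document kernel. The smoothing here uses the training documents themselves as the neighborhood source; the actual WKNK instead enriches documents with Wikipedia knowledge, which is not reproduced. The corpora, the `smooth` helper, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM

# Illustrative placeholder corpora: a handful of labeled positive documents
# and some unlabeled/test documents.
positive_texts = ["grain exports rise", "wheat harvest forecast strong",
                  "corn prices fall on large grain supply"]
test_texts = ["soybean crop report released", "central bank raises interest rates"]

vec = TfidfVectorizer()
P = vec.fit_transform(positive_texts).toarray()   # positive training documents
T = vec.transform(test_texts).toarray()           # documents to classify

def smooth(V, ref, k=2):
    """Replace each row of V by the mean of itself and its k most similar
    rows in `ref` (cosine similarity on TF-IDF vectors)."""
    sim = cosine_similarity(V, ref)
    idx = np.argsort(-sim, axis=1)[:, :k]
    return np.vstack([(V[i] + ref[idx[i]].sum(axis=0)) / (k + 1)
                      for i in range(V.shape[0])])

P_s, T_s = smooth(P, P), smooth(T, P)  # neighborhood-smoothed representations
K_train = P_s @ P_s.T                  # precomputed Gram matrix on positives
K_test = T_s @ P_s.T                   # test-vs-train kernel values

ocsvm = OneClassSVM(kernel="precomputed", nu=0.5).fit(K_train)
print(ocsvm.predict(K_test))           # +1 = positive class, -1 = not positive
```

Likewise, a minimal sketch of a 2nd-order Markov classifier trained only on the labeled positive series, assuming a simple equal-width discretization and a user-chosen decision threshold; the class name, the discretization, and the thresholding are assumptions, not the thesis's exact PU Markov formulation.

```python
import numpy as np
from collections import defaultdict

class PUMarkovSketch:
    """Illustrative 2nd-order Markov classifier trained only on positive series."""

    def __init__(self, n_bins=4, alpha=1.0):
        self.n_bins = n_bins                               # symbol alphabet size
        self.alpha = alpha                                 # Laplace smoothing
        self.counts = defaultdict(lambda: np.full(n_bins, alpha))

    def _discretize(self, series):
        s = np.asarray(series, dtype=float)
        edges = np.linspace(s.min(), s.max(), self.n_bins + 1)[1:-1]
        return np.digitize(s, edges)                       # symbols in 0..n_bins-1

    def fit(self, positive_series):
        # Count 2nd-order transitions (previous two symbols -> next symbol).
        for series in positive_series:
            sym = self._discretize(series)
            for t in range(2, len(sym)):
                self.counts[(sym[t - 2], sym[t - 1])][sym[t]] += 1
        return self

    def score(self, series):
        """Length-normalized log-likelihood under the positive-class model."""
        sym = self._discretize(series)
        ll = 0.0
        for t in range(2, len(sym)):
            row = self.counts[(sym[t - 2], sym[t - 1])]
            ll += np.log(row[sym[t]] / row.sum())
        return ll / max(len(sym) - 2, 1)

    def predict(self, series_list, threshold):
        # The decision threshold is a user-supplied assumption here.
        return [int(self.score(s) >= threshold) for s in series_list]
```

Under the "selected completely at random" assumption, the labeled positives are an unbiased sample of the positive class, so a score threshold calibrated on them can transfer to the unlabeled data; how the thesis actually performs this calibration is not stated in the abstract.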
Keywords/Search Tags:Positive Unlabeled Learning, Text Classification, Time Series Classification