
Research On Positive Unlabeled Learning Algorithms For Text And Time Series Data

Posted on: 2016-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: D K Zhang
Full Text: PDF
GTID: 2308330461466596
Subject: Computer software and theory
Abstract/Summary:
As a subtask of data mining, classification analysis has a wide range of applications in people's production and daily life. Traditional classification algorithms require users to provide labeled training samples. However, because of the cost in human labor, money, and time, users usually label only a small number of samples from the categories they are interested in, treating them as positive samples. How to build a classifier from these few positive samples and a large number of unlabeled samples is the subject of Positive Unlabeled Learning (PU Learning). Because of its great practical importance, PU Learning has attracted wide attention from researchers. This thesis proposes new PU Learning algorithms for text data and for time series data.

(1) The proposed PU Learning algorithm for text data targets the scenario in which the training data contain only a small number of positive samples. Drawing on the rich knowledge in Wikipedia and the idea of the Neighborhood Kernel, it devises the Wikipedia Knowledge based Neighborhood Kernel (WKNK). With the threshold set to 0.25, experiments on Reuters-21578 show that the averaged F1 value of the One-class Support Vector Machine (One-class SVM) with WKNK improves by 10.1% over that of the One-class SVM with a linear kernel; on 20-Newsgroups, the corresponding improvement is 54.8%. These results indicate that WKNK effectively overcomes the difficulty caused by the lack of training data in one-class text classification and improves the classification performance of the One-class SVM.

(2) The proposed PU Learning algorithm for time series data devises the Positive Unlabeled Markov (PU Markov) time series classifier based on the Markov property and the "selected completely at random" assumption. Experiments on 14 UCR time series datasets that satisfy the Markov property show that, with labeled rates of positive samples of 0.3, 0.4, and 0.5, the averaged F1 value of the 2nd-order PU Markov classifier improves over that of the Euclidean and Dynamic Time Warping (DTW) distance based Positive 1-Nearest Neighbor (Positive 1-NN) classifiers by 12.5% and 5.2%, 16.1% and 9.4%, and 18.0% and 11.1%, respectively. Moreover, compared with the DTW distance based Positive 1-NN classifier, which outperforms the Euclidean distance based one, the 2nd-order PU Markov classifier takes less time to train and test, with a much shorter testing time in particular.
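The abstract does not give the WKNK formula, so the following is only a minimal sketch of the general pattern it describes: a One-class SVM driven by a precomputed, neighborhood-smoothed document kernel. The smoothing here uses the training documents themselves as the neighborhood source; the actual WKNK instead enriches documents with Wikipedia knowledge, which is not reproduced. The corpora, the `smooth` helper, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM

# Illustrative placeholder corpora: a handful of labeled positive documents
# and some unlabeled/test documents.
positive_texts = ["grain exports rise", "wheat harvest forecast strong",
                  "corn prices fall on large grain supply"]
test_texts = ["soybean crop report released", "central bank raises interest rates"]

vec = TfidfVectorizer()
P = vec.fit_transform(positive_texts).toarray()   # positive training documents
T = vec.transform(test_texts).toarray()           # documents to classify

def smooth(V, ref, k=2):
    """Replace each row of V by the mean of itself and its k most similar
    rows in `ref` (cosine similarity on TF-IDF vectors)."""
    sim = cosine_similarity(V, ref)
    idx = np.argsort(-sim, axis=1)[:, :k]
    return np.vstack([(V[i] + ref[idx[i]].sum(axis=0)) / (k + 1)
                      for i in range(V.shape[0])])

P_s, T_s = smooth(P, P), smooth(T, P)  # neighborhood-smoothed representations
K_train = P_s @ P_s.T                  # precomputed Gram matrix on positives
K_test = T_s @ P_s.T                   # test-vs-train kernel values

ocsvm = OneClassSVM(kernel="precomputed", nu=0.5).fit(K_train)
print(ocsvm.predict(K_test))           # +1 = positive class, -1 = not positive
```

Likewise, a minimal sketch of a 2nd-order Markov classifier trained only on the labeled positive series, assuming a simple equal-width discretization and a user-chosen decision threshold; the class name, the discretization, and the thresholding are assumptions, not the thesis's exact PU Markov formulation.

```python
import numpy as np
from collections import defaultdict

class PUMarkovSketch:
    """Illustrative 2nd-order Markov classifier trained only on positive series."""

    def __init__(self, n_bins=4, alpha=1.0):
        self.n_bins = n_bins                               # symbol alphabet size
        self.alpha = alpha                                 # Laplace smoothing
        self.counts = defaultdict(lambda: np.full(n_bins, alpha))

    def _discretize(self, series):
        s = np.asarray(series, dtype=float)
        edges = np.linspace(s.min(), s.max(), self.n_bins + 1)[1:-1]
        return np.digitize(s, edges)                       # symbols in 0..n_bins-1

    def fit(self, positive_series):
        # Count 2nd-order transitions (previous two symbols -> next symbol).
        for series in positive_series:
            sym = self._discretize(series)
            for t in range(2, len(sym)):
                self.counts[(sym[t - 2], sym[t - 1])][sym[t]] += 1
        return self

    def score(self, series):
        """Length-normalized log-likelihood under the positive-class model."""
        sym = self._discretize(series)
        ll = 0.0
        for t in range(2, len(sym)):
            row = self.counts[(sym[t - 2], sym[t - 1])]
            ll += np.log(row[sym[t]] / row.sum())
        return ll / max(len(sym) - 2, 1)

    def predict(self, series_list, threshold):
        # The decision threshold is a user-supplied assumption here.
        return [int(self.score(s) >= threshold) for s in series_list]
```

Under the "selected completely at random" assumption, the labeled positives are an unbiased sample of the positive class, so a score threshold calibrated on them can transfer to the unlabeled data; how the thesis actually performs this calibration is not stated in the abstract.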
Keywords/Search Tags:Positive Unlabeled Learning, Text Classification, Time Series Classification