Research On PU Text Classification Based On Similarity Method

Posted on: 2019-06-08

Degree: Master

Type: Thesis

Country: China

Candidate: L Zhang

GTID: 2428330548985892

Subject: Software engineering

Abstract/Summary:

Text classification is an important task in data mining, information retrieval, and related research fields. Existing classification methods usually follow a common framework: learn a model from a training set, then use the model to classify new data. This framework rests on two assumptions: that the training data are fully labeled, and that every category to be predicted is covered by the training data. In practical applications, however, the data often look different: only a small set of positive samples is labeled, and the large remainder is unlabeled, which makes traditional classification algorithms ineffective. For this reason, researchers have opened up a new field of study: learning from positive and unlabeled examples, known as Positive Unlabeled (PU) learning. The main research content of this dissertation is as follows:

(1) The concepts and definitions related to PU classification and fake review detection are summarized, together with the main techniques and methods for both tasks and their evaluation criteria.

(2) PU classification suffers from the absence of negative examples, and manual labeling of data is expensive and time consuming. To address this, a similarity-based PU classification algorithm is proposed. The algorithm first evaluates the data distribution in the sample and uses an integration (ensemble) mechanism to extract a reasonable number of positive and negative examples from the unlabeled samples. It then uses similarity to extract representative positive and negative micro-clusters. Once a sufficient number of positive and negative examples has been obtained, the PU problem is converted into a binary classification problem. Experimental results on multiple data sets verify the validity of the method.

(3) In fake review detection, manual labeling of data is inefficient and error-prone. To address this, a fake review detection method based on the PU learning framework is proposed. The method extracts credible reviews and bad reviews from the unlabeled reviews, combines them with a small number of known genuine reviews, and thereby converts fake review detection into a PU classification problem, which it then solves. Experiments verify the feasibility and effectiveness of the method.
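To make the general idea concrete, the following is a minimal sketch of a two-step, similarity-based PU pipeline. It is not the thesis's algorithm: the ensemble extraction and micro-cluster steps are not reproduced, and the bag-of-words vectors, the cosine similarity, the nearest-centroid classifier, the `fraction` parameter, and all function names and sample texts are assumptions chosen for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of count vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step 1 (assumed heuristic): keep the unlabeled texts that are
    least similar to the centroid of the labeled positives."""
    pos_center = centroid(vectorize(p) for p in positives)
    ranked = sorted(unlabeled, key=lambda u: cosine(vectorize(u), pos_center))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def make_classifier(positives, negatives):
    """Step 2: an ordinary binary classifier (nearest centroid by cosine)."""
    pos_center = centroid(vectorize(p) for p in positives)
    neg_center = centroid(vectorize(n) for n in negatives)

    def predict(text):
        v = vectorize(text)
        return cosine(v, pos_center) > cosine(v, neg_center)

    return predict


# Hypothetical toy data: two labeled positives and four unlabeled texts.
positives = ["great phone great battery", "great camera and battery life"]
unlabeled = [
    "battery life is great",
    "stock market fell today",
    "rain expected in the city today",
    "camera quality is great",
]
negatives = reliable_negatives(positives, unlabeled, fraction=0.5)
predict = make_classifier(positives, negatives)
```

Calling `predict` on a new text returns `True` when the text is closer to the positive centroid than to the extracted negatives. In a realistic setting, the nearest-centroid step would likely be replaced by a stronger binary classifier trained on the positives and the extracted negatives.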

Keywords/Search Tags:

PU learning, Text Classification, Unlabeled data, Similarity

Related items

1	Research On Positive Unlabeled Learning Algorithms For Text And Time Series Data
2	A Study On Learning From Positive And Unlabeled Examples
3	Learning with unlabeled data
4	Sentiment Text Classification Research Integrating CNN And Bi-LSTM Deep Learning Algorithms
5	Research On Positive Unlabeled Learning Algorithms For Graph Data Classification And System Implementation
6	Using unlabeled data to improve text classification
7	Research On Partially Supervised Classification
8	Bayesian Classifier For Positive Unlabeled Learning With Uncertainty
9	Based On The Positive And Unlabeled Samples, Semi-supervised Classification
10	Research On Key Techniques In Text Mining