
Research On PU Text Classification Based On Similarity Method

Posted on: 2019-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
GTID: 2428330548985892
Subject: Software engineering
Abstract/Summary:
Text classification is an important task in data mining, information retrieval, and related research fields. Existing classification methods usually follow a common framework: learn a model from a training set, then use that model to classify new data. This framework assumes that the training data are fully labeled and that every class to be predicted is represented in the training data. In practical applications, however, it is common that only labeled positive examples are available while the large remainder of the data is unlabeled, which makes traditional classification algorithms inapplicable. To address this setting, researchers have opened up a new field of study: learning from positive and unlabeled examples, known as PU learning. The main research content of this thesis is as follows:

(1) The concepts and definitions related to PU classification and fake review detection are summarized, together with the techniques, methods, and evaluation criteria used for both problems.

(2) In PU classification, negative examples are absent and manual labeling is expensive and time-consuming. To address this, a similarity-based PU classification algorithm is proposed. The algorithm first estimates the data distribution of the samples and uses an ensemble mechanism to extract a reasonable number of positive and negative examples from the unlabeled samples. It then uses similarity to extract representative positive and negative micro-clusters. Once enough positive and negative examples have been obtained, the PU problem is converted into a binary classification problem. Experiments on multiple data sets verify the validity of the method.

(3) In fake review detection, manual labeling is inefficient and error-prone. A fake review detection method based on the PU learning framework is proposed. The method extracts credible reviews and fake reviews from unlabeled reviews, combines them with a small number of existing genuine reviews, and thereby converts fake review detection into a PU classification problem. Experiments verify the feasibility and effectiveness of the method.
Keywords/Search Tags: PU learning, Text Classification, Unlabeled data, Similarity
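
The general idea in item (2) of the abstract, using similarity to pull likely negatives out of the unlabeled data and then treating the task as binary classification, can be illustrated with a minimal sketch. This is not the algorithm of the thesis: it omits the ensemble mechanism and the micro-clusters, substitutes plain bag-of-words counts and a nearest-centroid rule, and every name in it (`vectorize`, `pu_classify`, `neg_fraction`) as well as the toy data is a hypothetical choice made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for richer text features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a non-empty list of term-weight vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def pu_classify(positives, unlabeled, neg_fraction=0.5):
    """Two-step PU sketch; returns a function mapping text to True (positive) or False."""
    pos_centroid = centroid([vectorize(t) for t in positives])
    unl_vecs = [vectorize(t) for t in unlabeled]

    # Step 1: treat the unlabeled documents that are least similar to the
    # positive centroid as likely negatives (neg_fraction is an assumed knob).
    ranked = sorted(unl_vecs, key=lambda v: cosine(v, pos_centroid))
    k = max(1, int(len(ranked) * neg_fraction))
    neg_centroid = centroid(ranked[:k])

    # Step 2: with both classes represented, classify as an ordinary binary
    # problem; here by nearest centroid, with ties going to the positive class.
    def predict(text):
        v = vectorize(text)
        return cosine(v, pos_centroid) >= cosine(v, neg_centroid)

    return predict


# Toy data, invented for this example only.
positives = ["great phone battery camera", "phone camera screen quality"]
unlabeled = [
    "new phone battery life",
    "pasta recipe tomato sauce",
    "bake bread flour yeast",
    "camera lens phone review",
]
predict = pu_classify(positives, unlabeled)
print(predict("phone camera"))         # True
print(predict("tomato bread recipe"))  # False
```

In this sketch the quality of the result depends on how cleanly the least-similar unlabeled documents are in fact negative, which is why the choice of `neg_fraction` matters; the thesis describes a more careful extraction step for this purpose.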