Research On LDA-based Multi-annotated Text Classification

Posted on: 2020-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: T W Hou
GTID: 2428330572498913
Subject: Architecture and civil engineering
Abstract/Summary:
The development of Internet technology has accelerated the arrival of the big data era, and the rapid acquisition and analysis of big data has become an indispensable capability of contemporary Internet applications. Traditional supervised learning algorithms require a large amount of expert-annotated data to train high-performance models, but obtaining expert annotations is often time-consuming and labor-intensive. Crowdsourcing systems use the wisdom of the crowd to label data at lower cost, and crowd workers can respond quickly to tasks, so such systems are now widely used. However, crowd workers differ in background knowledge and working ability, so the collected labels are noisy and cannot directly represent the true labels. A multi-annotator classifier is one way to integrate the multiple labels. This thesis studies multi-annotator classifiers with the aims of improving their accuracy and reducing labeling cost. The main work is as follows:

(1) A Deep AUC weighted Naive Bayes algorithm is proposed. Because of its strong independence assumption and its assumption that all features are equally important, the Naive Bayes algorithm tends to misclassify samples from minority classes as belonging to majority classes. Because features differ in their importance for classification, the area under the ROC curve (AUC) is used to weight each feature in order to improve the performance of the algorithm.

(2) A Feature Integration method is proposed. During labeling, the accuracy of the annotators is often influenced by the topics in the text. The text features are constructed by combining topic model features with word2vec word vector features, and a Gaussian process multi-annotator classifier is then trained on them to obtain a higher-performance model. The new model infers more precise labels for new unlabeled texts.

(3) A method combining active learning with a multi-annotator classifier is proposed. Active learning and crowdsourcing systems share the goal of reducing cost. Active learning first initializes the model parameters with a small number of labeled samples, then uses an appropriate rule to pick the most valuable samples for manual labeling, and repeats this step until the termination condition is met. The crowdsourcing system supplies the manual labels for the selected samples. Following active learning theory, in each iteration the most suitable annotator is chosen to label the sample.

Experiments with the three methods are carried out on two different datasets. The results show that Deep AUC weighted Naive Bayes improves classification performance on skewed samples, that the Feature Integration method further improves the performance of the Gaussian process multi-annotator classifier, and that labeling samples selected by active learning achieves higher accuracy than labeling randomly selected samples.
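One iteration of the selection step in (3) can be illustrated with a minimal sketch. The abstract does not state which selection rule or annotator criterion the thesis uses, so the choices here are assumptions: samples are ranked by predictive entropy (a common uncertainty-sampling rule), and the annotator with the highest estimated reliability is chosen. The function names, probabilities, and reliability scores are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_sample(pool_probs):
    """Index of the unlabeled sample whose prediction is most uncertain."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


def pick_annotator(reliability):
    """Annotator with the highest estimated reliability (assumed criterion)."""
    return max(reliability, key=reliability.get)


# Hypothetical predicted class probabilities for three unlabeled texts.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
# Hypothetical per-annotator reliability estimates.
reliability = {"annotator_a": 0.62, "annotator_b": 0.81, "annotator_c": 0.74}

query_index = pick_sample(pool_probs)
chosen_annotator = pick_annotator(reliability)
print(query_index, chosen_annotator)
```

In a full loop, the label returned by the chosen annotator would be added to the training set, the classifier retrained, and the step repeated until the termination condition is met.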
Keywords/Search Tags: Crowdsourcing, Text classification, Feature Integration, Active learning