Research On LDA-based Multi-annotated Text Classification

Posted on:2020-05-21

Degree:Master

Type:Thesis

Country:China

Candidate:T W Hou

GTID:2428330572498913

Subject:Architecture and civil engineering

Abstract/Summary:

The development of Internet technology has accelerated the arrival of the big data era, and the rapid acquisition and analysis of big data has become an indispensable capability of contemporary Internet applications. Traditional supervised learning algorithms require a large amount of expert-annotated data to train high-performance models, but obtaining expert annotations is often time-consuming and labor-intensive. Crowdsourcing systems use the wisdom of the crowd to label data at lower cost, and crowd workers respond quickly to tasks, so such systems are now widely used. However, crowd workers differ in knowledge background and working ability, so the labels collected from multiple annotators are noisy and cannot directly represent the true labels. A multi-annotator classifier is one way to integrate these multiple labels. This thesis studies the multi-annotator classifier with the aims of improving its accuracy and saving labeling cost. The main work is as follows:

(1) A Deep AUC weighted Naive Bayes algorithm is proposed. Because of its strong independence assumption and its assumption that all features are equally important, Naive Bayes tends to misclassify samples from minority classes as belonging to majority classes. Because not all features are equally important for classification, the area under the sensitivity curve (the AUC value) is used to weight each feature and thereby improve the performance of the algorithm.

(2) A Feature Integration method is proposed. During labeling, the accuracy of the annotators is often influenced by the topics in the text. Text features are constructed by blending topic model features with word2vec word vector features, and a Gaussian process multi-annotator classifier is then trained on them to obtain a higher-performance model. The new model infers more precise labels for new unlabeled texts.

(3) A method combining active learning with the multi-annotator classifier is proposed. Active learning and crowdsourcing systems share the goal of reducing cost. Active learning first initializes the model parameters with a small number of labeled samples, then uses an appropriate rule to pick the most valuable samples for manual labeling, and repeats this step until the termination condition is met. The crowdsourcing system provides the manual labeling for the picked samples. Following active learning theory, in each iteration the most suitable annotator is chosen to label the selected sample.

Experiments with the three methods are carried out on two different datasets. The results show that Deep AUC weighted Naive Bayes improves classification performance on skewed samples; that the Feature Integration method further improves the performance of the Gaussian process multi-annotator classifier; and that active learning sample labeling achieves higher accuracy than random sample labeling.
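
The selection step of the active learning method in (3) can be illustrated with a minimal sketch. The abstract does not state which selection rule or annotator-choice rule the thesis uses, so everything here is an assumption: samples are ranked by the entropy of the model's predicted class probabilities (a common uncertainty-sampling rule), and the annotator with the highest estimated accuracy is chosen. The names `pick_sample`, `pick_annotator`, `pool_probs` and `reliability`, and all of the numbers, are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_sample(pool_probs):
    """Return the id of the unlabeled sample with the most uncertain prediction.

    Assumed rule: highest predictive entropy (uncertainty sampling).
    """
    return max(pool_probs, key=lambda sid: entropy(pool_probs[sid]))


def pick_annotator(reliability):
    """Return the annotator with the highest estimated accuracy (assumed rule)."""
    return max(reliability, key=reliability.get)


# Hypothetical model outputs for three unlabeled texts in a binary task.
pool_probs = {
    "doc_a": [0.95, 0.05],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.80, 0.20],
}

# Hypothetical per-annotator accuracy estimates.
reliability = {"ann_1": 0.62, "ann_2": 0.88, "ann_3": 0.75}

query = pick_sample(pool_probs)          # sample to send for labeling
annotator = pick_annotator(reliability)  # crowd worker asked to label it
print(query, annotator)
```

In a full loop, the returned label would be added to the training set, the classifier retrained, and the two functions called again until the termination condition is met.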

Keywords/Search Tags:

Crowdsourcing, Text classification, Feature Integration, Active learning

Related items

1	Research On Chinese Text Classification Algorithm Based On Active Learning Approach
2	Research On Key Techniques And Applications In Text Classification
3	Research On Feature Description And Classifier Construction Algorithm In Chinese Text Classification
4	Research On Text Classification Based On Active Self-Paced Learning
5	The Design And Implement Of A Mongolian Text Classifier Based On Active Learning SVM
6	Design And Implementation Of Text Classification System Based On Active Learning
7	Research And Application Of Chinese Text Classification Based On Active Learning
8	Chinese Text Classification Based On Active Learning
9	Research On Text Classification Of Active Learning And Its Application
10	Active Learning and Crowdsourcing for Machine Translation in Low Resource Scenarios