| In the deep learning era, natural language processing tasks often rely on large-scale labeled data for supervised training, but manual annotation is expensive and time-consuming because it requires specialized annotators and domain experts are scarce. Some studies annotate text automatically using domain-specific corpora and expert rules, but many niche research fields have no large-scale structured corpus, and accurate expert rules are difficult to formulate. Other studies use text generation models to expand the training data and improve automatic annotation under weak supervision, but generation models themselves perform poorly when training data is limited. To address these problems, this paper proposes ACTRAnno, an auxiliary annotation model based on attribute-controlled text representation generation, aimed mainly at annotation for text classification. ACTRAnno labels data in incremental batches: in each round, all unlabeled data are pre-labeled, and a fixed-size batch of pre-labeled samples is then selected for manual labeling. ACTRAnno uses an attribute-controlled text representation generation model to expand the training data. Rather than generating text directly, this model generates text representation vectors that serve as input to the downstream task. The downstream classification model shares some network parameters with the generation model, which reduces error propagation and makes generation more effective as data augmentation for the downstream text classification task. Active learning is used to build the most effective training set incrementally for training the overall annotation model, further improving its accuracy. Because uncertainty-based selection strategies in active learning are unreliable in early rounds, when little labeled data is available, a two-stage active learning model is proposed, providing a scheme for building a more accurate text annotation model under weak supervision. Experiments on text classification with the IMDB, SST-2, YELP-2, and AG News datasets show that, when fewer than 1000 training samples are available, model accuracy improves by an average of 2.00% to 3.35% compared with no data augmentation, and improves by a further 1.4% with the two-stage active learning strategy. |
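
The batch-incremental labeling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say what the two stages are, so the sketch assumes a random warm-up stage followed by least-confidence sampling. The names `select_batch`, `annotate`, `pre_label`, `oracle`, `warmup`, and `rounds` are all hypothetical, and the representation-generation model is left behind the `pre_label` callback.

```python
import random
from typing import Callable, List, Sequence, Tuple

Probs = Sequence[float]  # class probabilities predicted for one sample


def select_batch(probs: Sequence[Probs], batch_size: int, n_labeled: int,
                 warmup: int, rng: random.Random) -> List[int]:
    """Pick positions of unlabeled samples to send for manual labeling."""
    positions = list(range(len(probs)))
    if n_labeled < warmup:
        # Stage 1 (assumed): too little data to trust model uncertainty,
        # so sample uniformly at random.
        return rng.sample(positions, min(batch_size, len(positions)))
    # Stage 2 (assumed): least-confidence sampling, i.e. take the samples
    # whose top predicted probability is lowest.
    positions.sort(key=lambda i: max(probs[i]))
    return positions[:batch_size]


def annotate(pool: List[str],
             pre_label: Callable[[List[Tuple[str, int]], List[str]], List[Probs]],
             oracle: Callable[[str], int],
             batch_size: int, warmup: int, rounds: int,
             seed: int = 0) -> List[Tuple[str, int]]:
    """Run a fixed number of labeling rounds and return the labeled set."""
    rng = random.Random(seed)
    labeled: List[Tuple[str, int]] = []
    unlabeled = list(pool)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Pre-label every remaining sample with the current model
        # (hypothetical callback: retrain on `labeled`, predict on `unlabeled`).
        probs = pre_label(labeled, unlabeled)
        chosen = set(select_batch(probs, batch_size, len(labeled), warmup, rng))
        # Manual labeling of the selected batch (hypothetical oracle).
        labeled.extend((unlabeled[i], oracle(unlabeled[i])) for i in chosen)
        unlabeled = [t for i, t in enumerate(unlabeled) if i not in chosen]
    return labeled
```

In this sketch `warmup` is the labeled-set size at which selection switches from random to uncertainty-based; how the paper actually defines that switch is not stated in the abstract.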