
Research On Text Classification With Noisy Labels

Posted on: 2022-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: P J Yang
Full Text: PDF
GTID: 2518306569494774
Subject: Computer Science and Technology
Abstract/Summary:
With the continued development of deep learning, the performance of text classification systems has improved dramatically. In the financial field, text classification systems that classify financial events reported in daily news can alert users and enterprises in time to reduce their losses. Many websites, such as film forums, delivery platforms, and Taobao, host large volumes of user comments. By classifying these data, enterprises can understand users' preferences, improve their products, and provide users with better services. Training these systems requires cleanly labeled data, but such data is very difficult to obtain in large quantities, so noisy labels are a common problem in text classification. How to eliminate the negative influence of noisy labels and train deep learning models effectively is therefore an important question, and it is the main research topic of this dissertation.

Research on text classification has paid little attention to noisy labels, and as a result there is no public dataset for text classification with noisy labels. This dissertation proposes two methods for constructing such datasets. The first artificially corrupts the labels of a clean dataset, using a uniform flip method to construct datasets with different noise ratios. The second collects a user comment dataset and, based on the characteristics of the data, constructs a text classification dataset with real-world noisy labels.

To address the problems of existing methods for learning with noisy labels, a noisy-label text classification model based on collaborative training is proposed. This method builds two heterogeneous classifiers, each of which selects clean samples for training the other, reducing the impact of label noise during training. The number of clean samples selected is determined by a rate function, which ensures that the model can learn more from clean samples in the first few rounds of training. Experimental results show that the collaborative training method achieves better accuracy than other methods in high-noise settings.

The collaborative training method trains only on the selected clean data. However, the samples filtered out as noisy also contain a lot of useful information, and discarding them inevitably causes information loss. To solve this problem, this dissertation proposes a noisy text classification method combined with semi-supervised tasks, in which the filtered data is added back to the training process as unlabeled data. The semi-supervised tasks used are a pseudo-label task and a consistency learning task. Experimental results show that the semi-supervised tasks alleviate the information loss caused by the collaborative training method and improve the accuracy of the collaborative training model.

Finally, the application of the research results in an enterprise intelligent credit risk control system is shown.
Keywords/Search Tags: text classification, noisy labels, co-training, semi-supervised learning, consistency training
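
The abstract names three mechanisms without giving formulas: uniform label flipping, a rate function that sets how many samples count as clean, and clean-sample selection between two classifiers. The sketch below illustrates one plausible reading of each, using NumPy only. The function names, the linear schedule, the `warmup_epochs` parameter, and the small-loss selection criterion are all assumptions borrowed from common practice in learning with noisy labels; the abstract does not specify any of them, so this is not the thesis's actual implementation.

```python
import numpy as np


def uniform_flip(labels, num_classes, noise_ratio, seed=0):
    """Flip a fraction `noise_ratio` of labels to a different, uniformly chosen class.

    Hypothetical helper: the thesis's exact flipping procedure is not given.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new class differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy


def keep_rate(epoch, noise_ratio, warmup_epochs=10):
    """Fraction of a batch treated as clean at a given epoch.

    Assumed schedule: decays linearly from 1.0 to 1 - noise_ratio over
    `warmup_epochs`, then stays constant.
    """
    return 1.0 - noise_ratio * min(epoch / warmup_epochs, 1.0)


def select_small_loss(losses, rate):
    """Indices of the round(rate * n) samples with the smallest loss.

    Assumed selection criterion: small loss is taken as a proxy for a clean label.
    """
    losses = np.asarray(losses)
    n_keep = max(1, int(round(rate * len(losses))))
    return np.argsort(losses)[:n_keep]
```

In a two-classifier setup, each classifier would compute per-sample losses on a batch, call `select_small_loss` with the current `keep_rate`, and pass the selected indices to the other classifier for its update step. Under this reading, the samples left unselected are the ones that the semi-supervised stage would reuse as unlabeled data.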