Design And Implementation Of Microtopic Text Classification Method Under Weak Supervision

Posted on:2024-06-08

Degree:Master

Type:Thesis

Country:China

Candidate:P D Li

GTID:2558307079972549

Subject:Electronic information

Abstract/Summary:

With the Internet now so widespread, huge amounts of information are shared and received every moment. Text is a basic carrier of information, and classifying it efficiently is particularly important given how large and cluttered text data has become. Text classification methods mainly work by learning latent features from a large amount of labeled text and then automatically classifying unlabeled data. In real-world scenarios, however, labeled data is often difficult to obtain, while unlabeled data is cheap and abundant. In addition, most current text classification models perform poorly on micro-topic classification tasks, where the categories are semantically very similar. To address both the strong reliance on high-quality labeled data and the unsatisfactory classification results on micro-topics, this thesis designs and implements a weakly supervised micro-topic text classification method and develops a system that applies it.

The thesis first surveys traditional text classification methods, applications of weak supervision, and existing micro-topic classification methods, and then improves on these techniques in several respects. Finally, the classification method is packaged into a purpose-built system. Specifically, the method consists of four steps:

(1) Fine-grained label construction. A BERT model is used to predict the coarse-grained category labels and to generate a fixed-size list of highly relevant words, which serves as the basis for selecting micro-topic labels (fine-grained labels).
(2) Initial labeled dataset construction. The weakly supervised set is initialized according to two strategies, "direct inclusion of label names" and "relevant word list coverage". This set is the source of the training data needed by the final classification model.
(3) Labeled dataset expansion. A GPT model is used to generate more micro-topic pseudo-labeled instances. The loss function of the GPT model is modified to exploit the hierarchical coarse-to-fine label information, which improves the quality of the generated micro-topic pseudo-labeled dataset.
(4) Iterative classifier training. The pseudo-labeled dataset is used to train the classifier. The weakly supervised dataset is updated iteratively, which improves the quality of the weakly supervised set and, in turn, the quality of the text classifier.

The effectiveness of the method is verified through several experiments. A comparison against other weakly supervised classification methods shows that the method designed in this thesis achieves better classification results. The final F1-score reaches 86.9%, which is comparable to that of supervised methods. Ablation experiments show that the improvements made to each step have a positive effect on classifier training. Finally, a text classification system is designed and implemented, which provides a source of data for downstream natural language processing applications.
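
As an illustration of step (2), the two initialization strategies could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the whitespace-style tokenizer, the `min_coverage` threshold, and the rule that a text matching several label names falls through to the coverage strategy are all hypothetical choices made for the example.

```python
import re
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weak_label(
    text: str,
    related_words: Dict[str, List[str]],
    min_coverage: float = 0.5,  # hypothetical threshold, not from the thesis
) -> Optional[str]:
    """Assign a micro-topic pseudo-label to `text`, or None if undecided.

    `related_words` maps each micro-topic label name to its word list.
    """
    tokens = set(tokenize(text))

    # Strategy 1: "direct inclusion of label names".
    # Accept only when exactly one label name appears in the text.
    name_hits = [
        label
        for label in related_words
        if tokenize(label) and set(tokenize(label)) <= tokens
    ]
    if len(name_hits) == 1:
        return name_hits[0]

    # Strategy 2: "relevant word list coverage".
    # Pick the label whose word list is best covered by the text
    # (the first label wins on ties).
    best_label, best_coverage = None, 0.0
    for label, words in related_words.items():
        if not words:
            continue
        coverage = sum(1 for w in words if w.lower() in tokens) / len(words)
        if coverage > best_coverage:
            best_label, best_coverage = label, coverage
    return best_label if best_coverage >= min_coverage else None


# Toy word lists, invented for the example.
related = {
    "basketball": ["nba", "dunk", "rebound", "playoffs"],
    "soccer": ["goal", "striker", "penalty", "offside"],
}
print(weak_label("The basketball season starts tonight", related))
print(weak_label("The striker scored a late goal from a penalty", related))
print(weak_label("The weather is nice today", related))
```

With these toy word lists, the first text is labeled by its label name, the second by word-list coverage (3 of 4 words), and the third is left unlabeled. Texts that remain unlabeled would stay in the unlabeled pool for the later expansion and iterative training steps.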

Keywords/Search Tags:

Text Categorization, Weak Supervision, Micro Topic

Related items

1	Topic Categorization Of Short Text Sequences
2	Research On Text Mining Application For Supervision Engineering
3	The Text Categorization And Structure Of Theme Words Network Based On Topic Models
4	Research On Text Processing Technology For Topics Of Hot News
5	Topic Web Mining Algorithms Research And Application
6	A Research On Weibo (Micro-blog) Data And The Construction Of A Blogger Analysis System
7	Research On The Method Of Short Text Categorization Based On Topical Similarity
8	Research On Method Of Short Text Sentiment Classification Based On Weak Supervision
9	Implementation And Application Of An Effective Text Categorization Method Named MDCC
10	Application Of Weak Supervised Learning On Text Classification