
Research On The Text Classification Method Based On Transfer Topic Model

Posted on: 2022-08-30    Degree: Master    Type: Thesis
Country: China    Candidate: H Zheng    Full Text: PDF
GTID: 2518306491953229    Subject: Master of Engineering
Abstract/Summary:
When labeled data is sufficient, existing text classification methods can achieve relatively good results. In recent years, the development of deep-learning-based text classification methods has enabled text classification performance to meet the application requirements of industry; in particular, two-stage methods based on pre-training plus fine-tuning have pushed classification performance to an unprecedented level. However, as classification performance improves, model complexity also grows rapidly, and so do the requirements on the scale and quality of the labeled dataset, which greatly limits the applicable scenarios. In practice, the training data for a text classification task is often limited; in some specific fields in particular, data collection and labeling are difficult, building a high-quality labeled training set is expensive, and the lack of labeled data is often accompanied by imbalance between categories. How to obtain good classification results with only a small amount of labeled data is therefore a hot issue in current research. This thesis studies scenarios with only a small amount of labeled data and severe imbalance between the labeled categories. The main research contents are as follows:

(1) When the target domain lacks enough labeled data, transfer learning uses the labeled data of a related source domain to help improve learning performance in the target domain. However, the data of the target domain and the source domain are usually not independently and identically distributed, which easily leads to "negative transfer". Building on the supervised topic model (Supervised LDA, SLDA), this thesis integrates transfer learning to propose a transfer topic model with shared topic knowledge (Transfer SLDA, Tr-SLDA) and a new Tr-SLDA-Gibbs topic sampling method. Under the constraint of the category labels, different sampling strategies are adopted for the words of documents from different domains, and the number of topics does not need to be specified in advance. By letting the source domain and the target domain share a latent topic space, Tr-SLDA discovers the semantic associations between the shared latent topics and the categories of the different domains so as to transfer knowledge from the source domain, which effectively alleviates the "negative transfer" problem. On top of the Tr-SLDA model, the Tr-SLDA-TC (Tr-SLDA Text Categorization) classification method is proposed. Comparative experiments show that this method can effectively improve classification performance in the target domain by exploiting knowledge of the source domain.
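To make the label-constrained, shared-topic sampling idea concrete, the following is a minimal Python sketch of a collapsed-Gibbs-style update over a topic space shared by labeled documents, with a class-topic count matrix playing the role of the label constraint. The function name `gibbs_pass`, the fixed number of topics `K`, and all hyperparameters are illustrative assumptions; the thesis's Tr-SLDA-Gibbs algorithm additionally uses domain-specific sampling strategies and does not require the number of topics to be specified.

```python
import numpy as np

def gibbs_pass(docs, labels, K, C, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed-Gibbs-style sampler over a topic space shared by all documents.

    docs:   list of documents, each a list of word ids in [0, V)
    labels: class id per document in [0, C), or -1 for unlabeled documents
    K:      number of topics (fixed here for simplicity)
    """
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # topic assignment per token
    ndk = np.zeros((len(docs), K))                          # document-topic counts
    nkw = np.zeros((K, V))                                  # topic-word counts
    nk = np.zeros(K)                                        # topic totals
    nck = np.ones((C, K))                                   # class-topic associations (smoothed)

    # initialise the count tables from the random assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
            if labels[d] >= 0:
                nck[labels[d], k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current token from all counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                if labels[d] >= 0:
                    nck[labels[d], k] -= 1
                # standard LDA term: document-topic prior times topic-word likelihood
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                # label constraint: bias topics toward those associated with the class
                if labels[d] >= 0:
                    p *= nck[labels[d]] / nck[labels[d]].sum()
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
                if labels[d] >= 0:
                    nck[labels[d], k] += 1
    return ndk, nkw, nck
```

The resulting topic-word and class-topic counts could then feed a classifier for target-domain documents; in the thesis, Tr-SLDA-TC plays that role.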
(2) To address negative transfer more effectively, a hierarchical transfer topic model (transfer SLDA, tSLDA) is proposed, which combines the topic model with transfer learning. The tSLDA model introduces new parameters to capture the semantic association between cross-domain shared latent topics and shared hierarchical categories. A latent topic sampling algorithm, tSLDA-Gibbs, is proposed that uses the constraints of the different hierarchical categories to identify the semantic mapping between shared latent topics and the different category spaces. A transferability index and an evaluation method for the tSLDA topic model are also proposed; evaluating transferability at the model level helps avoid the negative transfer problem and thereby improves the generalization ability of the model. Finally, a new transfer learning method based on the tSLDA model is proposed. Experiments show that the tSLDA transfer topic model can effectively identify the semantic mapping between topics and the different category spaces, thereby improving classification performance.

(3) In transfer learning, if the source domain is not chosen properly, "negative transfer" occurs and degrades the performance of the target-domain task, and it is sometimes difficult to find an appropriate source domain to assist in modeling the target-domain task at all. In that case the model must fully exploit the information in the limited labeled data to improve its generalization ability. This thesis therefore considers the classification problem from the perspective of individual words and proposes a word-granularity classification model and method based on SLDA (WL-TC): it establishes the correlation between words and categories, infers the labels of the words in a test document, and finally classifies the test document by aggregating its word labels into a document category. Given the broad impact of deep learning across machine learning, the thesis further combines the WL-TC idea with word embedding representations to propose a three-stage text classification framework (TSTC). Experiments verify that under small-sample data the classification performance of WL-TC and TSTC is better than that of all compared classification methods and can effectively mitigate the performance degradation caused by imbalanced categories; WL-TC still achieves satisfactory results even under extreme class imbalance.
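As a rough illustration of the word-granularity idea in (3), here is a small Python sketch that estimates word-category associations from a handful of labeled documents and then labels a test document by voting over its word labels. The function names, the count-based smoothing, and the majority vote are assumptions made for illustration; WL-TC itself derives word-category correlations through the SLDA topic model rather than raw counts.

```python
from collections import Counter, defaultdict

def train_word_class(docs, labels, num_classes, smooth=1.0):
    """Estimate P(class | word) from labeled documents with additive smoothing."""
    counts = defaultdict(lambda: [smooth] * num_classes)
    for doc, c in zip(docs, labels):
        for w in doc:
            counts[w][c] += 1
    return {w: [v / sum(cs) for v in cs] for w, cs in counts.items()}

def classify(doc, word_class, num_classes):
    """Label each known word with its most probable class, then take a majority vote."""
    word_labels = [max(range(num_classes), key=lambda c: word_class[w][c])
                   for w in doc if w in word_class]
    if not word_labels:
        return 0  # fall back to a default class when no word is known
    return Counter(word_labels).most_common(1)[0][0]

# toy usage: two tiny labeled documents, one per class
train = [["cheap", "flight", "hotel"], ["goal", "match", "league"]]
y = [0, 1]
wc = train_word_class(train, y, num_classes=2)
print(classify(["match", "goal", "ticket"], wc, num_classes=2))  # -> 1
```

Because the document decision is aggregated from per-word evidence, even a few labeled documents per class contribute many word-level training signals, which is one way to read the abstract's claim that word-granularity classification remains usable under small-sample and imbalanced conditions.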
Keywords/Search Tags: Text classification, Transfer learning, Topic model, Small sample learning, Class imbalance