
Research On Semi-supervised Topic Model For Text Classification

Posted on: 2022-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X L Mao
Full Text: PDF
GTID: 2518306491953099
Subject: Master of Engineering
Abstract/Summary:
Supervised classification algorithms usually achieve good results on data sets with sufficient, correctly labeled samples. In practical applications, however, labeled samples are often scarce, training data is difficult to collect, and labeling is expensive. For scenarios in which the target domain contains only a small amount of labeled data while the source domain contains a large amount of unlabeled data, cross-domain semi-supervised learning can overcome the labeling bottleneck. However, the data in the target domain and the auxiliary source domain do not satisfy the independent and identically distributed assumption, which causes the learned classification model to deviate from the topics of the target domain and thereby reduces the accuracy of the semi-supervised classifier. Aiming at the setting where a small number of labeled samples coexist with a large number of unlabeled samples, this thesis combines topic models with semi-supervised learning methods to study semi-supervised topic models for text classification. In view of the above problems, the main research contents are as follows:

(1) Combining the supervised topic model SLDA with semi-supervised learning, a semi-supervised topic model s-SLDA (semi-Supervised Latent Dirichlet Allocation) is proposed, which is trained on a small number of labeled documents from the target domain and unlabeled documents from the source domain. In addition to the original SLDA parameters (the document-topic, topic-word, and topic-label distribution parameters), the probabilistic graphical model of s-SLDA introduces three new parameters that define the document-topic distribution, topic-word distribution, and topic-pseudo-label distribution of the source domain.

(2) A latent topic sampling method, s-SLDA-Gibbs, is proposed. It samples latent topics for documents in the target domain and the source domain under different constraints: labeled documents in the target domain are sampled according to their label categories, while unlabeled documents in the source domain are sampled under pseudo-label constraints (an illustrative sketch of this idea follows the abstract). The parameters of the s-SLDA topic model are then estimated from the sampled assignments.

(3) Based on the s-SLDA topic model, a new semi-supervised text categorization method, s-SLDA-TC (s-SLDA Text Categorization), is proposed. Comparative experiments against other methods on the 20newsgroup English data set and the Sogou Chinese data set verify the effectiveness of the s-SLDA topic model in the cross-domain setting. The experimental results show that the proposed s-SLDA-TC method can effectively use source-domain information to improve the performance of semi-supervised text classification.
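To make the constrained-sampling idea in (2) concrete, the following is a minimal sketch of one collapsed-Gibbs sweep in which each document may only draw topics permitted by its label (target domain) or its pseudo-label (source domain). It is an illustration of the general idea only, not the thesis's s-SLDA-Gibbs implementation: all names (constrained_gibbs_pass, topic_of_label, n_dk, n_kw, n_k, alpha, beta) are assumptions introduced here, and the full conditional shown is the standard LDA collapsed-Gibbs form rather than the s-SLDA one.

import numpy as np

def constrained_gibbs_pass(docs, z, labels, pseudo_labels, topic_of_label,
                           n_dk, n_kw, n_k, alpha, beta, V, K, rng):
    """One collapsed-Gibbs sweep over all tokens (illustrative sketch).

    For labeled target-domain documents the topic assignment is restricted
    to the topics associated with the document's label; for unlabeled
    source-domain documents the restriction comes from a pseudo-label.
    Parameter names here are hypothetical, not those of s-SLDA.
    """
    for d, doc in enumerate(docs):
        # Topics allowed for this document, derived from its (pseudo-)label.
        lab = labels.get(d, pseudo_labels.get(d))
        allowed = topic_of_label[lab] if lab is not None else np.arange(K)
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current assignment from the count tables.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Standard collapsed-Gibbs conditional, evaluated only on allowed topics.
            p = (n_dk[d, allowed] + alpha) * (n_kw[allowed, w] + beta) / (n_k[allowed] + V * beta)
            p /= p.sum()
            k_new = allowed[rng.choice(len(allowed), p=p)]
            # Record the new assignment.
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
    return z

In this sketch, target-domain and source-domain documents differ only in where the constraint comes from (a true label versus a pseudo-label), mirroring the division described in item (2) of the abstract.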
Keywords/Search Tags: Semi-supervised learning, Topic model, Latent Dirichlet Allocation, Gibbs sampling, Text classification