
Research On Semi-supervised Topic Model For Text Classification

Posted on: 2022-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X L Mao
Full Text: PDF
GTID: 2518306491953099
Subject: Master of Engineering
Abstract/Summary:
Supervised classification algorithms usually achieve good results on data sets with sufficient, correctly labeled samples. In practical applications, however, labeled samples are often scarce, training data is difficult to collect, and labeling is expensive. For scenarios in which the target domain contains only a small amount of labeled data while the source domain contains a large amount of unlabeled data, cross-domain semi-supervised learning can overcome the labeling bottleneck. However, the data in the target domain and the auxiliary source domain do not satisfy the independent and identically distributed assumption, which causes the learned classification model to deviate from the topics of the target domain and thereby reduces the accuracy of the semi-supervised classifier. Aiming at the setting where a small number of labeled samples coexist with a large number of unlabeled samples, this thesis combines topic models with semi-supervised learning methods to study semi-supervised topic models for text classification. In view of the above problems, the main research contents are as follows:

(1) Combining the supervised topic model SLDA with semi-supervised learning, a semi-supervised topic model s-SLDA (semi-Supervised Latent Dirichlet Allocation) is proposed, which is trained on a small number of labeled documents from the target domain and unlabeled documents from the source domain. In addition to the original SLDA parameters (the document-topic, topic-word, and topic-label distribution parameters), the probabilistic graphical model of s-SLDA introduces three new parameters that define the document-topic distribution, topic-word distribution, and topic-pseudo-label distribution of the source domain.

(2) A latent topic sampling method, s-SLDA-Gibbs, is proposed. It samples latent topics for documents in the target domain and the source domain under different constraints: labeled documents in the target domain are sampled according to their label categories, while unlabeled documents in the source domain are sampled under pseudo-label constraints (an illustrative sketch of this idea follows the abstract). The parameters of the s-SLDA topic model are then estimated from the sampled assignments.

(3) Based on the s-SLDA topic model, a new semi-supervised text categorization method, s-SLDA-TC (s-SLDA Text Categorization), is proposed. Comparative experiments against other methods on the 20newsgroup English data set and the Sogou Chinese data set verify the effectiveness of the s-SLDA topic model in the cross-domain setting. The experimental results show that the proposed s-SLDA-TC method can effectively use source-domain information to improve the performance of semi-supervised text classification.
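To make the constrained-sampling idea in (2) concrete, the following is a minimal sketch of one collapsed-Gibbs sweep in which each document may only draw topics permitted by its label (target domain) or its pseudo-label (source domain). It is an illustration of the general idea only, not the thesis's s-SLDA-Gibbs implementation: all names (constrained_gibbs_pass, topic_of_label, n_dk, n_kw, n_k, alpha, beta) are assumptions introduced here, and the full conditional shown is the standard LDA collapsed-Gibbs form rather than the s-SLDA one.

import numpy as np

def constrained_gibbs_pass(docs, z, labels, pseudo_labels, topic_of_label,
                           n_dk, n_kw, n_k, alpha, beta, V, K, rng):
    """One collapsed-Gibbs sweep over all tokens (illustrative sketch).

    For labeled target-domain documents the topic assignment is restricted
    to the topics associated with the document's label; for unlabeled
    source-domain documents the restriction comes from a pseudo-label.
    Parameter names here are hypothetical, not those of s-SLDA.
    """
    for d, doc in enumerate(docs):
        # Topics allowed for this document, derived from its (pseudo-)label.
        lab = labels.get(d, pseudo_labels.get(d))
        allowed = topic_of_label[lab] if lab is not None else np.arange(K)
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current assignment from the count tables.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Standard collapsed-Gibbs conditional, evaluated only on allowed topics.
            p = (n_dk[d, allowed] + alpha) * (n_kw[allowed, w] + beta) / (n_k[allowed] + V * beta)
            p /= p.sum()
            k_new = allowed[rng.choice(len(allowed), p=p)]
            # Record the new assignment.
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
    return z

In this sketch, target-domain and source-domain documents differ only in where the constraint comes from (a true label versus a pseudo-label), mirroring the division described in item (2) of the abstract.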
Keywords/Search Tags: Semi-supervised learning, Topic model, Latent Dirichlet Allocation, Gibbs sampling, Text classification