
Research And Application Of The Multi-labeled HDP Text Topic Model

Posted on: 2018-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Zheng
Full Text: PDF
GTID: 2428330542990550
Subject: Information management and information systems
Abstract/Summary:
With the advent of the big data era, Internet resources, led by text resources, are growing exponentially. How to mine the latent resources, the value of the data, and the content that interests customers from such huge and complex text data, using clear methods, has become a focus of academic and industrial research. To address this problem, many scholars have approached text data through association rule mining, text classification, feature selection, text clustering, topic mining, and so on. This thesis, which is also text-based, uses latent-topic modeling to construct a topic model and finds the semantic topics across texts that express their core meaning. It aims to provide an improved and useful algorithm for text clustering and classification.

The Hierarchical Dirichlet Process (HDP) topic model, based on Bayesian theory and the Dirichlet process, can automatically learn the optimal structure of the topic set from the data. In practice, however, the optimal topic set obtained by reducing the dimensionality of the text set does not accord with semantic requirements. Moreover, some existing labeled topic models require parameters that are very difficult to set. This thesis therefore proposes a semi-supervised labeled HDP topic model (SLHDP) and an accuracy evaluation index for random clustering (sk-measure), based on partially known semantic labels. The theoretical framework of the SLHDP model is built up step by step through parameter definition, variable mapping, model architecture, and the derivation of the basic formulas. The graphical model is also related to the physical process of latent-topic clustering of text: the topic set is generated by Gibbs sampling with the Chinese restaurant franchise (CRF) and stick-breaking constructions, and the model is explained in detail in terms of both structure and semantics.

The SLHDP model is applied to Chinese and English data sets. Taking an English news data set as a case study, cross-validation is used to examine the model's parameters. The model is then applied to the test set with the optimal parameters obtained from the experiments. Comparison with the SLLDA and HDP models on three indices shows that the SLHDP model yields a more reasonable topic-set composition in text classification on large-scale data sets. The thesis also explores applications of the SLHDP model, extending its scope to practical problems in other fields. Finally, the thesis summarizes its shortcomings and outlines prospects for future research.
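
The stick-breaking construction named in the abstract can be illustrated with a minimal sketch. This is the generic, textbook construction of Dirichlet-process mixture weights, not the SLHDP model itself; the function name `stick_breaking_weights`, the concentration parameter `alpha`, and the finite `truncation` level are assumptions made here for illustration and do not come from the thesis.

```python
import random


def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw truncated stick-breaking weights for a Dirichlet process.

    alpha      -- concentration parameter (assumed name, alpha > 0)
    truncation -- number of components to keep (illustrative cutoff;
                  the untruncated process has infinitely many)
    rng        -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation - 1):
        # Break off a Beta(1, alpha) fraction of what remains.
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # Assign the leftover mass to the last component so the
    # truncated weights still sum to one.
    weights.append(remaining)
    return weights


if __name__ == "__main__":
    topic_weights = stick_breaking_weights(alpha=1.0, truncation=10,
                                           rng=random.Random(0))
    print(topic_weights)
```

Smaller values of `alpha` tend to concentrate the mass on a few components, while larger values spread it over many, which is how a nonparametric model of this kind lets the data decide how many topics are effectively used.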
Keywords/Search Tags:Label, Semi-Supervised, HDP, Topic Model, Random Cluster