System Development And Design Of Library Document Classification Based On The LDA Model

Posted on: 2017-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhao
Full Text: PDF
GTID: 2348330518998683
Subject: Software engineering
Abstract/Summary:
Multi-label classification is an important research topic in machine learning. Its algorithms fall into two categories: discriminative and generative. Among the generative algorithms, topic models are some of the best performing. Building on a study of the Prior-LDA and Dependency-LDA models, this thesis proposes an improved model, FP-LDA (Frequency Prior-LDA), and a further improved model, Super-LDA.

Because the traditional LDA model cannot incorporate label information, researchers proposed the Labeled-LDA model, which maps labels one-to-one to topics in order to model labeled text. However, Labeled-LDA ignores label occurrence frequency, which lowers classification performance. To address this, other researchers proposed the Prior-LDA model, which first samples from the label frequency distribution when modeling a training text. Prior-LDA has two problems: (1) the Dirichlet prior distribution over labels is generated from a limited number of samples, which increases the randomness of the algorithm and lowers its classification performance; (2) the time complexity of the algorithm increases. Based on Prior-LDA, this thesis presents the FP-LDA model, which uses the label occurrence frequency directly as the weight of the label distribution. FP-LDA effectively solves both problems.

Label correlation is important information that affects the performance of multi-label classification, but the Labeled-LDA, Prior-LDA, and FP-LDA models do not take it into account. Therefore, starting from FP-LDA and drawing on label correlation and the structure of the Dependency-LDA model, this thesis presents the improved Super-LDA model. Super-LDA introduces a topic layer between the label layer and the word layer. Each label corresponds to a small number of private topics, and the topics jointly express the relationships between labels. The experimental results show that both FP-LDA and Super-LDA improve multi-label classification performance, with Super-LDA showing the larger improvement.

Finally, a library literature classification system based on the LDA model is designed and developed. The thesis presents the system's overall design idea, overall structure, and detailed processing flow, and implements text preprocessing such as word segmentation and stop-word removal. Feature selection, feature weighting, text training, text classification, and classification performance evaluation are also implemented. To address corpus bias, a separate model is trained for each category that has only a small amount of data; the model is used to generate data that then supplement the original training set.
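
The contrast between a sampled label prior and a frequency-based one can be illustrated with a small sketch. This is not the thesis's implementation; it only assumes that "using the label occurrence frequency directly as the weight" means scaling each label's relative frequency into a Dirichlet-style weight. The function names and the `eta` and `smoothing` parameters are hypothetical and do not come from the abstract.

```python
from collections import Counter
import random


def label_frequency_prior(doc_labels, eta=50.0, smoothing=0.01):
    """Deterministic prior: weight each label by its relative frequency.

    doc_labels is a list of label lists, one per training document.
    eta and smoothing are illustrative hyperparameters (assumptions).
    """
    counts = Counter(label for labels in doc_labels for label in labels)
    total = sum(counts.values())
    return {label: eta * n / total + smoothing for label, n in counts.items()}


def sampled_label_prior(doc_labels, n_samples, eta=50.0, smoothing=0.01, seed=None):
    """Sampling-based contrast: estimate label weights from a limited
    number of random draws, so the result varies from run to run."""
    rng = random.Random(seed)
    pool = [label for labels in doc_labels for label in labels]
    draws = Counter(rng.choice(pool) for _ in range(n_samples))
    return {label: eta * draws[label] / n_samples + smoothing
            for label in sorted(set(pool))}


if __name__ == "__main__":
    # Toy label sets for three documents (made-up example data).
    training_labels = [["history", "china"], ["history"], ["history", "law"]]
    print(label_frequency_prior(training_labels))
    print(sampled_label_prior(training_labels, n_samples=5, seed=1))
```

In this sketch the frequency-based prior needs only a single counting pass and returns the same weights on every call, whereas the sampled version depends on the random draws and on how many of them are taken. That matches the two drawbacks of sampling named in the abstract, randomness and extra computation, under the stated assumptions.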
Keywords/Search Tags:multi-label, LDA, topic model, Labeled-LDA, Prior-LDA, Document classification