Research On Automatic Classification Of Chinese Books Based On Ensemble Learning

Posted on: 2020-04-10 | Degree: Master | Type: Thesis
Country: China | Candidate: M J Zhou | Full Text: PDF
GTID: 2438330602952143 | Subject: Library science
Abstract/Summary:
As daily library work becomes increasingly digitized and automated and library collections keep growing, manual bibliographic classification is no longer adequate. An automatic classification system for Chinese books is urgently needed. This thesis therefore sets out to build such a system to classify Chinese books efficiently. The system consists of four main parts: data preprocessing, feature extraction, text representation, and classification algorithm selection. The thesis describes the working principle of each part and sets the relevant parameters.

Earlier automatic classification systems for Chinese books rely mainly on the traditional bag-of-words model, and none of them applies an ensemble learning framework, which offers high classification accuracy. This thesis improves both the text representation and the classification algorithm. For text representation, it compares the term-frequency and TF-IDF models (both traditional bag-of-words models) with the Word2vec and GloVe models (both distributed representation methods) in their ability to represent Chinese books. Experiments show that the distributed representations represent books much better than the traditional bag-of-words models. Adjusting the relative weights of title and abstract shows that a weight ratio of 1:4 gives the best book representation. Finally, a distributed hybrid representation model is proposed that combines the book vectors generated by Word2vec and GloVe with equal weight and achieves the best bibliographic representation.

For the classification algorithm, an ensemble learning framework is introduced. The ensemble classification results of four base learners are compared (support vector machine, decision tree, naive Bayes, and back-propagation neural network) to obtain an efficient automatic classifier for Chinese books. Experiments show that, under the Bagging ensemble framework, the back-propagation neural network reaches a classification accuracy of 90.19%.

The system is then applied to multi-level classification of Chinese books. The accuracy of low-level classification is found to be higher than that of high-level classification, mainly because the samples are not distributed equally evenly at the different levels. Finally, experiments analyze how the number of samples and the number of categories affect classification accuracy. More samples yield higher accuracy, and accuracy remains essentially stable once the number of samples exceeds 40,000; more categories yield lower accuracy. Repeated experiments show that the system achieves high classification accuracy and can be applied to the automatic classification of Chinese books in libraries, offering a new solution to the book classification problem.
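
The two central ideas in the abstract, the weighted hybrid book representation and Bagging, can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation, and several things in it are assumptions: the 1:4 ratio is read as title:abstract, the equal-weight combination is read as an element-wise average of two vectors of the same dimension, each text is represented by the mean of its word vectors, and a nearest-centroid classifier stands in for the back-propagation neural network to keep the example short. All function names are hypothetical.

```python
import random
from collections import Counter


def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have a vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def book_vector(title, abstract, embeddings, dim, w_title=1.0, w_abstract=4.0):
    """Weighted mix of title and abstract vectors (assumed title:abstract = 1:4)."""
    t = mean_vector(title, embeddings, dim)
    a = mean_vector(abstract, embeddings, dim)
    total = w_title + w_abstract
    return [(w_title * x + w_abstract * y) / total for x, y in zip(t, a)]


def hybrid_vector(vec_a, vec_b):
    """Equal-weight average of two book vectors, e.g. one per embedding model."""
    return [0.5 * x + 0.5 * y for x, y in zip(vec_a, vec_b)]


def fit_nearest_centroid(sample):
    """Stand-in base learner (assumption): predict the label of the closest centroid."""
    by_label = {}
    for vec, label in sample:
        by_label.setdefault(label, []).append(vec)
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }

    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return predict


def bagging_predict(train, x, fit, n_estimators=10, seed=0):
    """Bagging: fit each learner on a bootstrap sample, then take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(train) items with replacement.
        sample = [rng.choice(train) for _ in train]
        model = fit(sample)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]
```

In use, `book_vector` would be called once with Word2vec embeddings and once with GloVe embeddings for the same tokenized title and abstract, the two results passed to `hybrid_vector`, and the resulting vectors paired with class labels as the `train` list for `bagging_predict`. Swapping `fit_nearest_centroid` for another `fit` function is how different base learners would be compared under the same ensemble framework.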
Keywords/Search Tags: Chinese books classification, ensemble learning, distributed representation, machine learning