Research On Automatic Classification Model Of Papers Based On Machine Learning Model

Posted on: 2019-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Jia
Full Text: PDF
GTID: 2428330572497370
Subject: Applied statistics
Abstract/Summary:
With the development of digital libraries, the number of papers published each year keeps growing. To make papers easier for readers to find and study, their classification and management has become an urgent problem. Traditional manual classification is not only time consuming but also produces biased results because of the subjective judgment of the people doing the classifying. Finding a suitable machine learning model that classifies papers automatically is therefore the best way to solve this problem. This thesis mainly analyzes the differences between papers that are easily misjudged and papers that are classified correctly, and looks for ways to optimize the models, so as to obtain an ideal classification model and a scheme for managing the classification of papers.

This thesis selects 7000 master's theses from the China Zhiwen.com as sample data, according to the number of papers indexed. The papers are segmented into words with a word segmentation package in Python, and the weight of each feature word is calculated with the TF-IDF algorithm. The random forest, support vector machine, and AdaBoost algorithms are three of the most widely used models in the field of text mining. This thesis uses cross-validation to choose the most suitable of these three models, with classification accuracy and model training time as the evaluation criteria.

After a preliminary prediction, it was found that for the misjudged papers the feature words contained in the title, keywords, and abstract were not significant as variables, and that these papers were misclassified into relatively close categories. This thesis supplements each paper's feature words by constructing a knowledge map model and drawing on the feature words of the papers closest to it. When the optimized model is used to predict again, the accuracy improves clearly but still cannot meet practical needs. Analyzing the classification results again shows that theses in the three subjects of mathematics, physics, and geophysics have a relatively high probability of being misclassified as one another, and so do theses in the four subjects of finance, accounting, insurance, and investment. Following earlier classification methods, this thesis first divides the papers into two major categories, science and finance, and then subdivides each major category. The classification accuracy of the model eventually exceeded 90%, and an ideal automatic classification model for papers was obtained.

The research results show that non-standardized writing leads to large differences in the feature words a paper contains, which affects how well the paper is classified; that filling in a paper's feature words with the feature words of papers under the same instructor is effective; and that constructing a knowledge map model is conducive to the classification management of papers. Based on these conclusions, this thesis proposes suggestions for optimizing the automatic classification model and for managing the classification of papers in digital libraries. This benefits research on automatic classification models for papers and also has strong practical significance for the classification management of digital libraries.
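
As a rough illustration of the TF-IDF weighting step mentioned in the abstract, the following minimal Python sketch computes a weight for each feature word in a set of already-segmented documents. The abstract does not state which TF-IDF variant or which segmentation package was used, so the formula here (term frequency as count divided by document length, inverse document frequency as log(N / df)), the function name tfidf_weights, and the toy documents are all assumptions made for illustration, not the author's implementation.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant (not taken from the thesis):
    tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented toy documents standing in for the
# title, keywords, and abstract of three papers.
docs = [
    ["option", "pricing", "model", "risk"],
    ["seismic", "wave", "model", "inversion"],
    ["insurance", "risk", "premium", "model"],
]
weights = tfidf_weights(docs)
```

In this toy data, a word such as "model" that appears in every document receives a weight of 0, while a word that appears in only one document receives the largest weight. This is why TF-IDF favors feature words that help separate one subject from another.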
Keywords/Search Tags: machine learning, knowledge map, automatic classification, TF-IDF algorithm