Classification Application Of Marine Popular Science Literature Based On Machine Learning

Posted on: 2022-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Liu
GTID: 2480306353957839
Subject: Master of Engineering
Abstract/Summary:
Since the start of the twenty-first century, the rapid development of the Internet has created an era of information explosion. The spread of we-media has made the quality of online articles uneven and sharply reduced the credibility of their content. Extracting useful information from massive amounts of data has become a problem for Internet users, who find it increasingly inefficient to obtain the knowledge they are interested in. To address these problems, this thesis develops a crawler-based automatic classification system for marine articles. The system automatically collects high-quality marine literature, classifies it using machine learning, and displays the results on the marine applied geology cloud portal. Automating the pipeline from data acquisition to classification makes it more efficient for users to access marine literature.

The crawler is built on the WebMagic framework. It crawls literature from marine websites, uses regular expressions to filter URLs, and persists the crawled literature to a local database. Cookie and Session mechanisms are used to get past the anti-crawler measures of some websites and to keep the crawler stable and its data up to date.

The research object of this thesis is marine popular science literature. During Chinese word segmentation, many specialized marine terms are not recognized by word segmentation tools, so the segmentation results are not ideal. This thesis collects a large number of specialized marine terms and loads them into the dictionary of the word segmentation tool, which improves the segmentation accuracy for Chinese marine popular science texts. The literature is classified into five categories: marine geography, marine geology, marine resources, marine disasters, and marine military.

For feature processing, TF-IDF vectors are used as the bag-of-words representation, and experiments led to selecting 170 features per category for dimensionality reduction. The classifier is chosen by combining theory and experiment. First, classification algorithms are compared to identify the two most suitable for this project: the Naive Bayes classifier and the support vector machine (SVM) classifier. The results of the two classifiers are then compared and verified experimentally. With the SVM classifier, the relevant parameters are tuned on the bag-of-words model, and a Gaussian kernel function is used to improve the measurement of similarity between data points. The classification accuracy reaches 88.1%, which is of high practical value for marine literature classification.
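As a rough illustration of the feature pipeline described in the abstract (TF-IDF weighting, per-category feature selection, Gaussian kernel similarity), the following is a minimal Python sketch, not the thesis's implementation. The toy corpus, the function names, the `tf * log(N / df)` weighting variant, and the rule of ranking terms by summed TF-IDF within each category are all assumptions made for illustration. The thesis keeps 170 terms per category and works on segmented Chinese text, while this sketch keeps 2 terms and uses English stand-in tokens.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: each document is already segmented into tokens.
# Tokens and labels are invented for illustration only.
DOCS = [
    (["seafloor", "ridge", "seafloor", "trench", "ocean"], "geology"),
    (["seafloor", "ridge", "basalt", "ocean"], "geology"),
    (["storm", "surge", "storm", "tsunami", "ocean"], "disasters"),
    (["storm", "surge", "typhoon", "ocean"], "disasters"),
]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens, _ in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens, _ in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def select_features(docs, weights, k):
    """Keep the k terms with the largest summed TF-IDF in each category.

    This ranking rule is an assumption; the abstract only states that
    170 features per category were chosen experimentally.
    """
    per_category = defaultdict(Counter)
    for (_, label), doc_weights in zip(docs, weights):
        per_category[label].update(doc_weights)  # adds the float weights
    selected = set()
    for totals in per_category.values():
        selected.update(term for term, _ in totals.most_common(k))
    return sorted(selected)


def vectorize(doc_weights, features):
    """Project a document onto the selected feature terms."""
    return [doc_weights.get(term, 0.0) for term in features]


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


weights = tf_idf(DOCS)
features = select_features(DOCS, weights, k=2)
vectors = [vectorize(w, features) for w in weights]

# Similarity of two documents from the same category vs. different ones.
same = rbf_kernel(vectors[0], vectors[1])
cross = rbf_kernel(vectors[0], vectors[2])
print(features, round(same, 3), round(cross, 3))
```

In this toy corpus the term "ocean" appears in every document, so its TF-IDF weight is zero and it is never selected. A kernel value closer to 1 means the two documents are closer in the reduced feature space, which is the similarity measure an SVM with a Gaussian kernel relies on.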
Keywords/Search Tags: Oceanography, Naive Bayes, Support vector machine, Text classification