Design And Implementation Of Enterprise News Information Classification Subsystem In Distributed Environment

Posted on: 2018-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: B X Xu
GTID: 2348330515960853
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of the Internet, news of all kinds has been appearing at an increasing rate, and news information plays an increasingly important role in people's lives and culture. How to collect and organize large volumes of news data is the main research question of this dissertation. The dissertation presents and designs a news classification system for enterprises that provides news gathering, information processing and news presentation. Enterprise users can use the system to obtain information related to their industry quickly and accurately.

First, the web crawler module is designed. The crawler is implemented with a breadth-first algorithm, and this module collects and identifies corporate news efficiently.

Second, the dissertation designs and implements the text classification module, which uses the Bayesian algorithm to classify news text. During classification, text preprocessing, feature selection and vectorization require a great deal of computation, and model training suffers from long training times and the limited storage capacity of databases, among other issues. To solve these problems, we build a distributed computing platform on open-source Hadoop, use the MapReduce parallel computing model to parallelize the different stages, and establish a Hive data warehouse to address the large storage requirement.

When the volume of news data is large, the traditional Bayesian method has to re-learn all previous sample data, which is time-consuming and cumbersome. This dissertation introduces incremental learning and designs and implements an incremental Naive Bayesian algorithm that does not need to retrain on the existing data; it only modifies the original data.

Finally, a classification subsystem for enterprise news information is designed, comprising information acquisition, text preprocessing, feature extraction, classifier construction, classification performance evaluation and incremental learning. We test the functions of several modules. The system uses the crawler to obtain news information and classifies it in the Hadoop environment. The results show that, for large-scale news information, the incremental classifier improves accuracy by about 4% over the traditional Bayesian classifier and shows better efficiency and high scalability. The dissertation provides an implementation of a news classification algorithm, which can serve as a reference for text classification in other fields.
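
The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes a multinomial Naive Bayes model with Laplace smoothing, documents that are already tokenized, and a single-machine setting with no Hadoop or MapReduce. The class name `IncrementalNaiveBayes`, the method names and the `alpha` parameter are all hypothetical. The point it shows is that when the classifier stores raw counts, a new labelled sample only updates those counts and no earlier sample has to be revisited.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes that stores raw counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant (assumed)
        self.doc_counts = Counter()               # number of documents seen per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        self.vocabulary = set()                   # all terms seen so far

    def partial_fit(self, tokens, label):
        """Fold one labelled document into the counts; nothing is retrained."""
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest smoothed log-posterior."""
        if not self.doc_counts:
            raise ValueError("classifier has not seen any training data")
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            denom = total_terms + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)          # log prior
            for tok in tokens:                             # log likelihoods
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalNaiveBayes()
clf.partial_fit(["stock", "market", "shares"], "finance")
clf.partial_fit(["match", "goal", "team"], "sports")
print(clf.predict(["market", "shares"]))

# A new class arrives later; only the stored counts change.
clf.partial_fit(["chip", "processor", "launch"], "tech")
print(clf.predict(["processor", "chip"]))
```

Because the per-class counts are plain sums, they could in principle also be accumulated in parallel and merged, which is the property a MapReduce-style design would rely on; how the dissertation actually partitions the work is not stated in the abstract.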
Keywords: news classification, naive Bayesian, feature extraction, incremental learning, MapReduce