
Research And Application Of Distributed Webpage Automatic Classification Algorithm Based On Bayes

Posted on: 2016-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: L B Xu
GTID: 2298330467492997
Subject: Signal and Information Processing

Abstract/Summary:
With the rapid development of the mobile Internet and rapid advances in data collection and storage technology, organizations can now accumulate massive amounts of data, and extracting useful information from it has become a major challenge. Data mining techniques and Hadoop cloud computing have emerged to meet this challenge. Automatic webpage classification is an important branch of data mining, playing a particular role in mining "commercial value"; for example, it can help mobile operators answer questions such as "who should be offered the package, costing 100 yuan per month, that bundles data traffic and long-distance calls?"

This paper focuses on building a distributed automatic webpage classification system; applying Hadoop cloud computing to webpage classification is one of its highlights. The paper begins with an overview of automatic webpage classification, then introduces the Bayesian classifier and feature selection, whose MapReduce programs are given in the form of block diagrams. It then describes the distributed automatic webpage classification system from a software-design perspective, and finally analyses classification performance experimentally on GB/TB-scale network traffic monitoring data. The innovations of this paper are as follows:

(1) Hadoop cloud computing is applied to automatic webpage classification. I study a distributed parallel algorithm for the naive Bayesian classifier to meet the challenge of classifying GB/TB-scale network traffic monitoring data.

(2) Hadoop cloud computing is applied to feature selection for text categorization.
I study the design and implementation of a MapReduce parallel algorithm for information gain to meet the challenge of selecting features from GB/TB-scale network traffic monitoring data.

(3) The statistical concept of "cumulative probability" is introduced into the parameter optimization of feature selection: the optimal feature vector size can be determined adaptively from a cumulative probability threshold. Feature vector size affects not only the software performance of the system but also its classification performance. This paper proposes a measure of robustness and verifies that the cumulative probability threshold scheme is robust, indicating that it is suitable for different application scenarios.

(4) Combining software design, Hadoop cloud computing, and data mining, a distributed automatic webpage classification system is built on the Hadoop framework. The "facade" design pattern is used in establishing the system framework, which divides the system, from top to bottom, into an interface layer, a component layer, and a module layer.
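The distributed naive Bayes training in (1) follows the word-count MapReduce pattern: mappers emit per-class document and feature counts for their data shard, and reducers sum them into the model's count tables. The thesis implements this on Hadoop; the following is only a minimal in-memory Python sketch of that map/reduce decomposition, with a tiny hypothetical corpus standing in for the GB/TB-scale traffic data.

```python
from collections import defaultdict

# Map phase: each mapper emits ((class, feature), 1) pairs for its
# shard of labelled pages, plus a per-class document counter.
def map_phase(shard):
    for label, features in shard:
        yield (("__doc__", label), 1)   # per-class document count
        for f in features:
            yield ((label, f), 1)       # per-class feature count

# Reduce phase: sum the counts for each key, yielding the count
# tables from which class priors and likelihoods are estimated.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return counts

# Hypothetical toy shards (two "mapper inputs").
shards = [
    [("sports", ["ball", "team"]), ("tech", ["hadoop", "data"])],
    [("tech", ["data", "cloud"])],
]

pairs = [p for shard in shards for p in map_phase(shard)]
model = reduce_phase(pairs)
print(model[("tech", "data")])      # count of "data" in class "tech": 2
print(model[("__doc__", "tech")])   # documents labelled "tech": 2
```

Because addition is associative and commutative, the reduce step can also run as a Hadoop combiner on each mapper's output, which is what makes this training embarrassingly parallel.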
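Innovations (2) and (3) combine information gain with a cumulative probability threshold: features are ranked by information gain, the gains are normalised, and the feature vector is cut off adaptively once the cumulative share reaches a threshold, instead of fixing the vector size by hand. The sketch below illustrates the idea in plain Python; the corpus, feature names, and the 0.75 threshold are illustrative assumptions, not values from the thesis.

```python
import math
from collections import Counter

def information_gain(docs, feature):
    """Information gain of a binary feature over the class labels."""
    def entropy(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())
    labels    = [lab for lab, feats in docs]
    with_f    = [lab for lab, feats in docs if feature in feats]
    without_f = [lab for lab, feats in docs if feature not in feats]
    cond = (len(with_f) / len(docs)) * entropy(with_f) \
         + (len(without_f) / len(docs)) * entropy(without_f)
    return entropy(labels) - cond

def select_by_cumulative_probability(gains, threshold):
    """Keep top-ranked features until normalised gains sum to threshold."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(g for _, g in ranked)
    selected, cum = [], 0.0
    for feat, g in ranked:
        selected.append(feat)
        cum += g / total
        if cum >= threshold:
            break
    return selected

# Hypothetical labelled documents (class, feature set).
docs = [("sports", {"ball", "team"}), ("sports", {"ball"}),
        ("tech", {"hadoop", "data"}), ("tech", {"data"})]
features = ["ball", "team", "hadoop", "data"]
gains = {f: information_gain(docs, f) for f in features}
print(select_by_cumulative_probability(gains, 0.75))  # ['ball', 'data']
```

Raising the threshold keeps more low-gain features (better recall of informative terms, larger vectors); lowering it shrinks the vector, which is exactly the software-performance vs. classification-performance trade-off the thesis's robustness measure evaluates.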
Keywords/Search Tags: Webpage Automatic Classification, Hadoop Cloud Computing, Naive Bayesian Classifier, Feature Selection