
Research And Application Of Distributed Webpage Automatic Classification Algorithm Based On Bayes

Posted on: 2016-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: L B Xu
GTID: 2298330467492997
Subject: Signal and Information Processing

Abstract/Summary:
With the rapid development of the mobile Internet and rapid advances in data collection and storage technology, organizations can now accumulate massive amounts of data, and extracting useful information from it has become a major challenge. Data mining techniques and Hadoop cloud computing have emerged to meet this challenge. Automatic webpage classification is an important branch of data mining, playing a particular role in mining "commercial value"; for example, it can help mobile operators answer questions such as "who should be offered the package, costing 100 yuan per month, that bundles data traffic and long-distance calls?"

This paper focuses on building a distributed automatic webpage classification system; applying Hadoop cloud computing to webpage classification is one of its highlights. The paper begins with an overview of automatic webpage classification, then introduces the Bayesian classifier and feature selection, whose MapReduce programs are given in the form of block diagrams. It then describes the distributed automatic webpage classification system from a software-design perspective, and finally analyses classification performance experimentally on GB/TB-scale network traffic monitoring data. The innovations of this paper are as follows:

(1) Hadoop cloud computing is applied to automatic webpage classification. I study a distributed parallel algorithm for the naive Bayesian classifier to meet the challenge of classifying GB/TB-scale network traffic monitoring data.

(2) Hadoop cloud computing is applied to feature selection for text categorization.
I study the design and implementation of a MapReduce parallel algorithm for information gain to meet the challenge of selecting features from GB/TB-scale network traffic monitoring data.

(3) The statistical concept of "cumulative probability" is introduced into the parameter optimization of feature selection: the optimal feature vector size can be determined adaptively from a cumulative probability threshold. Feature vector size affects not only the software performance of the system but also its classification performance. This paper proposes a measure of robustness and verifies that the cumulative probability threshold scheme is robust, indicating that it is suitable for different application scenarios.

(4) Combining software design, Hadoop cloud computing, and data mining, a distributed automatic webpage classification system is built on the Hadoop framework. The "facade" design pattern is used in establishing the system framework, which divides the system, from top to bottom, into an interface layer, a component layer, and a module layer.
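The distributed naive Bayes training in (1) follows the word-count MapReduce pattern: mappers emit per-class document and feature counts for their data shard, and reducers sum them into the model's count tables. The thesis implements this on Hadoop; the following is only a minimal in-memory Python sketch of that map/reduce decomposition, with a tiny hypothetical corpus standing in for the GB/TB-scale traffic data.

```python
from collections import defaultdict

# Map phase: each mapper emits ((class, feature), 1) pairs for its
# shard of labelled pages, plus a per-class document counter.
def map_phase(shard):
    for label, features in shard:
        yield (("__doc__", label), 1)   # per-class document count
        for f in features:
            yield ((label, f), 1)       # per-class feature count

# Reduce phase: sum the counts for each key, yielding the count
# tables from which class priors and likelihoods are estimated.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return counts

# Hypothetical toy shards (two "mapper inputs").
shards = [
    [("sports", ["ball", "team"]), ("tech", ["hadoop", "data"])],
    [("tech", ["data", "cloud"])],
]

pairs = [p for shard in shards for p in map_phase(shard)]
model = reduce_phase(pairs)
print(model[("tech", "data")])      # count of "data" in class "tech": 2
print(model[("__doc__", "tech")])   # documents labelled "tech": 2
```

Because addition is associative and commutative, the reduce step can also run as a Hadoop combiner on each mapper's output, which is what makes this training embarrassingly parallel.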
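Innovations (2) and (3) combine information gain with a cumulative probability threshold: features are ranked by information gain, the gains are normalised, and the feature vector is cut off adaptively once the cumulative share reaches a threshold, instead of fixing the vector size by hand. The sketch below illustrates the idea in plain Python; the corpus, feature names, and the 0.75 threshold are illustrative assumptions, not values from the thesis.

```python
import math
from collections import Counter

def information_gain(docs, feature):
    """Information gain of a binary feature over the class labels."""
    def entropy(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())
    labels    = [lab for lab, feats in docs]
    with_f    = [lab for lab, feats in docs if feature in feats]
    without_f = [lab for lab, feats in docs if feature not in feats]
    cond = (len(with_f) / len(docs)) * entropy(with_f) \
         + (len(without_f) / len(docs)) * entropy(without_f)
    return entropy(labels) - cond

def select_by_cumulative_probability(gains, threshold):
    """Keep top-ranked features until normalised gains sum to threshold."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(g for _, g in ranked)
    selected, cum = [], 0.0
    for feat, g in ranked:
        selected.append(feat)
        cum += g / total
        if cum >= threshold:
            break
    return selected

# Hypothetical labelled documents (class, feature set).
docs = [("sports", {"ball", "team"}), ("sports", {"ball"}),
        ("tech", {"hadoop", "data"}), ("tech", {"data"})]
features = ["ball", "team", "hadoop", "data"]
gains = {f: information_gain(docs, f) for f in features}
print(select_by_cumulative_probability(gains, 0.75))  # ['ball', 'data']
```

Raising the threshold keeps more low-gain features (better recall of informative terms, larger vectors); lowering it shrinks the vector, which is exactly the software-performance vs. classification-performance trade-off the thesis's robustness measure evaluates.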
Keywords/Search Tags: Webpage Automatic Classification, Hadoop Cloud Computing, Naive Bayesian Classifier, Feature Selection