Design And Implementation Of Data Mining Classification System Based On Hadoop

Posted on:2017-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:X G Shen

Full Text:PDF

GTID:2348330518494773

Subject:Computer technology

Abstract/Summary:

The rapid development of information technology drives data growth day by day, and large volumes of structured and unstructured data are now distributed across the Internet. With this unprecedented growth in data generation and storage, cloud platforms for resource management, storage, and computing have emerged. Hadoop, an open-source cloud platform, is well suited to processing large data sets. It provides the platform infrastructure, including the Hadoop Distributed File System (HDFS) and the MapReduce computing framework. At the same time, there is an urgent need to extract potentially useful information from the massive data that such platforms manage. Combining the Hadoop platform with text classification, a data mining technique, can reduce the computing time and memory consumption of classification work. A data mining classification system based on Hadoop is therefore of far-reaching significance.

Because text classification is one of the most common and important data mining classification techniques, this paper takes text classification as its main object of study and aims to improve the classification quality and reduce the time consumption of a classification system. First, to address the shortcomings of the traditional Naive Bayesian Classifier (NBC), this paper proposes an improved Attribute Weighted Naive Bayes Classifier (AWNBC). Second, using MapReduce and the proposed AWNBC, this paper designs and implements a data mining classification system based on Hadoop. Finally, this paper designs two experiments, which verify that the classification system improves classification quality and reduces time consumption to a certain extent.

The main work of this paper is as follows:

1. Based on a study of text classification, the paper reviews the technologies involved in the text classification process.
2. The paper studies the Naive Bayesian classifier in depth and proposes an improved AWNBC, which combines the Expected Cross Entropy algorithm with a CHI weighting algorithm, to address the shortcomings of the traditional NBC.
3. The paper designs and implements in code a data mining classification system based on Hadoop, which contains a text pre-processing module, a feature selection module, a text representation module, and the proposed Attribute Weighted Naive Bayes Classifier module.
4. Using a Hadoop experimental environment, the paper verifies the classification quality and efficiency of the system. The experiments show that the improved AWNBC gives better classification results than the traditional NBC, and that the Hadoop-based classification system takes less time than a single-machine classification system when dealing with large-scale data.
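
The attribute-weighting idea summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the AWNBC formulas, so everything below is an assumption for illustration only: each term's log-likelihood is scaled by a weight derived from its CHI (chi-square) statistic, taken as the maximum over classes and normalized to [0, 1]. The Expected Cross Entropy component is omitted, and the names `chi_square`, `train`, and `classify` are hypothetical. The sketch runs on a single machine; the per-class counting in `train` is the kind of step that could plausibly be distributed as MapReduce jobs, which is not shown here.

```python
import math
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class 2x2 contingency table.

    a: docs in the class containing the term, b: docs outside the class
    containing it, c: docs in the class without it, d: docs outside the
    class without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def train(docs):
    """docs: list of (label, tokens). Returns counts plus per-term weights."""
    class_docs = Counter()              # documents per class
    term_counts = defaultdict(Counter)  # class -> term -> occurrences
    doc_freq = defaultdict(Counter)     # term -> class -> docs containing it
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        for term in set(tokens):
            doc_freq[term][label] += 1

    n_docs = sum(class_docs.values())
    weights = {}
    for term, per_class in doc_freq.items():
        df = sum(per_class.values())
        scores = []
        for label, n_c in class_docs.items():
            a = per_class[label]
            b = df - a
            c = n_c - a
            d = n_docs - n_c - b
            scores.append(chi_square(a, b, c, d))
        weights[term] = max(scores)  # assumption: max over classes
    top = max(weights.values(), default=0.0) or 1.0
    weights = {t: w / top for t, w in weights.items()}  # scale to [0, 1]
    return {"class_docs": class_docs, "term_counts": term_counts,
            "weights": weights, "n_docs": n_docs}


def classify(model, tokens):
    """Return the class with the highest weighted log-posterior score."""
    vocab_size = len(model["weights"])
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_docs"].items():
        counts = model["term_counts"][label]
        total = sum(counts.values())
        score = math.log(n_c / model["n_docs"])  # log prior
        for term in tokens:
            if term not in model["weights"]:
                continue  # ignore terms never seen in training
            # Laplace-smoothed likelihood, scaled by the term weight
            p = (counts[term] + 1) / (total + vocab_size)
            score += model["weights"][term] * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With all weights set to 1 this reduces to an ordinary multinomial Naive Bayes classifier with Laplace smoothing, so the weights are the only point where this sketch departs from the traditional NBC.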

Keywords/Search Tags:

Hadoop, text classification, naive Bayesian, attribute weighted

Related items

1	Application And Research On Beyas Classification Algorithm
2	Research On Weighted Naive Bayesian Classification Algorithm Based On Rough Set Theory
3	Improveing Based On Naive Bayesian Classifier Algorithm
4	Research And Application Of Naive Bayesian Classification Based On Attribute Selection
5	Research About The Selective Naive Bayesian Classification Based On Weighted Attributes
6	Research On Bayesian Networks-Based Text Classification Algorithms
7	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
8	Research On Chinese Text Sentiment Polarity Classification Based On Naive Bayesian
9	Research On Naive Bayesian Classifier Algorithm
10	Text Categorization Based On Naive Bayes Method