
Design And Implementation Of Data Mining Classification System Based On Hadoop

Posted on: 2017-06-21  Degree: Master  Type: Thesis
Country: China  Candidate: X G Shen  Full Text: PDF
GTID: 2348330518494773  Subject: Computer technology
Abstract/Summary:
The rapid development of information technology makes data grow day by day, and large volumes of structured and unstructured data are now distributed across the Internet. With this unprecedented growth in data generation and storage, cloud platforms for resource management, storage, and computing have emerged. As an open-source cloud platform, Hadoop is well suited to processing large data sets. It provides the platform infrastructure, which includes the Hadoop Distributed File System (HDFS) and the MapReduce computing framework. Meanwhile, there is an urgent need to extract potentially useful information from the massive data that such platforms manage. Combining the Hadoop platform with text classification, a data mining technique, can reduce the computing time and memory consumption of classification work. A data mining classification system based on Hadoop is therefore of considerable significance.

Because text classification is the most common and important data mining classification technique, this thesis takes text classification as its main object of study and aims to improve the classification quality and time consumption of a classification system. First, to address the shortcomings of the traditional Naive Bayesian Classifier (NBC), it proposes an improved Attribute Weighted Naive Bayes Classifier (AWNBC). Second, using MapReduce and the proposed AWNBC, it designs and implements a data mining classification system based on Hadoop. Finally, it designs two experiments, which verify that the system improves classification quality and time consumption to a certain extent.

The main work of this thesis is as follows:
1. Based on a study of text classification, it reviews the technologies involved in the classification process.
2. It studies the Naive Bayesian classifier in depth and proposes an improved AWNBC, which combines the Expected Cross Entropy algorithm with the CHI weighting algorithm, to address the shortcomings of the traditional NBC.
3. It designs a data mining classification system based on Hadoop, which contains a text pre-processing module, a feature selection module, a text representation module, and the proposed Attribute Weighted Naive Bayes Classifier module. All of these modules are implemented in code.
4. Using a Hadoop experimental environment, it verifies the classification quality and efficiency of the system. The experiments show that the improved AWNBC is superior to the traditional NBC in classification results, and that the Hadoop-based classification system takes less time than a single-machine classification system when dealing with large-scale data.
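To make the idea of attribute weighting concrete, the following is a minimal single-machine sketch in Python, not the thesis's implementation. The abstract does not give the AWNBC weighting formula, so the weight used here (1 plus a term's normalised chi-square score) is an assumption chosen only for illustration, and it omits the Expected Cross Entropy component entirely. The function names `chi_square`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table.

    a: docs in the class containing the term, b: docs outside the class
    containing it, c: docs in the class without it, d: docs outside without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def train(docs):
    """Build a weighted naive Bayes model from (label, tokens) pairs."""
    class_docs = Counter()              # number of documents per class
    term_counts = defaultdict(Counter)  # term occurrences per class
    doc_freq = defaultdict(Counter)     # documents containing a term, per class
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        doc_freq[label].update(set(tokens))
    vocab = {t for counts in term_counts.values() for t in counts}
    n_docs = sum(class_docs.values())

    # Score each term by its largest chi-square value over all classes.
    chi = {}
    for term in vocab:
        df_total = sum(doc_freq[label][term] for label in class_docs)
        best = 0.0
        for label, n_label in class_docs.items():
            a = doc_freq[label][term]
            b = df_total - a
            c = n_label - a
            d = n_docs - n_label - b
            best = max(best, chi_square(a, b, c, d))
        chi[term] = best

    # ASSUMPTION: weight = 1 + normalised chi-square (illustrative only).
    max_chi = max(chi.values(), default=0.0)
    weights = {t: 1.0 + (v / max_chi if max_chi else 0.0) for t, v in chi.items()}
    return {"class_docs": class_docs, "term_counts": term_counts,
            "vocab": vocab, "weights": weights}


def classify(model, tokens):
    """Return the class maximising log P(c) + sum of w_t * log P(t | c)."""
    n_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_docs"].items():
        counts = model["term_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / n_docs)
        for t in tokens:
            if t not in model["vocab"]:
                continue  # ignore terms never seen in training
            p = (counts[t] + 1) / (total + vocab_size)  # Laplace smoothing
            score += model["weights"][t] * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With every weight fixed at 1.0 this reduces to an ordinary multinomial naive Bayes classifier. In a Hadoop setting, the counting loop in `train` is the part that would naturally be distributed as a MapReduce job, since per-class term counts can be summed independently across input splits.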
Keywords/Search Tags: Hadoop, text classification, naive Bayesian, attribute weighted