
Research And Implementation Of Data Classification Algorithm Based On Decision Tree

Posted on: 2017-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Liu
GTID: 2348330518495806
Subject: Computer Science and Technology
Abstract/Summary:
With the spread and maturing of the mobile Internet, cloud computing, and the Internet of Things, vast amounts of data are generated every day, and the data sources are complex and diverse. How to extract information from this big data that is useful to businesses and individuals is an important problem today. Classification is an important data mining task that provides a solid foundation for subsequent tasks such as clustering and correlation analysis, so data classification has significant research value within data mining. This thesis introduces the concepts, processes, and development of data mining technology and analyzes the data classification task in detail.

SPRINT, one of the classic classification algorithms, is widely used today. After presenting the SPRINT algorithm and its problems, the thesis improves the method for finding the best split point. For discrete and continuous attributes, it proposes two new data structures, the classification table and the merge partition table, to reduce unnecessary operations and the number of candidate nodes. These improvements shorten the time needed to construct the decision tree and improve the overall performance of the algorithm.

However, when traditional data classification algorithms face large data sets, their computing and storage capacity falls short of what is needed. The rise of cloud computing offers an opportunity to solve this problem: the high flexibility, high scalability, low resistance, and high reliability of cluster resources provide convenient underlying services for data mining. The thesis therefore combines data classification with cloud computing. Based on an analysis of the processing demands of large-scale data classification, it proposes a data classification model built on the Hadoop platform, and it integrates the Hadoop framework with data classification techniques by specifying the model's requirements, basic structure, and functional modules. The thesis also improves the algorithm tier of the system and optimizes the SPRINT algorithm through sort parallelism, node parallelism, and attribute parallelism. These improvements allow the SPRINT algorithm to be ported to the Hadoop platform.

Finally, the thesis tests the efficiency of the improved SPRINT algorithm on an experimental platform. The results show that the improved algorithm effectively reduces data processing time and improves overall system performance, so that the system can complete data classification tasks with high concurrency, low cost, and high reliability.
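For readers unfamiliar with the split-point search that the abstract refers to, the following is a minimal sketch of the baseline idea in SPRINT for a continuous attribute: sort the attribute list, sweep it once while maintaining class histograms for the records below and above the candidate point, and keep the midpoint with the lowest weighted Gini index. It does not reproduce the thesis's classification table, merge partition table, or Hadoop parallelization, whose details are not given here. The function names `gini` and `best_split` and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def gini(counts, total):
    """Gini index of a class histogram holding `total` records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(attribute_list):
    """Find the best binary split point for one continuous attribute.

    attribute_list: list of (value, class_label) pairs (hypothetical layout).
    Returns (split_point, weighted_gini); split_point is None if all values
    are equal and no split is possible.
    """
    records = sorted(attribute_list, key=lambda r: r[0])
    n = len(records)
    below = Counter()                                # classes with value <= candidate
    above = Counter(label for _, label in records)   # classes with value > candidate
    best_point, best_gini = None, float("inf")

    for i in range(n - 1):
        value, label = records[i]
        # Move the current record from the "above" histogram to "below".
        below[label] += 1
        above[label] -= 1
        next_value = records[i + 1][0]
        if value == next_value:
            continue  # cannot split between two identical values
        n_below = i + 1
        n_above = n - n_below
        split_gini = (n_below / n) * gini(below, n_below) \
            + (n_above / n) * gini(above, n_above)
        if split_gini < best_gini:
            best_point = (value + next_value) / 2
            best_gini = split_gini
    return best_point, best_gini


# Made-up toy data: (age, risk class)
toy = [(23, "High"), (17, "High"), (43, "High"),
       (68, "Low"), (32, "Low"), (20, "High")]
point, score = best_split(toy)
```

On this toy data the sweep selects the midpoint 27.5, which separates the three youngest records (all "High") from the rest. Because the histograms are updated incrementally, each attribute needs only one pass after sorting; reducing the number of candidate points evaluated in that pass is the kind of cost the abstract says the proposed data structures target.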
Keywords/Search Tags:data mining, cloud computing, Hadoop, data classification, SPRINT algorithm