
Research And Improvement On Data Classification Algorithms In Cloud Environment

Posted on: 2017-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Mo
GTID: 2348330518995717
Subject: Control Science and Engineering
Abstract/Summary:
With the development of computer technology, society has become increasingly information-driven and digitized. Fields such as commerce, medicine, meteorology, and astronomy have accumulated large amounts of data, and even people's daily lives produce vast amounts of data. Accumulating data alone brings no value. But processing massive data with appropriate techniques, even a simple sorting method, can make interesting patterns emerge, and applying search, recognition, and related technologies can then uncover the valuable information hidden in the data. Data mining is a discipline that emerged alongside big data, and data classification algorithms, as an important part of the field, have high research value and strong practical significance.

This thesis studies in depth several typical classification algorithms: the K-Nearest Neighbor (KNN) algorithm, the Naive Bayes (NB) algorithm, and the Support Vector Machine (SVM) algorithm. During this research, the work inevitably ran into the speed bottleneck of processing data on a single machine, so it was necessary to turn to distributed computing for a solution. Cloud computing platforms, as large-scale containers for storing and processing massive data, are an ideal carrier for data classification algorithms. This thesis presents improvements to data classification algorithms for the cloud environment, mainly KNN, NB, and SVM, adapting them to the data processing framework of the Hadoop platform to form a practical application platform for text classification.

The classification platform takes full account of the main characteristics of the traditional classification algorithms and the advantages of leading-edge cloud computing platform technology, so that the two complement each other as a whole. The overall platform framework comprises a text preprocessing module, a training module, and a data testing module, which together implement Chinese word segmentation, stop-word removal, text feature representation, and algorithm parallelization. Finally, simulation experiments show that the classification platform improves algorithm performance and greatly shortens data classification time. Especially with large amounts of data, it shows higher classification accuracy and a greater advantage in data processing speed.
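
To illustrate the kind of algorithm parallelization the abstract refers to, here is a minimal sketch of KNN split into a map step and a reduce step. It is not the thesis's implementation: it runs in plain Python rather than on Hadoop, it assumes texts are already represented as numeric feature vectors, and all function names (`map_partition`, `reduce_neighbors`, `knn_classify`) are hypothetical.

```python
import heapq
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_partition(partition, query, k):
    """Map step (assumed design): one data partition returns its own
    k nearest (distance, label) pairs for the query vector."""
    return heapq.nsmallest(
        k, ((euclidean(vec, query), label) for vec, label in partition)
    )


def reduce_neighbors(partials, k):
    """Reduce step: merge the local candidates from all partitions,
    keep the global k nearest, and return the majority label."""
    merged = heapq.nsmallest(k, (pair for part in partials for pair in part))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def knn_classify(partitions, query, k=3):
    """Run the map step on every partition, then reduce to one label.
    In a real cluster each map call would run on a different node."""
    partials = [map_partition(p, query, k) for p in partitions]
    return reduce_neighbors(partials, k)
```

Because each partition only forwards its k best candidates, the reduce step handles at most k times the number of partitions in pairs, regardless of how large the training set is. This is one common way to make KNN fit a MapReduce-style framework.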
Keywords/Search Tags: cloud computing, Hadoop, classification algorithm, parallelization