
Research on Massive Data Mining Algorithms Based on Cloud Computing for Cotton Storage

Posted on: 2015-06-19  Degree: Master  Type: Thesis
Country: China  Candidate: X Wang  Full Text: PDF
GTID: 2208330428481156  Subject: Computer application technology

Abstract/Summary:
Today the volume of data has grown from the TB level (1024 GB = 1 TB) to the PB (1024 TB = 1 PB), EB (1024 PB = 1 EB) and even ZB (1024 EB = 1 ZB) levels. This explosive growth poses serious challenges to the performance of traditional server clusters, and traditional data mining algorithms can no longer extract knowledge from big data efficiently. Cloud computing distributes computation across a large number of machines; this model is well suited to processing large data sets and can effectively resolve the performance bottlenecks of traditional computing models. Hadoop is an open-source distributed system infrastructure developed by the Apache Foundation. Its core components are the Hadoop Distributed File System (HDFS), the MapReduce computation model and the HBase distributed database. Hadoop offers high reliability, high scalability, high efficiency, strong fault tolerance and low cost, which has made it the mainstream cloud computing platform for both academic research and industrial application.

This thesis studies the core projects of the Hadoop ecosystem (HDFS, MapReduce and HBase) together with the processes and principles of data mining, and improves the traditional FP-Growth and Naive Bayes algorithms. To overcome the shortcomings of the traditional algorithms, the proposed solutions are implemented in parallel on Hadoop, so that the association rule and classification algorithms can handle massive amounts of data efficiently. Finally, the improved algorithms are used to build a mining model that monitors stored cotton for spontaneous combustion in the project "Cotton Warehouse Quality Management". The main contributions are as follows.

1. A mining algorithm based on dynamic arrays is proposed to address the inefficiency of traditional FP-Growth, which must recursively build conditional FP-Trees while mining frequent patterns and therefore has poor time and space performance. The new algorithm is implemented in parallel on Hadoop (a minimal sketch of the MapReduce counting pass appears after this list); the improved FP-Growth algorithm (PLFPG) can handle large data sets efficiently.

2. The traditional Naive Bayes algorithm assumes that attributes are independent and that continuous-valued attributes follow a Gaussian distribution. To relax these assumptions, this thesis combines correlation coefficients with Flexible Bayes (a sketch of the flexible density estimate also appears below) and implements the result in parallel on Hadoop. The improved algorithm (PCFNB) can process huge amounts of data efficiently and with high accuracy.

3. The improved algorithms are applied in the "Cotton Warehouse Quality Management" project. The spontaneous combustion characteristics of warehoused cotton are studied in depth, and a model is established that monitors the cotton and raises combustion warnings efficiently and accurately.
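To make the MapReduce model underlying PLFPG concrete, the following is a minimal sketch, assuming Hadoop's standard Java MapReduce API, of the first parallel pass that FP-Growth-style algorithms run: counting the global support of every item across all transactions. The class names and the input format (one space-separated transaction per line) are illustrative assumptions, not details taken from the thesis.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ItemSupportCount {

    // Mapper: each input line is one transaction, items separated by
    // whitespace; emits (item, 1) for every item in the transaction.
    public static class ItemMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    item.set(token);
                    context.write(item, ONE);
                }
            }
        }
    }

    // Reducer (also usable as a combiner): sums the counts for each
    // item, yielding its global support.
    public static class SupportReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "item support count");
        job.setJarByClass(ItemSupportCount.class);
        job.setMapperClass(ItemMapper.class);
        job.setCombinerClass(SupportReducer.class);
        job.setReducerClass(SupportReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In a full parallel FP-Growth job, later MapReduce passes would typically regroup transactions by these counts and mine frequent patterns within each group; the thesis's dynamic-array variant replaces the recursive conditional FP-Tree construction in that mining step.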
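Similarly, the difference between the Gaussian assumption of traditional Naive Bayes and the kernel-density estimate used by Flexible Bayes can be shown in a few lines. This is a minimal sketch following John and Langley's common bandwidth heuristic h = 1/sqrt(n); the class and method names and the sample values are hypothetical and are not taken from the thesis.

```java
public final class ContinuousLikelihood {

    // Traditional Naive Bayes: p(x | class) for a continuous attribute is
    // a single Gaussian with the class-conditional mean and std. deviation.
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    // Flexible Bayes: p(x | class) is the average of Gaussian kernels
    // centered at each training value observed for that class, using the
    // heuristic bandwidth h = 1/sqrt(n).
    static double flexible(double x, double[] trainingValues) {
        int n = trainingValues.length;
        double h = 1.0 / Math.sqrt(n);
        double sum = 0.0;
        for (double v : trainingValues) {
            sum += gaussian(x, v, h);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Hypothetical class-conditional training values,
        // e.g. bale temperatures in degrees Celsius.
        double[] observed = {36.2, 36.8, 37.1, 38.0, 39.5};
        System.out.println(gaussian(37.0, 37.5, 1.1));   // single-Gaussian estimate
        System.out.println(flexible(37.0, observed));    // kernel-density estimate
    }
}
```

The kernel estimate adapts to multimodal or skewed attribute distributions, which is why Flexible Bayes can outperform the single-Gaussian model when the normality assumption fails.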
Keywords/Search Tags: Hadoop, Association Rules, Classification, FP-Growth, Naive Bayes, Cotton Warehousing, Spontaneous Combustion