
Research on Massive Data Mining Algorithms Based on Cloud Computing for Cotton Storage

Posted on: 2015-06-19  Degree: Master  Type: Thesis
Country: China  Candidate: X Wang  Full Text: PDF
GTID: 2208330428481156  Subject: Computer application technology

Abstract/Summary:
Today the volume of data has grown from the TB level (1024 GB = 1 TB) to the PB (1024 TB = 1 PB), EB (1024 PB = 1 EB) and even ZB (1024 EB = 1 ZB) levels. This explosive growth poses serious challenges to the performance of traditional server clusters, and traditional data mining algorithms can no longer extract knowledge from big data efficiently. Cloud computing distributes computation across a large number of machines; this model is well suited to processing large data sets and can effectively resolve the performance bottlenecks of traditional computing models. Hadoop is an open-source distributed system infrastructure developed by the Apache Foundation. Its core components are the Hadoop Distributed File System (HDFS), the MapReduce computation model and the HBase distributed database. Hadoop offers high reliability, high scalability, high efficiency, strong fault tolerance and low cost, which has made it the mainstream cloud computing platform for both academic research and industrial application.

This thesis studies the core projects of the Hadoop ecosystem (HDFS, MapReduce and HBase) together with the processes and principles of data mining, and improves the traditional FP-Growth and Naive Bayes algorithms. To overcome the shortcomings of the traditional algorithms, the proposed solutions are implemented in parallel on Hadoop, so that the association rule and classification algorithms can handle massive amounts of data efficiently. Finally, the improved algorithms are used to build a mining model that monitors stored cotton for spontaneous combustion in the project "Cotton Warehouse Quality Management". The main contributions are as follows.

1. A mining algorithm based on dynamic arrays is proposed to address the inefficiency of traditional FP-Growth, which must recursively build conditional FP-Trees while mining frequent patterns and therefore has poor time and space performance. The new algorithm is implemented in parallel on Hadoop (a minimal sketch of the MapReduce counting pass appears after this list); the improved FP-Growth algorithm (PLFPG) can handle large data sets efficiently.

2. The traditional Naive Bayes algorithm assumes that attributes are independent and that continuous-valued attributes follow a Gaussian distribution. To relax these assumptions, this thesis combines correlation coefficients with Flexible Bayes (a sketch of the flexible density estimate also appears below) and implements the result in parallel on Hadoop. The improved algorithm (PCFNB) can process huge amounts of data efficiently and with high accuracy.

3. The improved algorithms are applied in the "Cotton Warehouse Quality Management" project. The spontaneous combustion characteristics of warehoused cotton are studied in depth, and a model is established that monitors the cotton and raises combustion warnings efficiently and accurately.
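To make the MapReduce model underlying PLFPG concrete, the following is a minimal sketch, assuming Hadoop's standard Java MapReduce API, of the first parallel pass that FP-Growth-style algorithms run: counting the global support of every item across all transactions. The class names and the input format (one space-separated transaction per line) are illustrative assumptions, not details taken from the thesis.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ItemSupportCount {

    // Mapper: each input line is one transaction, items separated by
    // whitespace; emits (item, 1) for every item in the transaction.
    public static class ItemMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    item.set(token);
                    context.write(item, ONE);
                }
            }
        }
    }

    // Reducer (also usable as a combiner): sums the counts for each
    // item, yielding its global support.
    public static class SupportReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "item support count");
        job.setJarByClass(ItemSupportCount.class);
        job.setMapperClass(ItemMapper.class);
        job.setCombinerClass(SupportReducer.class);
        job.setReducerClass(SupportReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In a full parallel FP-Growth job, later MapReduce passes would typically regroup transactions by these counts and mine frequent patterns within each group; the thesis's dynamic-array variant replaces the recursive conditional FP-Tree construction in that mining step.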
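Similarly, the difference between the Gaussian assumption of traditional Naive Bayes and the kernel-density estimate used by Flexible Bayes can be shown in a few lines. This is a minimal sketch following John and Langley's common bandwidth heuristic h = 1/sqrt(n); the class and method names and the sample values are hypothetical and are not taken from the thesis.

```java
public final class ContinuousLikelihood {

    // Traditional Naive Bayes: p(x | class) for a continuous attribute is
    // a single Gaussian with the class-conditional mean and std. deviation.
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    // Flexible Bayes: p(x | class) is the average of Gaussian kernels
    // centered at each training value observed for that class, using the
    // heuristic bandwidth h = 1/sqrt(n).
    static double flexible(double x, double[] trainingValues) {
        int n = trainingValues.length;
        double h = 1.0 / Math.sqrt(n);
        double sum = 0.0;
        for (double v : trainingValues) {
            sum += gaussian(x, v, h);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Hypothetical class-conditional training values,
        // e.g. bale temperatures in degrees Celsius.
        double[] observed = {36.2, 36.8, 37.1, 38.0, 39.5};
        System.out.println(gaussian(37.0, 37.5, 1.1));   // single-Gaussian estimate
        System.out.println(flexible(37.0, observed));    // kernel-density estimate
    }
}
```

The kernel estimate adapts to multimodal or skewed attribute distributions, which is why Flexible Bayes can outperform the single-Gaussian model when the normality assumption fails.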
Keywords/Search Tags: Hadoop, Association Rules, Classification, FP-Growth, Naive Bayes, Cotton Warehousing, Spontaneous Combustion