
Research And Implementation On Fuzzy C-means Algorithm For Big Data In Cloud

Posted on: 2015-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: C J Yu
Full Text: PDF
GTID: 2298330452450741
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, interactive applications such as MicroBlog, WeChat and SNS have sprung up. Data is exploding as cloud-based applications on various digital devices multiply. Faced with such an enormous amount of data, traditional data analysis tools cannot mine useful information in depth, because they perform only simple processing of the data. Extracting valuable information from massive data is therefore particularly important. Cluster analysis is one of the techniques used in big data analytics. Traditional cluster analysis on stand-alone machines cannot meet the demands of computational efficiency and complexity in big data analytics. Cloud computing provides a new approach to research on cluster analysis of big data.

This thesis studies traditional cluster analysis in combination with the MapReduce parallel computing model, so that cluster analysis can be performed quickly and efficiently on big data. The content of this thesis is as follows:

(1) Research on methods of big data integration
Diversity is one of the notable features of big data, as the types and sources of data vary greatly, and data from different sources must be integrated before analysis. The thesis studies this diversity and, by analyzing traditional data integration systems based on Web Service and XML, studies methods of XML data parsing in a cloud environment. It puts forward a Hadoop-based data integration scheme that integrates datasets from different sources into an HBase database and supports fast and efficient analysis of the data.

(2) Research on Fuzzy C-Means (FCM)
Cluster analysis is one of the big data analytics techniques. The thesis studies Fuzzy C-Means and designs a MapReduce implementation of it.

(3) Research on Fuzzy C-Means based on Canopy (Canopy-FCM)
The thesis studies the Canopy algorithm in view of the high volume of big data. Canopy is a coarse but fast algorithm that obtains coarse cluster centers in a few iterations.
The result produced by Canopy can be used as the input of the FCM algorithm to accelerate its convergence. The thesis studies Fuzzy C-Means based on Canopy and designs a MapReduce implementation of it.

(4) Research on Fuzzy C-Means based on Maximum and Minimum Distance by Hash Sampling (HMMFCM)
Canopy-FCM is a fast but not accurate clustering algorithm. Traditional clustering algorithms usually obtain the initial cluster centers with the maximum and minimum distance algorithm in order to achieve better results. Because the maximum and minimum distance algorithm cannot be parallelized, the thesis combines it with Hash sampling and puts forward a MapReduce scheme based on Hash sampling. The scheme computes the initial cluster centers with the maximum and minimum distance algorithm and uses them as the input of the FCM algorithm to achieve better clustering results.
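
The idea in item (4), seeding FCM with centers chosen by the maximum and minimum distance rule, can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's implementation: Hash sampling and the MapReduce decomposition are left out, a fixed number of centers k replaces the threshold-based stopping rule that the max-min algorithm often uses, and the function names (max_min_centers, memberships, fcm) and the defaults m=2.0 and tol=1e-6 are assumptions made for illustration only.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Max-min seeding (simplified, fixed k): start from the first point,
    then repeatedly add the point whose distance to its nearest
    already-chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def memberships(points, centers, m):
    """FCM membership update: u_ij is proportional to d_ij^(-2/(m-1)),
    normalised so that each point's memberships sum to 1."""
    u = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # The point coincides with one or more centers:
            # share its membership equally among those centers.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            total = sum(hits)
            u.append([h / total for h in hits])
        else:
            inv = [x ** (-2.0 / (m - 1.0)) for x in d]
            total = sum(inv)
            u.append([v / total for v in inv])
    return u


def fcm(points, centers, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and center updates until no center moves
    more than tol, or max_iter iterations have run."""
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(len(centers)):
            # Center j is the mean of all points weighted by u_ij^m.
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * p[t] for wi, p in zip(w, points)) / total
                for t in range(dim)
            ))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Return memberships that match the final centers.
    return centers, memberships(points, centers, m)
```

A call such as fcm(points, max_min_centers(points, k)) runs the two stages in sequence. The same fcm function would accept centers from any other seeding step, for example a Canopy pass as in item (3), since it only needs a list of starting centers.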
Keywords/Search Tags: Big Data, Cloud Environment, Data Integration, FCM, MapReduce