
Research Of Large-scale Data Mining Technologies On MapReduce

Posted on: 2014-05-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q. H. Li    Full Text: PDF
GTID: 1108330464464389    Subject: Computer software and theory

Abstract/Summary:
The rapid development of the Internet and information systems has caused the emergence of big data, and mining such data efficiently is challenging. Many traditional serial mining algorithms cannot satisfy the requirements, so parallel data processing techniques are needed. As the amount of data grows rapidly, its structure also becomes more complex. The emergence of massive graph data and massive high-dimensional data challenges traditional data mining algorithms. Massive graph data is now quite common, for example in social networks, web page networks, and social media data. High-dimensional data is common too, for example text, video, and audio on the Internet. We need a distributed platform to parallelize data mining algorithms for both graph data and high-dimensional data.

MapReduce is one of the most popular cloud computing paradigms, and several platforms conforming to it have since appeared, such as Hadoop, Pig, Greenplum, and Aster Data. MapReduce attracts both industry and academia because computation can be distributed evenly across hundreds or thousands of machines. Google has implemented its search engine on the MapReduce model, and Hadoop, the most famous open-source MapReduce implementation, has been adopted by Yahoo!, Facebook, and other companies for large-scale data analysis. However, the original Hadoop cannot support graph applications well because of the dependencies among the subgraphs of the whole graph. Our research covers two aspects. First, we modify Hadoop to support more mining algorithms. Second, we design and optimize mining algorithms to suit the MapReduce paradigm; for example, we implement the PageRank algorithm on massive graphs and the clustering of high-dimensional audio data on Hadoop. To close the gap between the expected and actual efficiency of massive data mining, we explore these problems from the perspectives of both the cloud computing platform and the mining algorithms themselves.
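To make the map/reduce decomposition of PageRank concrete, here is a minimal in-process Python sketch of one iteration: the map phase scatters each node's rank share along its out-edges, and the reduce phase gathers the shares per node and applies damping. The toy graph, function names, and single-machine simulation are illustrative assumptions, not the dissertation's Hadoop code.

```python
from collections import defaultdict

def map_phase(ranks, graph):
    """Each 'mapper' emits (neighbour, rank_share) pairs for one node."""
    emitted = []
    for node, neighbours in graph.items():
        share = ranks[node] / len(neighbours)
        for nb in neighbours:
            emitted.append((nb, share))
    return emitted

def reduce_phase(emitted, nodes, d=0.85):
    """Each 'reducer' sums the incoming shares for one node, then damps."""
    sums = defaultdict(float)
    for node, share in emitted:
        sums[node] += share
    return {n: (1 - d) / len(nodes) + d * sums[n] for n in nodes}

# Toy 3-node graph: a -> b, c;  b -> c;  c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {n: 1.0 / len(graph) for n in graph}
for _ in range(20):
    ranks = reduce_phase(map_phase(ranks, graph), list(graph))
```

On a real cluster, each iteration is a full MapReduce job, which is exactly the I/O overhead that motivates the locality-aware design discussed below.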
The main contributions are as follows:

1. We propose the Locality Iteration MapReduce (LI-MR) model to support iterative data mining algorithms.
Because the MapReduce model lacks efficient support for iterative data mining algorithms that can benefit from locality, we propose the Locality Iteration MapReduce (LI-MR) model. We implement LI-MR in two ways: first, we extend Hadoop with APIs that provide caching and indexing; second, we integrate Hadoop with HBase to support random access to data. The main ideas of LI-MR include coarse-grained processing units, communication among data blocks, support for in-memory computation, and support for local computation.

2. We propose the MR-LPA algorithm to mine social networks.
The problem of massive graph partitioning has long attracted researchers' attention. As an application of graph partitioning, community detection must run quickly enough to satisfy online requests. The Label Propagation Algorithm (LPA) is a fast community detection algorithm whose running time is nearly linear, but it is still slow on massive social networks. We propose a parallel LPA on Hadoop to reduce the running time.

3. We propose the LI-PageRank algorithm conforming to the LI-MR model.
Previous implementations of PageRank on MapReduce ignore locality in distributed systems, which is very important for reducing I/O and network costs. We propose the LI-PageRank algorithm, which exploits locality by supporting a subgraph as a single input record for map functions. Graph partitioning techniques and a message grouping method are employed to guarantee efficient communication among different subgraphs.

4. We improve the efficiency of the k-means algorithm using LSH techniques.
We explore the clustering problem for massive high-dimensional datasets using locality-sensitive hashing (LSH). In this dissertation we improve the performance of the k-means algorithm in three ways by adopting LSH: first, we group similar features together and assign each group as a unit, reducing the number of features to be compared; second, we improve the quality of center selection by sampling from different buckets produced by LSH; third, we prune unnecessary comparisons by considering only the centers in nearby buckets.
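The core label update that MR-LPA parallelizes can be sketched serially as follows: every node repeatedly adopts the most frequent label among its neighbours until labels stabilise. The graph, function names, and iteration cap are illustrative assumptions; the dissertation's version distributes this update across Hadoop workers.

```python
import random
from collections import Counter

def label_propagation(graph, max_iters=50, seed=0):
    """Serial core of LPA: each node adopts its neighbours' majority label."""
    rng = random.Random(seed)
    labels = {n: n for n in graph}   # each node starts in its own community
    nodes = list(graph)
    for _ in range(max_iters):
        changed = False
        rng.shuffle(nodes)           # random update order, as in standard LPA
        for n in nodes:
            if not graph[n]:
                continue
            counts = Counter(labels[nb] for nb in graph[n])
            best = counts.most_common(1)[0][0]
            if labels[n] != best:
                labels[n] = best
                changed = True
        if not changed:              # converged: no label moved this sweep
            break
    return labels

# Two triangles joined by the edge 2-3: a natural two-community graph.
g = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation(g)
```

Each sweep touches every edge once, which is the near-linear behaviour noted above; the parallel version's challenge is propagating label updates across subgraph boundaries between MapReduce rounds.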
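The third k-means improvement (pruning center comparisons via buckets) can be illustrated with a minimal sign-of-random-projection LSH: points and centers that hash to the same bucket are likely close, so each point is compared only against the centers in its own bucket, falling back to a full scan when the bucket holds no center. The hash family, dimensions, and data here are assumptions for illustration, not the dissertation's exact scheme.

```python
import random

def lsh_signature(vec, planes):
    """Sign of projection onto each random hyperplane; near vectors tend to collide."""
    return tuple(1 if sum(p[i] * vec[i] for i in range(len(vec))) > 0 else 0
                 for p in planes)

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lsh_kmeans_assign(points, centers, planes):
    """Assign each point to the nearest center within its LSH bucket,
    scanning all centers only when the bucket contains none."""
    buckets = {}
    for ci, c in enumerate(centers):
        buckets.setdefault(lsh_signature(c, planes), []).append(ci)
    assignment = []
    for x in points:
        candidates = buckets.get(lsh_signature(x, planes)) or range(len(centers))
        assignment.append(min(candidates, key=lambda ci: dist2(x, centers[ci])))
    return assignment

rng = random.Random(1)
planes = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
points = [(1.0, 1.1), (0.9, 1.0), (-1.0, -1.0), (-1.1, -0.9)]
centers = [(1.0, 1.0), (-1.0, -1.0)]
labels = lsh_kmeans_assign(points, centers, planes)
```

In high dimensions, where each distance computation is expensive, restricting the candidate set this way is what turns the per-iteration cost from (points x centers) into roughly (points x centers-per-bucket).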
Keywords/Search Tags: data mining, MapReduce, cloud computing