
The Research And Implementation Of Parallelism Of Information Retrieval Related Algorithms Based On Mapreduce

Posted on: 2013-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Xiao
Full Text: PDF
GTID: 2248330371987996
Subject: Computer software and theory
Abstract/Summary:
With the growing user population and rapid development of the Internet, the scale of online data is increasing exponentially, and the information explosion has become one of the defining features of our era. As an important access point to the Internet, the search engine plays an increasingly important role in helping users find precisely the information they need from the massive Internet, and people depend more and more on search engines in daily life and production. The search engine's target is the entire Internet, including web pages, images, music, video, FTP resources, and so on. This massive volume of data poses a serious challenge to the efficient operation of information retrieval systems. On the one hand, limited by CPU clock frequency, memory capacity, disk I/O rate, network bandwidth, and so on, a single computer cannot process all the data alone; on the other hand, the data is distributed across the whole Internet rather than stored in a single computer or database, which means that thousands or even tens of thousands of computers need to process it cooperatively. As a result, designing parallel algorithms that let search engines process massive Internet data efficiently has become a common goal of academia and industry. Research in parallel computing has made great progress in the past decades, and several classical parallel computing platforms have been proposed, including MPI, OpenMP, OpenCL, and CUDA. Among them, the MapReduce parallel computing paradigm proposed by Google in 2004 provides a simple and efficient computing model and environment for parallel computing, with excellent scalability, reliability, and ease of use.
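The MapReduce paradigm described above can be illustrated with a minimal, self-contained sketch (this is an illustration of the paradigm itself, not code from the thesis): a map function emits key-value pairs, the framework groups pairs by key (the shuffle phase), and a reduce function aggregates each group.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the MapReduce framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for one word.
    return key, sum(values)

def word_count(docs):
    pairs = [p for doc in docs for p in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["map reduce map", "reduce"])
# counts == {"map": 2, "reduce": 2}
```

On a real cluster, the map and reduce calls run on many machines and the shuffle moves data over the network; only the two user-defined functions change per application.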
MapReduce makes it easier to move parallel computation from theory to application. Traditional algorithms in the information retrieval domain have matured, but some were not designed for parallel computing environments, which means they either cannot process massive data directly or cannot finish within an acceptable time. Once parallelized, they can run on multiple computers simultaneously, which improves the efficiency of processing massive data, responds to users' search requests faster, and improves users' search experience. Query suggestion and page ranking are two important research topics in information retrieval: query suggestion helps users obtain more precise search results in a shorter time, while page ranking improves the search experience and lets users find their target pages more easily. If we can parallelize these serial information retrieval algorithms and run them on a cluster, then the capacity of search engines to process massive data can be improved effectively, the update cycles of query suggestion and page ranking can be shortened, and users will be more satisfied with the search process. In this paper, against the background of parallel processing of massive Internet data, we study the QUBIC algorithm for query suggestion and page ranking algorithms based on frequent set mining, and we parallelize QUBIC and the frequent-set-mining-based page ranking algorithm on the MapReduce parallel computing paradigm, so that both can run within a MapReduce framework. We also implement a prototype system on Hadoop.
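The frequent set mining mentioned above follows the classical SON two-pass scheme, which is what makes it amenable to MapReduce: pass 1 finds itemsets that are frequent within each data partition (candidates, with no false negatives), and pass 2 counts those candidates over the full data. The sketch below is an illustrative serial rendering of that idea, not the thesis's PSON implementation; the partition layout, threshold split, and `max_size` limit are assumptions for the example.

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, local_support, max_size=2):
    # Pass 1 (map side): itemsets frequent within a single partition.
    counts = Counter()
    for basket in partition:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= local_support}

def son(partitions, support):
    # Candidate set: union of locally frequent itemsets.
    local_support = max(1, support // len(partitions))
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, local_support)
    # Pass 2 (reduce side): count each candidate over all the data,
    # then keep those meeting the global support threshold.
    counts = Counter()
    for part in partitions:
        for basket in part:
            items = set(basket)
            for cand in candidates:
                if set(cand) <= items:
                    counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= support}
```

Each pass maps naturally onto one MapReduce job, which is the structure a PSON-style parallelization exploits.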
Specifically, our work consists of the following parts: (1) We parallelized the QUBIC algorithm on the MapReduce paradigm and proposed concrete methods for data distribution and parallel computation, including the distributed storage of search engine log files, the construction of the Query-URL bipartite graph, the computation of the Jaccard similarity coefficient, the generation of the query affinity graph (QAG), the computation of connected components in the QAG, and the ranking of queries. (2) We parallelized the traditional SON frequent set mining algorithm on the MapReduce paradigm and proposed the PSON algorithm, which we then applied to mine frequent URLs by computing a set of related URLs from the search results and displaying them in order of importance. We implemented these parallel designs on the Hadoop platform and conducted experiments. The experimental results show that our parallelization of the original algorithms is effective and achieves good scale-up and speed-up. Finally, we implemented a prototype system demonstrating the running procedure of the parallel QUBIC algorithm and the frequent URL set mining algorithm, which confirmed the correctness and efficiency of the two algorithms.
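The bipartite-graph and Jaccard steps listed above can be sketched serially as follows (an illustration only, not the thesis's distributed code; the log format and similarity threshold are assumptions): queries are linked to the URLs clicked for them, and query pairs whose clicked-URL sets overlap strongly become edges of the query affinity graph.

```python
from collections import defaultdict

def build_bipartite(log):
    # log: iterable of (query, clicked_url) pairs extracted from
    # search engine log files; returns query -> set of clicked URLs.
    urls_of = defaultdict(set)
    for query, url in log:
        urls_of[query].add(url)
    return urls_of

def jaccard(a, b):
    # Jaccard similarity coefficient: |A & B| / |A | B|.
    return len(a & b) / len(a | b)

def similar_query_pairs(urls_of, threshold=0.5):
    # Emit QAG edges: query pairs whose URL sets are similar enough.
    queries = sorted(urls_of)
    edges = {}
    for i, q1 in enumerate(queries):
        for q2 in queries[i + 1:]:
            sim = jaccard(urls_of[q1], urls_of[q2])
            if sim >= threshold:
                edges[(q1, q2)] = sim
    return edges

log = [("mapreduce", "u1"), ("mapreduce", "u2"),
       ("hadoop", "u1"), ("hadoop", "u2"), ("hadoop", "u3")]
edges = similar_query_pairs(build_bipartite(log))
# one QAG edge: ('hadoop', 'mapreduce') with similarity 2/3
```

In the parallel setting, building the bipartite graph is a grouping job over the logs, and the pairwise similarity computation is distributed across reducers rather than done in a nested loop on one machine.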
Keywords/Search Tags: information retrieval, parallel computing, large-scale data, QUBIC, frequent set mining, MapReduce, Hadoop