A Novel Top-k Query Algorithm Based On MapReduce Framework

Posted on:2015-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:N Wang

Full Text:PDF

GTID:2298330431484132

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the growing scale of data and the increasing complexity of data types,mankind has entered the era of big data. On one hand, the amount of available datafrom different areas and applications is increasing sharply, on the other hand,traditional data processing technologies are unable to deal with such a large scale ofdata, thus brings a serious challenge to the current data processing technology. We areon an urgent demand to effectively dig out the useful information we need, thus wehave to put forward a faster and more effective query technique.Ranking queries, also known as top-k query, is one of various complex dataqueries. It works by calculating the scores of all objects by an aggregation function,and then return the top k objects with the highest scores. Top-k query has been vastlyused in many fields, such as information retrieval, multimedia databases, data mining,web searching and so on. For many interactive systems, efficient top-k query arecrucial. Top-k query problem also involves a lot of database processing technologies,including indexing and query optimization, etc. Top-k query is now drawing more andmore attention for the wide use and the efficiency, so it is now a hotspot in researcharea, and it is also has very important theory value and the very extensive applicationprospects.The severe challenge of the current large data age，also brings new bottlenecks tothe top-k query processing technology. In the face of large-scale data, the traditionaldatabase management technologies are no longer applicable. The new large datadistribution system, results in the inapplicable of traditional data query algorithms. Inaddition, top-k query algorithm performance requirements are stricter, especially formany real-time systems, whose communication costs and time costs must be limited.Therefore efficient top-k query algorithm suitable for large-scale data needs to beraised urgently.Currently, the academic and all sectors of society have done many researches on the top-k query algorithm. Top-k algorithms applied in the centralized database andsmaller distributed systems are gradually raising and maturing. Typical algorithmsthat are applicable to the centralized database include FA and TA. While thealgorithms used in small-scale distributed system are mostly improved from TAalgorithm. However, for the distributed systems with an enormous database, thesealgorithms are no longer applicable.In this paper, I mainly researched on the corresponding top-k query algorithm forlarge-scale data, and then made some innovation. Learning from the good features ofthe traditional top-k query processing technology and using parallel programmingmodel-MapReduce, I proposed a top-k query algorithm based on MapReduce. Thealgorithm is a preprocessing algorithm that requires a pre-established global indextable COIT (Candidate Objects Index Table). Which can significantly reduces dataexchange between nodes, and greatly improves the efficiency of top-k queries. Andbesides, the introduction of MapReduce improves parallelism of the algorithm, whichgives the algorithm an ability to achieve linear expansion. In addition, I also built aHadoop platform to test the performance of the new algorithm in this paper.Experiments show that the algorithm has good performance and good scalability.

Keywords/Search Tags:

MapReduce, Top-k, Big data, Que ry optimization, COIT

PDF Full Text Request

Related items

1	Implementation And Optimization Of Parallel Clustering Algorithm Based On MapReduce
2	The Application Case Study Of Mapreduce Parallel Computation And The Optimization Of Its Runtime Framework
3	Research On Performance Optimization Of MapReduce Model
4	Study On Performance Optimization Of MapReduce
5	Scheduling Optimization Research For MapReduce
6	Optimization For The Data Access Mode Of Mapreduce In HBase
7	MapReduce Job Oriented Collaborative Optimization On Cloud Data Center Network Resource
8	Research On Optimization Technology Of Data Parallel Processing Based On MapReduce
9	Research And Optimization Of Join Algorithm Based On MapReduce
10	Research On Data Cube Technology Based On MapReduce