Research on Key Technologies for Big Data Organization and Management in Distributed Environments

Posted on: 2015-09-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z K Chen
GTID: 1108330509460970
Subject: Computer Science and Technology

Abstract/Summary:
With the development of computer technology and the spread of electronic devices, the world has entered a digital era of information explosion, and data is everywhere in daily life. The data handled by distributed applications is growing at an unprecedented rate as the number of users and devices increases. Many distributed applications, especially Internet applications, have to face the challenges of big data. In a distributed environment, a big data system has to deliver higher operation performance than traditional systems. To improve system performance, designers can optimize not only the data processing technology but also the application's data management platform. The way the platform organizes and manages data directly affects the performance of the system and its operations, so research on big data organization and management in distributed environments is valuable from both a theoretical and a practical perspective.

In a distributed environment, big data has the following characteristics, among others: Volume (the data scale is large), Velocity (the data grows quickly), and Variety (the data structures are diverse). These bring new challenges for data organization and management. Based on an analysis of related work, this thesis focuses on data partitioning, fragment allocation and re-allocation, and indexing technology. The main contributions of this thesis are as follows.

1) In a distributed environment, a big data system needs to support multi-dimensional queries, load data quickly, and remain highly scalable. We studied a new data partitioning strategy based on hybrid range consistent hashing (HRCH). First, we select the attributes that are frequently queried in multi-dimensional queries and use a linearization technique to generate a one-dimensional key, which becomes the key of the table.
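The abstract does not say which linearization technique is used. As a minimal sketch, assuming a Z-order (Morton) curve and integer-valued attributes, several attribute values can be folded into one key by interleaving their bits. The function name `interleave_bits` and its parameters are hypothetical and chosen only for illustration.

```python
def interleave_bits(values, bits_per_dim):
    """Fold several non-negative integers into one Z-order (Morton) key.

    Assumption: Z-order is used here only as an example of a
    linearization technique; the thesis does not name its method.
    """
    limit = 1 << bits_per_dim
    for v in values:
        if not 0 <= v < limit:
            raise ValueError(f"value {v} does not fit in {bits_per_dim} bits")

    dims = len(values)
    key = 0
    for bit in range(bits_per_dim):
        for d, v in enumerate(values):
            # Bit `bit` of dimension `d` goes to position bit * dims + d.
            key |= ((v >> bit) & 1) << (bit * dims + d)
    return key


# Two queried attributes, each quantized to 3 bits, become one table key.
row_key = interleave_bits([3, 5], bits_per_dim=3)
```

Rows whose attribute values are close in every dimension tend to receive nearby keys, which is what makes a one-dimensional key usable for multi-dimensional queries.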
After that, we use consistent hashing to allocate data to node clusters. Finally, the data within each node cluster is partitioned with a range partitioning strategy. We use the Yahoo! Cloud Serving Benchmark (YCSB) to verify the efficiency of HRCH. As the experimental results indicate, the strategy performs worse than the traditional strategy in some cases, but it improves data loading, system scalability, and the range of query types the system can support.

2) The load of a system directly affects its performance. In a distributed environment, the computing model of big data systems has changed, which affects the load balancing strategy. In this thesis, we studied a new load-aware fragment allocation strategy (LAFAS). The computing model has shifted from “data close to computing” to “computing close to data”, so the location of fragments directly determines where computation runs and thus directly affects the load on the system's nodes. To balance system load, LAFAS controls where new fragments are placed as they are inserted into the system. First, we use information entropy theory to calculate the weights of the factors that affect system load, from which we calculate the load itself. Next, we prune the node cluster that can store the new fragment according to the load of each node. Finally, we use the initial allocation strategy to allocate the new fragment within the pruned node cluster. New fragments are therefore not allocated to heavily loaded nodes, which regulates the system's load. We verify the efficiency of LAFAS with simulation experiments.
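The abstract does not give the formulas behind the load calculation. The sketch below assumes the standard entropy weight method: a factor whose values differ more across nodes carries more information and receives a larger weight. The load factors, the assumption that each metric is already normalized to [0, 1] with higher meaning busier, the `max_load` threshold, and all function names are hypothetical.

```python
import math


def entropy_weights(metrics):
    """Weight each load factor by the entropy weight method.

    metrics: one row per node, one non-negative value per factor.
    """
    n, m = len(metrics), len(metrics[0])
    diversities = []
    for j in range(m):
        column = [row[j] for row in metrics]
        total = sum(column)
        if total == 0 or n < 2:
            diversities.append(0.0)
            continue
        entropy = 0.0
        for x in column:
            if x > 0:
                p = x / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # scale entropy into [0, 1]
        diversities.append(max(0.0, 1.0 - entropy))
    total_div = sum(diversities)
    if total_div == 0:
        return [1.0 / m] * m  # no factor is informative: equal weights
    return [d / total_div for d in diversities]


def node_loads(metrics, weights):
    """Combine the weighted factors into one load value per node."""
    return [sum(w * x for w, x in zip(weights, row)) for row in metrics]


def candidate_nodes(loads, max_load):
    """Keep only the nodes that are not heavily loaded."""
    return [i for i, load in enumerate(loads) if load <= max_load]


# Rows are nodes; columns are two assumed factors (CPU and disk use).
metrics = [[0.9, 0.5], [0.1, 0.5], [0.5, 0.5]]
weights = entropy_weights(metrics)
loads = node_loads(metrics, weights)
candidates = candidate_nodes(loads, max_load=0.6)
```

In this example the second factor is identical on every node, so nearly all of the weight falls on the first factor, and the busiest node is excluded from the candidates for the new fragment.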
As the experimental results indicate, the strategy regulates the load so that the system stays balanced, and it improves the performance of the system's operations.

3) In a distributed environment, raising the degree of parallelism of an operation not only fails to improve its performance but also increases its communication cost. We studied a new fragment re-allocation strategy based on a hypergraph model (FASBH) to address this problem. Because the computing model has changed, fragments that are accessed by the same operation have to be stored on the same node to reduce network communication cost. FASBH is based on this consideration. First, we select representative historical operations as training samples. Second, we use the hypergraph model to describe the degree of correlation among fragments. Third, we use a hypergraph partitioning algorithm to iteratively partition the fragment hypergraph, which reduces communication cost while preserving the parallelism of operations. Finally, we migrate fragments at the lowest migration cost. As the experimental results indicate, the hypergraph model describes fragment correlation better than the graph model, and operations perform better under the new re-allocation strategy than under traditional strategies (such as strategies based on the graph model).

4) A microblogging system is a special kind of big data application, but the efficiency of its real-time index is low. We studied a new real-time distributed index based on topics (RDIBT) to address this problem. First, RDIBT uses topic classification to infer the topic of every microblog. Then, we index each new microblog in the index of its topic. Every topic index is a hierarchical index, and a newly arriving microblog is only inserted into the lowest layer's index, which keeps index updates efficient.
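A layered index of this kind can be sketched as a small in-memory inverted index in which inserts touch only the lowest layer and a full layer is moved upward in one batch. This is an illustration under assumptions, not the RDIBT implementation: the class name, the layer capacities, the growth factor, and the use of plain posting lists are all hypothetical.

```python
from collections import defaultdict


class LayeredTopicIndex:
    """Inverted index for one topic, split into layers of growing size."""

    def __init__(self, base_capacity=4, growth=2, num_layers=3):
        # Capacity is counted in documents; the top layer is unbounded.
        self.capacities = [base_capacity * growth ** i for i in range(num_layers)]
        self.layers = [defaultdict(list) for _ in range(num_layers)]
        self.sizes = [0] * num_layers

    def insert(self, doc_id, terms):
        # A new microblog only modifies the lowest layer.
        for term in set(terms):
            self.layers[0][term].append(doc_id)
        self.sizes[0] += 1
        self._merge_full_layers()

    def _merge_full_layers(self):
        # When a layer is full, move all of its postings up in one batch.
        for i in range(len(self.layers) - 1):
            if self.sizes[i] < self.capacities[i]:
                break
            for term, postings in self.layers[i].items():
                self.layers[i + 1][term].extend(postings)
            self.sizes[i + 1] += self.sizes[i]
            self.layers[i] = defaultdict(list)
            self.sizes[i] = 0

    def search(self, term):
        # A query has to consult every layer.
        hits = []
        for layer in self.layers:
            hits.extend(layer.get(term, []))
        return hits


index = LayeredTopicIndex(base_capacity=2, growth=2, num_layers=2)
index.insert(1, ["big", "data"])
index.insert(2, ["big"])  # fills layer 0, which is merged into layer 1
index.insert(3, ["big"])
```

Keeping the lowest layer small bounds the work done per insert, while the batch merges amortize the cost of maintaining the larger layers. One such index per topic could then be placed on different nodes.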
Lower-layer indexes are merged into higher layers in batches, which also keeps index updates efficient. Finally, the system stores the topic indexes in a distributed manner, which improves search performance by processing searches in parallel. We use a real Twitter dataset to verify the efficiency of RDIBT. As the experimental results indicate, both the index update efficiency and the search performance of RDIBT are higher than those of LSII, and RDIBT also preserves the scalability of the index system.

In summary, this thesis analyzes the challenges and requirements for the performance and scalability of big data systems in distributed environments. It then studies in depth several key techniques, including data partitioning, fragment allocation and re-allocation, and real-time indexing. These techniques are useful and show strong prospects for improving the performance of big data application systems in distributed environments.
Keywords/Search Tags: Distributed Environment, Big Data, Data Organization and Management, Data Partition, Fragment Allocation, Fragment Re-allocation, Real-time Index