
Research On Key Technologies Of Massive Network Data Processing Platform Based On Hadoop

Posted on: 2015-08-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W H Lin
GTID: 1228330467963634
Subject: Signal and Information Processing
Abstract/Summary:
In recent years, with the advent of cloud computing technologies, Big Data technology has grown increasingly sophisticated. At the same time, China's domestic mobile internet market has attracted more than 0.5 billion mobile end users, who generate a tremendous amount of data every day. Under these circumstances, how to leverage cloud computing to store, process and analyze this data has become a valuable research topic.

From the telecom operators' perspective, the tremendous data volume forces them to invest in and upgrade network equipment and computing resources to maintain service quality. On the other hand, the emerging data business is cutting into the profit of the traditional voice and messaging business, which draws telecom operators into a low-profit vicious circle. How to take advantage of mobile internet data traffic is therefore significant to telecom operators. This dissertation analyzes the traits of mobile internet data traffic and researches the key technologies of a massive network data processing platform based on Hadoop. The main content and achievements are as follows.

1. A massive data processing architecture for the mobile Internet is constructed. We propose a secure cloud computing platform for distributed data collection, massive network data storage and data analysis. The platform is optimized for massive data processing in the mobile internet environment and comprises four modules: data acquisition, data storage, data processing and traffic security detection. It covers the life cycle from data collection to data processing. In addition, to improve the security and efficiency of the platform, we implement massive data storage, efficient data processing and rapid anomaly detection based on cloud computing technology. Experiments and practice validate the optimization, with improvements in security and service quality in particular. The remaining contributions address key technologies built on this platform architecture.

2. A highly reliable data collection framework based on a distributed fault detection mechanism is proposed. Data collection is the first step in the whole process. We need to ensure the integrity and credibility of the data; otherwise, any subsequent effort is meaningless. We list and analyze the characteristics and difficulties of data collection in the mobile internet, including its distributed nature, high dynamics, diversity of collection terminals and node heterogeneity. We then introduce a distributed network fault detection technology and design a distributed node monitoring framework for collecting mobile Internet network traffic data. The framework contains node failure detection and handling algorithms and a load-balancing algorithm. It monitors nodes in real time and provides efficient fault detection to avoid data loss. Meanwhile, to prevent overloading of individual nodes, the load-balancing algorithm spreads the workload among the nodes. Experimental results are positive: the framework achieves rapid failure detection and handling for collection nodes and dynamic load balancing across them, guaranteeing the reliability and integrity of mobile Internet traffic data collection.

3. An efficient algorithm for dynamic storage allocation in a cloud computing environment is achieved. It addresses the problem of improving the performance of Hadoop in heterogeneous environments. We study methods for evaluating server performance together with Hadoop technology, and then present a method for evaluating node performance in a cloud computing environment. Based on this method, we propose a dynamic storage allocation algorithm driven by node performance evaluation. The algorithm introduces a node performance parameter when storing data, so that the data distribution is tied to node performance.
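As a rough illustration of tying data distribution to node performance, the sketch below picks replica holders for each block with probability proportional to a per-node performance score. It is not the dissertation's algorithm: the function names, the score values and the weighted-sampling rule are all assumptions made for the example, and real HDFS placement also accounts for racks and free space.

```python
import random

def choose_datanodes(perf_scores, replicas=3, rng=None):
    """Pick distinct nodes for one block, favoring higher scores.

    perf_scores: hypothetical mapping of node name -> positive score.
    """
    rng = rng or random.Random()
    candidates = dict(perf_scores)
    chosen = []
    for _ in range(min(replicas, len(candidates))):
        nodes = list(candidates)
        weights = [candidates[n] for n in nodes]
        # Weighted draw: a node with twice the score is twice as likely.
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # replicas of a block go to distinct nodes
    return chosen

def place_blocks(num_blocks, perf_scores, replicas=3, seed=0):
    """Count how many block replicas each node receives."""
    rng = random.Random(seed)
    placement = {node: 0 for node in perf_scores}
    for _ in range(num_blocks):
        for node in choose_datanodes(perf_scores, replicas, rng):
            placement[node] += 1
    return placement
```

Under these assumptions, faster nodes end up holding more replicas, so more map tasks can be scheduled where their input already resides.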
Experiments show that this algorithm can improve the ratio of data-local map tasks and shorten job runtime in a heterogeneous cloud computing environment.

4. A MapReduce optimization algorithm based on dynamic performance inference in a heterogeneous cloud environment is proposed. Data processing and analysis is the core function of the platform, and the efficiency of data processing determines the performance of the massive network data processing platform. Performance optimization of data processing is therefore a key problem. Most cloud computing clusters are built in stages and have their hardware upgraded gradually. Because hardware generations change quickly, this inevitably leads to performance differences among the nodes of a cluster, so existing cloud computing clusters are mostly heterogeneous. Hadoop was originally designed for homogeneous clusters; when the cluster becomes heterogeneous, Hadoop's performance degrades. We therefore studied and designed a MapReduce task allocation algorithm based on dynamic inference of node performance.

First, a machine learning module is introduced into the MapReduce framework. This module learns from historical job information and calculates the data processing capacity of each node in the cluster using time series analysis and machine learning algorithms. Second, based on the learning result, two optimizations are made: (ⅰ) Reduce task assignment. The improved task assignment algorithm assigns Reduce tasks according to node performance to speed up job execution. (ⅱ) Speculative execution. The new speculative execution mechanism considers the performance and load of slave nodes before launching speculative tasks on suitable nodes, which avoids launching ineffective speculative tasks that waste cluster resources.
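The two optimizations above can be sketched in a few lines, under stated assumptions. The abstract does not name the time series model, so an exponentially weighted moving average stands in for it here. The function names, the throughput unit (MB/s) and the rule for deciding when to speculate are also hypothetical.

```python
def ewma_capacity(history, alpha=0.5):
    """Smooth a node's past throughput samples (MB/s, oldest first).

    Assumption: an EWMA stands in for the unspecified time series model.
    """
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate

def pick_reduce_node(capacity, free_slots):
    """Return the node with a free slot and the highest capacity, or None."""
    free = [n for n in capacity if free_slots.get(n, 0) > 0]
    return max(free, key=lambda n: capacity[n]) if free else None

def should_speculate(remaining_mb, total_mb, current_node, candidate,
                     capacity, load):
    """Launch a backup task only if it would finish before the original.

    load[candidate] is a hypothetical utilization in [0, 1]; a busy
    candidate offers only part of its estimated capacity.
    """
    time_left = remaining_mb / capacity[current_node]
    effective = capacity[candidate] * (1.0 - load[candidate])
    if effective <= 0:
        return False
    # The backup restarts from scratch, so it must process total_mb.
    return total_mb / effective < time_left
```

A scheduler built this way would refuse to start a backup task on a node that is itself slow or heavily loaded, which is the kind of wasted speculation the mechanism is meant to avoid.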
Finally, our experimental results show that this algorithm can effectively improve the processing performance and stability of heterogeneous clusters, reduce the waste of computing resources, and improve the resource utilization of cloud computing platforms.

5. A distributed anomaly detection algorithm based on a joint classifier is achieved. The platform holds massive mobile internet traffic data, and its powerful computing resources and huge storage resources make it a target for intruders. The access modes of cloud computing resources and their collaborative computing characteristics are conducive to intrusion, which makes cloud computing security more worrying, especially in large-scale environments. Meanwhile, cloud computing is cross-regional, heterogeneous and virtualized, and traditional intrusion detection technology can no longer meet the demand for information security in virtualized cloud environments. It is therefore necessary to apply intrusion detection technologies designed for virtualized cloud computing to strengthen information security protection. Building on previous research, we propose a cloud computing intrusion detection system based on a joint classifier. The method combines supervised and unsupervised classification algorithms and achieves high classification accuracy. We implement both classification algorithms on the MapReduce framework using Mahout, which greatly improves the efficiency of abnormal traffic detection. Finally, experiments show that the algorithm can efficiently process massive network traffic while maintaining high detection accuracy, and that it effectively improves the security of our cloud platform.
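To make the joint-classifier idea concrete, here is a single-machine sketch that combines a supervised and an unsupervised detector. The abstract does not say which algorithms are used or how their outputs are merged, so everything here is an assumption: a nearest-centroid classifier stands in for the supervised part, a tiny 2-means clustering for the unsupervised part, and a flow is flagged if either one marks it abnormal. The dissertation's implementation runs on MapReduce with Mahout, which this sketch does not attempt to reproduce.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def train_supervised(samples, labels):
    """Nearest-centroid model: one centroid per label."""
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict_supervised(model, point):
    """Label of the closest class centroid."""
    names = list(model)
    return names[nearest(point, [model[n] for n in names])]

def two_means(points, iterations=10):
    """Tiny 2-means clustering with a deterministic farthest-point start."""
    first = points[0]
    second = max(points, key=lambda p: math.dist(p, first))
    centers = [list(first), list(second)]
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [nearest(p, centers) for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = centroid(members)
    return assign

def joint_detect(model, batch):
    """Flag a flow if either detector considers it abnormal.

    Assumption: the smaller cluster is treated as the abnormal one
    (cluster 1 on a tie); the label "attack" is hypothetical.
    """
    clusters = two_means(batch)
    minority = 0 if clusters.count(0) < clusters.count(1) else 1
    return [predict_supervised(model, p) == "attack" or c == minority
            for p, c in zip(batch, clusters)]
```

The OR rule favors recall over precision; a stricter design could require both detectors to agree before raising an alert.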
Keywords/Search Tags: Mobile Internet, Flow Data, Hadoop, Data Processing