
Research And Implementation On The Key Technologies Of Fault Management In Large-Scale Data Center Network

Posted on: 2016-02-09 | Degree: Master | Type: Thesis
Country: China | Candidate: Y H Hu | Full Text: PDF
GTID: 2348330536467735 | Subject: Computer Science and Technology
Abstract/Summary:
Data centers are an important part of the Internet infrastructure, providing support for data storage, computing, and transmission. With the rapid development of network technology, and especially the arrival of the big-data era, data centers play an increasingly critical role in web services, search engines, e-commerce, social networks, online games, large-scale cluster computing, and other areas. Applications in large-scale data center networks impose strict requirements on network performance: administrators need a real-time, accurate view of the state of the whole network and of end-to-end communication performance, so that when performance degrades they can discover and locate the fault and resolve network congestion. This thesis proposes three new methods for efficient network fault diagnosis in data center networks, together with a prototype system:

1. Data center networks are large in scale, strongly heterogeneous, and carry rapidly changing, complex traffic, so existing data acquisition methods cannot meet the needs of network management. We first propose a distributed data acquisition method that obtains network information with a smaller collection overhead. On this basis, we propose a Simultaneous Adaptive Distributed Data Acquisition Method, which combines an adaptive acquisition strategy with a simultaneous multi-threading mechanism: it adaptively adjusts the acquisition cycle according to how quickly the network data change, and adjusts the granularity of the concurrent threads according to the network scale, which greatly improves acquisition efficiency.

2. To address the large volume of data and redundant information in data center networks, we propose an alarm correlation analysis method based on a redundancy-reduction mechanism. The collected alarm events are first normalized, and the topological correlation of the alarm data is then analyzed. According to this topological correlation, the alarms are partitioned into topological groups; each group is analyzed for temporal correlation and a correlation degree is computed. The fault location is determined from the correlation degree, which ultimately identifies the root-cause alarm of the failure.

3. Given the large amount of correlated information in a data center network, determining the most likely fault set behind a network anomaly is a challenging problem. We propose a Bayesian fault classification method with a self-learning mechanism. The Bayesian classifier is first trained and then used to classify network faults. When a fault is misclassified, the self-learning mechanism adds a new fault type to the fault-type library according to the fault's properties, continuously refining the classifier and improving classification accuracy.

4. We design and implement a prototype fault management system for data center networks. The system adopts a hierarchical architecture based on the Spring framework, consisting, from top to bottom, of a visualization layer, a decision layer, a network perception layer, and a resource layer. It applies the algorithms above to analyze the large volume of network data for fault diagnosis, provides real-time and accurate network-wide fault and performance views, and presents them to users in a friendly way through a visualization tool. The system has been deployed for network fault monitoring on Tianhe-2, with good application results.
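To make the third contribution concrete, the self-learning Bayesian classification idea can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the feature names (e.g. `link_flap`), the `SelfLearningNaiveBayes` class, and its methods are all hypothetical, and a categorical naive Bayes model with Laplace smoothing is assumed.

```python
from collections import defaultdict

class SelfLearningNaiveBayes:
    """Hypothetical sketch: a naive Bayes fault classifier whose
    self-learning step can register new fault types in the library
    when a sample is misclassified or unrecognized."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)  # fault type -> training-sample count
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # type -> feature -> count
        self.total = 0

    def train(self, features, fault_type):
        # Update counts for one labeled alarm-feature set.
        self.class_counts[fault_type] += 1
        self.total += 1
        for f in features:
            self.feature_counts[fault_type][f] += 1

    def _score(self, features, fault_type):
        # Prior times smoothed per-feature likelihoods (binary features).
        prior = self.class_counts[fault_type] / self.total
        n = self.class_counts[fault_type]
        score = prior
        for f in features:
            score *= (self.feature_counts[fault_type][f] + self.smoothing) / \
                     (n + 2 * self.smoothing)
        return score

    def classify(self, features):
        # Pick the fault type with the highest posterior score.
        return max(self.class_counts, key=lambda t: self._score(features, t))

    def learn_from_error(self, features, true_type):
        # Self-learning: an unseen true_type is added to the fault-type
        # library implicitly; the sample is folded back into training.
        self.train(features, true_type)
```

A usage sketch: after training on labeled alarm sets, `classify` returns the most probable fault type, and a misclassified sample is fed back through `learn_from_error`, which both extends the fault-type library and refines the counts.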
Keywords/Search Tags:data center network, InfiniBand, data collection, redundancy reduction, fault classification, network fault management