
Research And Implementation On The Key Technologies Of Fault Management In Large-Scale Data Center Network

Posted on: 2016-02-09 | Degree: Master | Type: Thesis
Country: China | Candidate: Y H Hu | Full Text: PDF
GTID: 2348330536467735 | Subject: Computer Science and Technology
Abstract/Summary:
Data centers are an important part of the Internet infrastructure, providing support for data storage, computing, and transmission. With the rapid development of network technology, and especially the arrival of the big-data era, data centers play an increasingly critical role in web services, search engines, e-commerce, social networks, online games, large-scale cluster computing, and other areas. Applications in large-scale data center networks impose strict requirements on network performance: administrators need a real-time, accurate view of the state of the whole network and of end-to-end communication performance, so that when performance degrades they can discover and locate the fault and resolve network congestion. This thesis proposes three new methods for efficient network fault diagnosis in data center networks, together with a prototype system:

1. Data center networks are large in scale, strongly heterogeneous, and carry rapidly changing, complex traffic, so existing data acquisition methods cannot meet the needs of network management. We first propose a distributed data acquisition method that obtains network information with a smaller collection overhead. On this basis, we propose a Simultaneous Adaptive Distributed Data Acquisition Method, which combines an adaptive acquisition strategy with a simultaneous multi-threading mechanism: it adaptively adjusts the acquisition cycle according to how quickly the network data change, and adjusts the granularity of the concurrent threads according to the network scale, which greatly improves acquisition efficiency.

2. To address the large volume of data and redundant information in data center networks, we propose an alarm correlation analysis method based on a redundancy-reduction mechanism. The collected alarm events are first normalized, and the topological correlation of the alarm data is then analyzed. According to this topological correlation, the alarms are partitioned into topological groups; each group is analyzed for temporal correlation and a correlation degree is computed. The fault location is determined from the correlation degree, which ultimately identifies the root-cause alarm of the failure.

3. Given the large amount of correlated information in a data center network, determining the most likely fault set behind a network anomaly is a challenging problem. We propose a Bayesian fault classification method with a self-learning mechanism. The Bayesian classifier is first trained and then used to classify network faults. When a fault is misclassified, the self-learning mechanism adds a new fault type to the fault-type library according to the fault's properties, continuously refining the classifier and improving classification accuracy.

4. We design and implement a prototype fault management system for data center networks. The system adopts a hierarchical architecture based on the Spring framework, consisting, from top to bottom, of a visualization layer, a decision layer, a network perception layer, and a resource layer. It applies the algorithms above to analyze the large volume of network data for fault diagnosis, provides real-time and accurate network-wide fault and performance views, and presents them to users in a friendly way through a visualization tool. The system has been deployed for network fault monitoring on Tianhe-2, with good application results.
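To make the third contribution concrete, the self-learning Bayesian classification idea can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the feature names (e.g. `link_flap`), the `SelfLearningNaiveBayes` class, and its methods are all hypothetical, and a categorical naive Bayes model with Laplace smoothing is assumed.

```python
from collections import defaultdict

class SelfLearningNaiveBayes:
    """Hypothetical sketch: a naive Bayes fault classifier whose
    self-learning step can register new fault types in the library
    when a sample is misclassified or unrecognized."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)  # fault type -> training-sample count
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # type -> feature -> count
        self.total = 0

    def train(self, features, fault_type):
        # Update counts for one labeled alarm-feature set.
        self.class_counts[fault_type] += 1
        self.total += 1
        for f in features:
            self.feature_counts[fault_type][f] += 1

    def _score(self, features, fault_type):
        # Prior times smoothed per-feature likelihoods (binary features).
        prior = self.class_counts[fault_type] / self.total
        n = self.class_counts[fault_type]
        score = prior
        for f in features:
            score *= (self.feature_counts[fault_type][f] + self.smoothing) / \
                     (n + 2 * self.smoothing)
        return score

    def classify(self, features):
        # Pick the fault type with the highest posterior score.
        return max(self.class_counts, key=lambda t: self._score(features, t))

    def learn_from_error(self, features, true_type):
        # Self-learning: an unseen true_type is added to the fault-type
        # library implicitly; the sample is folded back into training.
        self.train(features, true_type)
```

A usage sketch: after training on labeled alarm sets, `classify` returns the most probable fault type, and a misclassified sample is fed back through `learn_from_error`, which both extends the fault-type library and refines the counts.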
Keywords/Search Tags:data center network, InfiniBand, data collection, redundancy reduction, fault classification, network fault management