The Internet has brought great convenience to people’s lives and has significantly changed the way we live. With its growing popularity and the rapid development of network technology, the Internet is becoming the major tool for communication and the major channel for information sharing. However, as the number of Internet users grows, the continuous development of new network technologies brings both convenience to network users and enormous challenges to network management.

Traditional network analysis methods may encounter serious difficulties when facing large-scale traffic. First, emerging network applications use distributed architectures and encryption protocols on a large scale, and apply sophisticated communication processes to ensure the completeness of their services. This greatly changes the composition of Internet traffic and makes it hard for traditional protocol identification methods to identify critical services and applications effectively. Second, class imbalance problems caused by the large-scale network environment seriously affect the accuracy of protocol identification methods based on machine learning techniques and reduce their practical effect. Third, Internet traffic is expanding rapidly and shows the characteristics of big data, which poses a serious challenge to traditional network protocol analysis methods.

Therefore, based on the above background and issues, we conduct our investigation in the following steps:
(1) For the difficulties in identifying sophisticated applications, we propose a traffic-aware method for application identification. Based on an analysis of the communication characteristics of sophisticated applications, this method perceives the traffic generated by those applications from three dimensions (i.e., 
time, space, and traffic) and builds an integrated model that can effectively identify the traffic from sophisticated applications;
(2) For the class imbalance problem that seriously degrades identification accuracy in large-scale network environments, we propose the SAIMM method, based on the Min-Max principle and the integration of classifiers. SAIMM is derived from an analysis of the causes and mechanisms of the serious false positives produced by the large disparity in scale between positive and negative traffic in the network, and it addresses the poor performance in identifying small-class application traffic. The identification performance for small-class application traffic is improved while the overall identification accuracy is preserved;
(3) For the difficulties in analyzing big traffic data and mining the private networks of sophisticated applications, we propose a flow-field method that abstracts hosts, servers, network traffic, etc. as nodes and communication behaviors. It uses flow-field mining and traffic association approaches to mine the massive network information, construct the private networks of sophisticated applications, dissect their run-time mechanisms, and support better network management and early warning;
(4) Based on the results of our research in the above three aspects, we present Spider Web, a traffic identification and analysis system for sophisticated applications that we designed and implemented. It comprises a traffic preprocessing module, an application identification module, a massive log storage module, a flow-field mining module, and a visual representation module. 
Our experiments show that Spider Web can effectively solve the identification and analysis problems for sophisticated applications in large-scale network environments.

The proposed application traffic identification and private network mining technologies can address a number of problems and challenges in identifying and analyzing sophisticated applications in large-scale network environments, and they can effectively enhance the network management capabilities of network operators.