Design And Implementation Of Multi-source Data Acquisition And Analytic System

Posted on: 2018-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Lan
GTID: 2348330512988035
Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, network security problems are becoming increasingly serious. Large volumes of network attack monitoring data, recorded as text, are collected for analysis, and analyzing them manually is impossible. An automatic analysis system is therefore urgently needed for tasks such as IP location, statistics, and dimension calculation.
The system described in this paper is a data warehouse system with two major categories of data: raw data and the IP address library (referred to as the "IP library"). Both categories are "multi-source". The raw data is multi-source because it is generated by different collection systems, with different types and formats; the IP library data is multi-source in the sense reflected by the three-layer IP library model. The raw data contains the basic attributes SourIP (source IP, the attacker) and DestIP (destination IP, the attacked host), and the IP library data is used to locate them. This is the core function of the system.
Faced with vast amounts of data that require fast and accurate dimension analysis, this paper explains why a composite architecture of a distributed system (Apache Hadoop) and a relational database (SQL Server) was chosen, and how these technologies are used to build the data warehouse, specifically how the ETL is modeled and implemented.
The first part is the ETL of the raw data. After the original files are collected and loaded into Hadoop HDFS, an API is called to extract the data from HDFS into the Hive data warehouse, and MapReduce programs process, clean, and merge the data of various formats to generate "consistent" data, whose model is called the "five-tuple model".
The second part is the ETL of the IP library data. The five-tuple data, in text form, is transferred to SQL Server. In addition to loading this data, SQL Server loads the IP library data, which serves as an important dictionary.
This paper describes how to build a "three-tier IP library model" to meet differing requirements for IP positioning precision. Each layer of the IP library is associated with the national administrative divisions, which contain at least three levels of geographic division: province (municipality), city (district), and district (county). Reorganizing the collected IP library data into these three levels accounts for a good part of the ETL workload.
After the data warehouse is built, a website with a B/S (browser/server) architecture is built on top of it; the site calls T-SQL (the SQL-based programming language provided by SQL Server) to query and analyze the contents of the data warehouse. The web UI (user interface) also implements the business functions of the system: user operations, user permissions, user management, data visualization, and so on. It also supports ad hoc queries and retrieval, displays trends, statistics, charts, and data, and exports reports.
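The IP location step can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which the abstract says is built on SQL Server and T-SQL. It assumes each IP library layer is a list of non-overlapping IPv4 ranges and that layers are consulted from most precise to least precise; the names `IpRange`, `IpLibrary`, and `locate`, the region labels, and the sample ranges (taken from the reserved documentation address blocks) are all hypothetical.

```python
import bisect
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IpRange:
    """One IP library row: an inclusive IPv4 range and its region label."""
    start: int
    end: int
    region: str


def ip_to_int(ip: str) -> int:
    """Convert dotted-quad IPv4 text to an integer for range comparison."""
    return int(ipaddress.IPv4Address(ip))


def make_range(first: str, last: str, region: str) -> IpRange:
    return IpRange(ip_to_int(first), ip_to_int(last), region)


class IpLibrary:
    """A single layer; assumes its ranges do not overlap (an assumption)."""

    def __init__(self, ranges: List[IpRange]) -> None:
        self._ranges = sorted(ranges, key=lambda r: r.start)
        self._starts = [r.start for r in self._ranges]

    def lookup(self, ip: str) -> Optional[str]:
        value = ip_to_int(ip)
        # Find the last range whose start is <= value, then check its end.
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._ranges[i].end:
            return self._ranges[i].region
        return None


def locate(ip: str, layers: List[IpLibrary]) -> Optional[str]:
    """Return the first hit, trying layers in the order given."""
    for layer in layers:
        region = layer.lookup(ip)
        if region is not None:
            return region
    return None


# Hypothetical sample data, ordered from most to least precise.
county_layer = IpLibrary(
    [make_range("192.0.2.0", "192.0.2.127", "Province A/City B/County C")]
)
city_layer = IpLibrary(
    [make_range("192.0.2.0", "192.0.2.255", "Province A/City B")]
)
province_layer = IpLibrary(
    [
        make_range("192.0.2.0", "192.0.2.255", "Province A"),
        make_range("198.51.100.0", "198.51.100.255", "Province D"),
    ]
)
LAYERS = [county_layer, city_layer, province_layer]

if __name__ == "__main__":
    for addr in ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.1"]:
        print(addr, "->", locate(addr, LAYERS))
```

In this sketch, a SourIP or DestIP value falls back to a coarser layer only when the more precise layer has no matching range, and an address covered by no layer yields `None`. In a relational database the same lookup would typically be a range join between the five-tuple table and the IP library tables.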
Keywords/Search Tags: multi-source data analysis, IP library design, ETL, Hadoop, T-SQL