With the popularization of network and information technology, people's consumption habits are changing. Benefiting from the convenience of the Internet, online shopping has become increasingly popular, generating ever larger volumes of data in the postal service and posing new challenges to the enterprise. Traditional systems based on relational databases are inefficient at handling such massive data, and as the data volume grows, scaling these systems is costly. The goal of this thesis is therefore to build a new delivery data analysis system. The main contents of this thesis include the following aspects:

(1) The author surveyed existing big data processing technology, which is dominated by two modes: batch computation and stream computation. After comparison and analysis, Hadoop and Storm were selected as the core analysis and computation components of this system. The author studied the principles of Hadoop, Hive, Flume, Kafka, Storm, and other open-source software, and gained a good understanding of the two modes of big data processing.

(2) The construction goals of the system are proposed based on an analysis of the bottlenecks of the existing "safety supervision" system. These bottlenecks lie mainly in the performance of complex analyses: when a single table reaches 50 million records, the more complex statistical queries take over 500 seconds and the system times out. The goal of this system is to overcome these bottlenecks of the "safety supervision" system.

(3) The system was implemented according to the architecture designed by the author. It consists of four modules: a data acquisition module, a data preprocessing module, a data storage and analysis module, and a data display module. The data acquisition module is the foundation of the system. The "safety supervision" system uses Log4j to record system logs; the big data analysis system
uses Flume to collect the log files and write them to HDFS. For structured data in relational databases, the system uses Java programs to extract the data on a regular schedule. For real-time collection, the system uses Flume to forward Log4j log messages directly into Kafka. The data preprocessing module is one of the important components of the system: it transforms the raw data into "clean", reliable data. The data storage and analysis module is the core of the system. According to the business requirements, this thesis applies three different data analysis and processing technologies, Hive, MapReduce, and Storm, to analyze the data. The data display module presents the results of the system; it is designed and implemented with the mainstream J2EE architecture and the MVC programming pattern to provide users with a friendly display interface.

(4) The system environment was set up, and the system was tested and verified. The author built a 20-node Hadoop cluster and a 5-node Storm cluster to test the system. Experiments show that when a single table exceeds 50 million records, the analysis time of this system is reduced to about 100 seconds, fully meeting the design requirements.
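The dual-path acquisition described above (Log4j logs to HDFS for batch analysis, and to Kafka for real-time analysis) could be expressed as a single Flume agent with one source fanned out to two sinks. The sketch below is illustrative only: the agent name, ports, HDFS path, and Kafka topic are assumptions, not taken from the thesis.

```properties
# Hypothetical Flume agent: one Avro source (fed by a Log4j Flume
# appender) fanned out to an HDFS sink (batch path for Hive/MapReduce)
# and a Kafka sink (real-time path consumed by Storm).
agent.sources  = log4j-src
agent.channels = hdfs-ch kafka-ch
agent.sinks    = hdfs-sink kafka-sink

# The Log4j appender on the "safety supervision" system ships events here.
agent.sources.log4j-src.type = avro
agent.sources.log4j-src.bind = 0.0.0.0
agent.sources.log4j-src.port = 44444
# Replicating the flow to both channels implements the fan-out.
agent.sources.log4j-src.channels = hdfs-ch kafka-ch

agent.channels.hdfs-ch.type  = memory
agent.channels.kafka-ch.type = memory

# Batch path: land raw logs on HDFS, partitioned by day.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = hdfs-ch
agent.sinks.hdfs-sink.hdfs.path = /delivery/logs/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

# Real-time path: publish the same events to a Kafka topic.
agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.channel = kafka-ch
agent.sinks.kafka-sink.kafka.topic = delivery-logs
agent.sinks.kafka-sink.kafka.bootstrap.servers = broker1:9092
```

Fanning out at the source keeps the batch and real-time paths decoupled: a slow HDFS write does not delay the events Storm consumes from Kafka.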
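On the batch path, the kind of statistical query that timed out in the relational "safety supervision" system could be run in Hive over the log files Flume lands on HDFS. The table name, columns, and delimiter below are hypothetical, chosen only to illustrate the pattern.

```sql
-- Hypothetical Hive external table over the raw logs on HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS delivery_log (
  parcel_id  STRING,
  branch_id  STRING,
  event_time STRING,
  status     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/delivery/logs';

-- A representative aggregate: daily parcel counts per branch.
-- Hive compiles this into MapReduce jobs that scan the data in
-- parallel across the Hadoop cluster.
SELECT branch_id,
       to_date(event_time) AS day,
       COUNT(*)            AS parcels
FROM delivery_log
GROUP BY branch_id, to_date(event_time);
```

Because the table is EXTERNAL, dropping it removes only the metadata; the log files Flume wrote remain on HDFS for other jobs.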