Research On Log Data Real-time Processing Based On Storm And Hadoop

Posted on: 2018-07-17 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Li | Full Text: PDF
GTID: 2348330536473557 | Subject: Computer system architecture
Abstract/Summary:
Log data contains rich information about systems and network user behavior, so it has high practical value in network management, user behavior analysis, and other fields. In the era of big data, the volume of log data generated per unit time has grown exponentially. The diversity, heterogeneity, and dynamic change of log data pose challenges for its collection, storage, and in-depth analysis. Traditional log processing runs on a single-node server, which is not scalable and is tightly limited in CPU, I/O, and storage performance. Practical applications now place increasingly strict demands on the response time of log analysis, and real-time, high-throughput parallel computing over big data has become a basic requirement of log data processing. In real-time application scenarios, stream computing can process log stream data as it arrives and extract knowledge from a small data set within a given time period. However, the limited amount of data restricts the applicable algorithms and the reliability of the results. The knowledge extracted and relied on in real-time computation therefore needs to be combined with the results of offline batch analysis over large-scale offline data.

To address the main problems in collecting, storing, and analyzing rapidly growing log data, and in extracting and integrating knowledge from offline and real-time data, this paper studies the theory and practice of big data technologies. On top of the distributed infrastructure Hadoop, it builds a real-time log data processing platform that integrates two different computing frameworks, MapReduce and Storm, at the resource scheduling level through Storm on YARN. It uses Flume and HBase to collect and store log data, and uses high-throughput MapReduce to extract global knowledge from large-scale offline data. It uses Storm to extract burst knowledge from small- and medium-scale data in the Kafka buffer, and combines this with the global knowledge to perform real-time computation on the stream data, which guarantees real-time performance and improves accuracy. The main contents and results of this paper are as follows.

(1) Research on a real-time processing platform for log data. The paper built a platform architecture with three layers: a data service layer responsible for data collection and storage, a business logic layer for data analysis, and a Web presentation layer for data visualization. A shared knowledge base was used to combine offline analysis with real-time analysis. Hadoop, Storm, Flume, HBase, Kafka, and other big data components were integrated to implement the overall architecture in a distributed cluster environment.

(2) Distributed collection and storage of log data. Using Flume, log data obtained from multi-source front-end servers were stored in the distributed database HBase in near real time, and pre-partitioning and random hashing of the RowKey were used to optimize HBase. The experimental results showed that the platform effectively completed near-real-time collection and storage of front-end server log data. Compared with the unoptimized configuration, the optimized HBase cluster made full use of the cluster's I/O and CPU resources during log storage, balanced the load, and effectively solved the HBase "hot spot" problem.

(3) In-depth analysis of offline log data based on MapReduce. Traditional data mining algorithms were parallelized with the MapReduce computing model and ported to the platform to extract knowledge from the historical log data in HBase and store it in the offline knowledge base. For the practical application, K-means and Apriori were parallelized in the MapReduce distributed environment to perform clustering analysis and association rule analysis. The experimental results showed that the platform could effectively extract highly reliable knowledge from historical log data. With MapReduce parallelization, the in-depth analysis achieved higher efficiency and scalability, and it fully met the application requirements of knowledge extraction from large-scale log data.

(4) Real-time computation of log stream data based on Storm. Storm and Kafka were integrated to give the real-time computation stable access to log stream data sources. Traditional data mining algorithms combined with the Storm model were used to extract knowledge from small-scale real-time data within a given time window, and the information in the shared knowledge base was used as decision support for Storm's real-time stream computation on the log data, which completed the combination of offline and real-time computation. For the practical application, K-means, KNN, and other algorithms were combined to identify network anomalies. The experimental results showed that the platform could effectively extract unexpected knowledge from real-time data and, relying on the shared knowledge base, achieve high-precision continuous real-time computation. The use of Storm gives the real-time analysis higher real-time performance and shows significant advantages in streaming data processing.

In summary, the log data real-time processing platform constructed in this paper could effectively solve the problems of data collection, storage, and knowledge extraction, and it combines the high throughput of Hadoop with the real-time advantages of Storm. It used MapReduce to extract the global knowledge hidden in historical log data. Based on Storm, it extracted burst knowledge from small-scale real-time log data and used Storm's conventional stream processing, combined with the extracted knowledge, to perform continuous real-time computation on the real-time log data. It can provide a new technical reference for building log data collection, storage, and analysis systems, and has a certain degree of soundness and value for wider adoption.
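
The RowKey hashing and pre-partitioning mentioned in point (2) can be illustrated with a small sketch. This is not the thesis's implementation: the key layout (`host|timestamp`), the two-digit salt prefix, the region count of 8, and the function names are all assumptions chosen for illustration, and the sketch only models key generation in plain Python rather than calling HBase. The idea is that a hash-derived prefix scatters time-ordered log rows across pre-split regions instead of concentrating writes on one "hot" region.

```python
import bisect
import hashlib

NUM_REGIONS = 8  # assumed number of pre-split regions (hypothetical value)


def split_points(num_regions: int) -> list[str]:
    """Boundary keys for pre-splitting a table on two-digit salt prefixes."""
    return [f"{i:02d}" for i in range(1, num_regions)]


def salted_row_key(host: str, timestamp_ms: int,
                   num_regions: int = NUM_REGIONS) -> str:
    """Prefix the natural key with a hash-derived salt.

    The salt is deterministic, so a reader that knows the natural key
    can recompute the full RowKey.
    """
    natural_key = f"{host}|{timestamp_ms}"
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    salt = int.from_bytes(digest[:4], "big") % num_regions
    return f"{salt:02d}|{natural_key}"


def region_of(row_key: str, points: list[str]) -> int:
    """Index of the pre-split region whose key range contains row_key."""
    return bisect.bisect_right(points, row_key)


if __name__ == "__main__":
    points = split_points(NUM_REGIONS)
    counts = [0] * NUM_REGIONS
    # Consecutive timestamps from one server: a classic hot-spot pattern
    # when the timestamp is used as the leading part of the RowKey.
    for ts in range(1_500_000_000_000, 1_500_000_001_000):
        counts[region_of(salted_row_key("web-01", ts), points)] += 1
    print(counts)
```

The trade-off of salting is that a time-range scan must fan out over every salt prefix, so this layout favors write throughput over scan simplicity.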
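
The MapReduce parallelization of K-means described in point (3) follows a well-known pattern: the map step assigns each record to its nearest centroid, and the reduce step averages the records of each cluster to produce new centroids. The sketch below shows that pattern in single-process Python. It is an illustration only, with hypothetical function names, and it does not reproduce the thesis's Hadoop job or its feature extraction from log records.

```python
from collections import defaultdict

Point = tuple[float, ...]


def map_assign(point: Point, centroids: list[Point]) -> tuple[int, Point]:
    """Map step: emit (index of nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def reduce_mean(points: list[Point]) -> Point:
    """Reduce step: the new centroid is the mean of the cluster's points."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def kmeans_iteration(points: list[Point],
                     centroids: list[Point]) -> list[Point]:
    """One map/shuffle/reduce round; on a cluster each step runs in parallel."""
    groups: dict[int, list[Point]] = defaultdict(list)
    for point in points:
        idx, value = map_assign(point, centroids)
        groups[idx].append(value)  # stands in for the shuffle by key
    # A centroid that attracted no points is kept unchanged.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points: list[Point], centroids: list[Point],
           max_iter: int = 20) -> list[Point]:
    """Repeat rounds until the centroids stop moving or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Each iteration corresponds to one MapReduce job, so the number of iterations, more than the per-record work, tends to dominate the total running time in a batch setting.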
Keywords/Search Tags:Log Data Real-time Processing, Hadoop, Storm, Flume, HBase