
Anomaly Detection Of Large Data In Time Series Based On Hadoop Platform

Posted on: 2019-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: T X Zhang
Full Text: PDF
GTID: 2428330566469770
Subject: Software engineering
Abstract/Summary:
Time series are ubiquitous in daily life and work. In recent years, the rapid development of technologies such as sensor networks, the Internet of Things, cloud data centers, and the mobile Internet has caused time-series data to explode, and these data have several distinctive characteristics. First, time series are produced continuously as data streams and at very large scale, so their processing faces high sampling frequency, very long sequences, and huge data volumes. Second, time-series data are high-dimensional with diverse features, and both indexing accuracy and processing efficiency still need improvement. Research on time series has therefore attracted growing attention.

To detect anomalous time series, researchers have proposed statistical models and data-mining methods that find outliers by comparing correlations among sequence values. The commonly used linear models are the Auto-Regressive Moving Average (ARMA) model, applied mainly to stationary series, and the Autoregressive Integrated Moving Average (ARIMA) model, applied mainly to non-stationary series. Commonly used nonlinear models include the Hidden Markov Model (HMM) and Artificial Neural Network (ANN) models. However, as the volume of time-series data grows, the existing non-distributed outlier-detection methods have become inefficient.

The host communication traffic data studied here form a time series, and the large data volume makes anomaly detection inefficient; this thesis addresses the problem with the Hadoop distributed platform. First, an ARIMA model is trained on the training set on a single machine and optimized with a double sliding window and a residual method to improve its accuracy, but the expected effect is not achieved. An HMM is then trained on a single machine. Building on the original model, three problems are addressed: numerical underflow in the algorithm, the oversized probability transition matrix, and the observation probability P(O|λ) becoming too small. The optimized HMM is trained on the training set, and its parameters are tuned according to the training results to improve accuracy. The experimental results show that the optimized HMM is more accurate than the ARIMA model. As the base algorithm for anomaly detection, the HMM also reduces complexity, because it does not need to be tuned separately for each type of anomaly, and it retains some ability to detect previously unknown outliers.

For the test set, a distributed Euclidean-distance algorithm, the distributed optimized ARIMA model, and the distributed optimized HMM are used to detect anomalies, and a comparative experiment is designed and implemented to measure the differences between the distributed algorithms. The results show that the Hadoop-based optimized HMM is the most accurate on massive traffic data. Finally, this work provides a domestic bank with a feasible scheme for traffic anomaly detection on big data and builds a visual traffic-monitoring platform for it.
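The residual-based screening described above can be illustrated with a simplified single-window variant (the thesis uses a double sliding window; the window length, threshold k, and toy series below are illustrative assumptions, not the thesis's settings or data):

```python
import numpy as np

def sliding_window_residuals(series, window=5, k=3.0):
    """Flag points whose residual against a trailing-window mean
    exceeds k standard deviations of that window.

    A minimal sketch of residual-based anomaly screening; a small
    epsilon keeps a perfectly flat window from masking a spike.
    """
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        w = series[i - window:i]           # trailing window
        resid = series[i] - w.mean()       # residual vs. window forecast
        if abs(resid) > k * w.std() + 1e-9:
            flags[i] = True
    return flags

# A flat toy series with one injected spike: only the spike is flagged.
data = [10.0] * 20
data[12] = 50.0
print(np.where(sliding_window_residuals(data))[0])  # → [12]
```

The trailing window means each point is judged only against its own recent history, which suits streaming traffic data where the series arrives continuously.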
Keywords/Search Tags: big data, Hadoop, ARIMA, anomaly detection, HMM