
Research On Distributed SVM Algorithm Over Data Stream

Posted on: 2018-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y. Hou
Full Text: PDF
GTID: 2428330623950968
Subject: Computer Science and Technology
Abstract/Summary:
Support Vector Machine (SVM) is a classical learning algorithm based on statistical learning theory. It has a complete theoretical foundation, good generalization performance, and the ability to solve high-dimensional nonlinear problems, and it is widely used in classification and prediction tasks such as image processing, text classification, speech recognition, and anomaly detection. Online classification and prediction problems such as sensor network monitoring, fault prediction, and intrusion detection are becoming increasingly common. Unlike traditional batch data, stream data is characterized by large volume, real-time arrival, and dynamic change, so traditional batch SVM algorithms cannot directly meet the requirements of stream processing. Research on data-stream-oriented SVM algorithms is therefore of great significance and practical value for online classification and prediction.

Because of the large data volume, serial algorithms cannot meet the real-time and efficiency requirements, so the efficiency of the algorithm must be improved through parallelism. However, owing to its high computational complexity and strong parameter coupling, SVM suffers from low parallel efficiency and limited speedup. Meanwhile, to meet the dynamic access requirements of data centers in different regions, provide users with more efficient and higher-quality services, and improve reliability, more and more cross-regional data centers are being deployed. Since cross-data-center processing faces high communication cost, unstable network environments, and privacy-protection concerns, implementing a cross-data-center algorithm with low communication overhead that avoids transmitting raw data has become an urgent problem. This paper studies the time overhead and communication overhead of SVM algorithms over data streams at three levels: the single-node level, the multi-node level, and
the cross-data-center level. The main work is as follows.

At the single-node level, SVM incremental learning usually uses historical data together with newly arrived data as the training set for the model update. In a data-stream environment, data arrives in real time and its volume is unbounded; with the traditional approach, the support vectors accumulate continuously, which increases the time overhead of incremental learning. To address this, this paper proposes a Reserving Partial Support Vectors Incremental Learning algorithm (PSVIL) that preserves only a subset of the support vectors. First, the support vectors are divided into groups according to their distance to the classification hyperplane, and the number of support vectors to select from each group is computed in proportion to the group's size. Then, within each group, support vectors are chosen to be as scattered as possible by maximizing the spread of the selected vectors. The vectors selected from all groups are merged into a historical support vector subset that carries the historical information into the model update. Finally, this subset is combined with the new data to form the incremental training set, and a new model is trained to complete the update. Experimental results show that the accuracy of the proposed algorithm is within 0.1% of the traditional approach while the average time cost is reduced by 68%.

At the multi-node level, existing distributed SVM algorithms usually distribute new data randomly across the nodes, train a local model at each node from its partition together with the global support vectors, and then gather the local support vectors trained at each node as the training set for the global model at a global node. This approach, however, achieves only limited speedup. This paper therefore presents a Uniform Data Partitioning Distributed SVM
Algorithm (UDPSVM). The algorithm first partitions both the new data and the historical support vectors across the nodes; each node then trains a local model on its subsets of new data and historical support vectors, and the parameters of the local models are aggregated into the global model. New data are partitioned by first clustering each class with the K-means algorithm and then distributing each cluster's points evenly across the nodes, so that the distribution of new data at every node is similar to that of the complete set. Historical support vectors are partitioned by first grouping them according to their distance to the classification hyperplane and then distributing each group evenly across the nodes, so that the distribution of distances to the hyperplane at every node approximates that of the complete set of historical support vectors. With this partitioning, each node's data distribution approximates the global one; consequently, each local model approximates the global model, and the global model is obtained by averaging the local model parameters. Experimental results show that the accuracy loss of the proposed algorithm is less than 0.1% while its speedup is 2-8 times higher than that of existing algorithms.

At the cross-data-center level, this paper proposes a Geo-Distributed SVM algorithm (GDSVM) to address the high communication overhead caused by aggregating raw data into a single data center. Each data center trains its local model separately; the local models are combined by weighted averaging into the global model, and every data center stores a copy of the global model. In the classification stage, newly arrived data are classified directly at the local data center, then stored and labeled, and the accuracies of the global model and the local model are computed separately. In the incremental learning phase, the data centers exchange their local classification accuracies and adjust
their weights in the global model according to the classification accuracy of each local model. The algorithm avoids large-scale transmission of raw data and thus reduces communication cost. Experimental results show that, compared with aggregating the raw data into a single data center, the accuracy loss of the algorithm is less than 0.5% while the average traffic is reduced by 52%.
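The PSVIL selection step at the single-node level can be illustrated with a short sketch. This is not the thesis's implementation: the grouping by distance quantiles, the `keep_ratio` parameter, and the greedy farthest-point selection inside each group are illustrative assumptions standing in for the distance grouping and "as scattered as possible" criterion described above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

def select_partial_svs(model, X_sv, keep_ratio=0.3, n_groups=5):
    """Keep a scattered subset of support vectors, grouped by distance
    to the hyperplane (illustrative sketch, binary classification)."""
    d = np.abs(model.decision_function(X_sv))          # distance to hyperplane
    edges = np.quantile(d, np.linspace(0, 1, n_groups + 1))
    edges[-1] += 1e-9                                  # make last group inclusive
    keep = []
    for g in range(n_groups):
        idx = np.where((d >= edges[g]) & (d < edges[g + 1]))[0]
        if len(idx) == 0:
            continue
        k = max(1, int(round(keep_ratio * len(idx))))  # proportional quota
        chosen = [idx[0]]
        while len(chosen) < k:                         # greedy farthest-point pick
            rem = np.setdiff1d(idx, chosen)
            dist = np.linalg.norm(
                X_sv[rem][:, None] - X_sv[chosen][None], axis=2).min(axis=1)
            chosen.append(rem[dist.argmax()])
        keep.extend(chosen)
    return np.array(sorted(set(keep)))

# Incremental step: pruned historical support vectors + new batch.
X, y = make_blobs(n_samples=600, centers=[[-4, 0], [4, 0]],
                  cluster_std=3.0, random_state=0)
X0, y0, X1, y1 = X[:400], y[:400], X[400:], y[400:]

model = SVC(kernel="linear").fit(X0, y0)
X_sv, y_sv = X0[model.support_], y0[model.support_]
keep = select_partial_svs(model, X_sv)

X_inc = np.vstack([X_sv[keep], X1])                    # small training set
y_inc = np.concatenate([y_sv[keep], y1])
model2 = SVC(kernel="linear").fit(X_inc, y_inc)        # updated model
```

Because only the retained support vectors re-enter training, the incremental training set stays bounded as the stream progresses, which is the source of the time savings the abstract reports.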
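The UDPSVM partitioning idea at the multi-node level might be sketched as follows. The cluster count, the node count, and the simple round-robin dealing of each cluster's points are assumptions for illustration, and parameter aggregation is shown as a plain average of linear-model weights rather than the thesis's exact aggregation rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

def partition_new_data(X, y, n_nodes, n_clusters=4, seed=0):
    """Cluster each class with K-means, then deal every cluster's points
    round-robin across nodes so each node sees a similar distribution."""
    parts = [[] for _ in range(n_nodes)]
    for c in np.unique(y):
        cls_idx = np.where(y == c)[0]
        km = KMeans(n_clusters=min(n_clusters, len(cls_idx)),
                    n_init=10, random_state=seed).fit(X[cls_idx])
        for cl in range(km.n_clusters):
            for i, m in enumerate(cls_idx[km.labels_ == cl]):
                parts[i % n_nodes].append(m)
    return [np.array(sorted(p)) for p in parts]

X, y = make_blobs(n_samples=400, centers=[[-4, 0], [4, 0]],
                  cluster_std=2.0, random_state=0)
parts = partition_new_data(X, y, n_nodes=4)

# Each node trains a local linear model; the global model is the
# average of the local model parameters.
local_models = [LinearSVC(max_iter=5000).fit(X[p], y[p]) for p in parts]
w = np.mean([m.coef_ for m in local_models], axis=0)
b = np.mean([m.intercept_ for m in local_models], axis=0)
global_pred = (X @ w.ravel() + b > 0).astype(int)
```

Because every node's partition mirrors the global class and cluster structure, the local models are near-copies of one another, which is what makes simple parameter averaging a reasonable aggregation step.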
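The GDSVM aggregation and weight-update steps at the cross-data-center level can be sketched like this. The accuracy-proportional weighting rule and the minimal `LinearModel` stand-in are illustrative assumptions, not the thesis's exact formulation; the point is that only models and scalar accuracies cross data-center boundaries, never raw data.

```python
import numpy as np

class LinearModel:
    """Minimal stand-in for a data center's local SVM (illustrative)."""
    def __init__(self, w, b):
        self.w, self.b = np.asarray(w, dtype=float), float(b)

    def decision_function(self, X):
        return X @ self.w + self.b

def accuracy_weights(accuracies):
    """Assumed rule: each center's weight is proportional to the local
    classification accuracy the centers exchanged."""
    acc = np.asarray(accuracies, dtype=float)
    return acc / acc.sum()

def global_decision(X, models, weights):
    """Global model = weighted average of local decision values."""
    scores = np.stack([m.decision_function(X) for m in models])
    return (weights[:, None] * scores).sum(axis=0)

# Two data centers with slightly different local models.
centers = [LinearModel([1.0, 0.0], 0.0), LinearModel([0.8, 0.2], 0.1)]
w = accuracy_weights([0.9, 0.6])            # exchanged accuracies -> weights
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
pred = (global_decision(X, centers, w) > 0).astype(int)
```

Re-running `accuracy_weights` after each incremental phase shifts influence toward the centers whose local models currently classify best, which matches the weight-adjustment step described above.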
Keywords/Search Tags:Data Stream, Support Vector Machine, Incremental Learning, Distributed Algorithm, Geo-Distributed Data Center