In this information age, the Internet generates a huge volume of data every day, including text, audio, video, and more. Facing this challenge, the key problem is how to extract useful information from these data with efficient data mining methods. The Support Vector Machine (SVM) is a widely used classic supervised learning method. When combined with a kernel function, SVM can obtain a nonlinear model with better accuracy on nonlinearly separable data. However, its time cost makes it unsuitable for large-scale datasets, and the training procedure is difficult to parallelize with traditional methods. After eight years of evolution, Apache Spark has become one of the most important tools in the field of big data processing. In this paper, we introduce a kernel inner product filtering based method that decomposes the original quadratic programming problem to enable parallel training, yielding the Multiple Submodels Parallel Support Vector Machine (MSP-SVM) on Spark. Experiments show that MSP-SVM can exploit a Spark cluster to speed up training efficiently. Compared with MLlib-SVMWithSGD, MSP-SVM achieves accuracy close to that of LIBSVM with only reasonable overhead.

Currently, the Spark framework is mostly used to process text-format data and is rarely applied to non-text data such as video. With the explosive growth of Internet video content, the demand for analyzing and processing large-scale video data is gradually emerging. In the field of image processing, deep learning has shown unprecedented advantages in pattern recognition and feature extraction. Based on this, this paper presents a large-scale video processing method on Spark. By serializing videos into frames, we introduce OpenCV and CaffeOnSpark to support video processing and image feature extraction. We apply all the methods in this paper to face recognition and expression recognition, and then integrate them as components into our big data analysis platform.
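To make the decomposition idea concrete, the following is a minimal Scala sketch of the generic partition-and-vote scheme that underlies this style of parallel SVM training on Spark. It is not the MSP-SVM algorithm itself: the kernel inner product filtering step is omitted, each partition trains only a simple linear submodel with hinge-loss sub-gradient descent, and the file path, partition count, and hyperparameters are illustrative placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint

object MspSvmSketch extends Serializable {
  // Train one linear submodel on the examples of a single partition using
  // hinge-loss sub-gradient descent (MLlib binary labels are 0/1).
  def trainLocal(points: Array[LabeledPoint], dim: Int,
                 epochs: Int = 20, lr: Double = 0.01, reg: Double = 1e-4): Array[Double] = {
    val w = Array.fill(dim)(0.0)
    for (_ <- 0 until epochs; p <- points) {
      val y = if (p.label > 0.5) 1.0 else -1.0
      val x = p.features.toArray
      val margin = y * (0 until dim).map(i => w(i) * x(i)).sum
      for (i <- 0 until dim) {
        val grad = reg * w(i) - (if (margin < 1.0) y * x(i) else 0.0)
        w(i) -= lr * grad
      }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Intended to be launched with spark-submit; no master is hard-coded here.
    val spark = SparkSession.builder().appName("msp-svm-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder path to a LIBSVM-format training file.
    val data = MLUtils.loadLibSVMFile(sc, "data/train.libsvm").cache()
    val dim = data.first().features.size

    // One submodel per partition, trained in parallel across the cluster.
    val submodels = data.repartition(8)
      .mapPartitions(it => Iterator(trainLocal(it.toArray, dim)))
      .collect()

    // Predict a single example by majority vote over the submodels.
    def predict(x: Array[Double]): Double = {
      val votes = submodels.count { w =>
        (0 until dim).map(i => w(i) * x(i)).sum >= 0.0
      }
      if (votes * 2 >= submodels.length) 1.0 else 0.0
    }

    println(predict(data.first().features.toArray))
    spark.stop()
  }
}
```

The point of the sketch is the data flow: the quadratic programming work is pushed into per-partition submodel training, which Spark schedules in parallel, while the driver only combines the resulting submodels.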
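Likewise, the following sketch illustrates the frame-serialization idea, assuming the OpenCV 3.x Java bindings and native library are available on every node; the paths are placeholders and the CaffeOnSpark feature extraction step is not shown. Frames are read on the driver, encoded as PNG byte arrays, distributed as an RDD, and decoded on the executors for per-frame processing.

```scala
import org.apache.spark.sql.SparkSession
import org.opencv.core.{Core, Mat, MatOfByte}
import org.opencv.imgcodecs.Imgcodecs
import org.opencv.imgproc.Imgproc
import org.opencv.videoio.VideoCapture
import scala.collection.mutable.ArrayBuffer

object VideoFramesSketch {
  // Decode a video file into a sequence of PNG-encoded frames (driver side).
  def readFrames(path: String): Seq[Array[Byte]] = {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
    val cap = new VideoCapture(path)
    val frames = ArrayBuffer[Array[Byte]]()
    val mat = new Mat()
    while (cap.read(mat)) {
      val buf = new MatOfByte()
      Imgcodecs.imencode(".png", mat, buf)
      frames += buf.toArray
    }
    cap.release()
    frames.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("video-frames-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder video path; frames become an RDD of serialized byte arrays.
    val frameRdd = sc.parallelize(readFrames("data/sample.mp4"), numSlices = 8)

    // Example per-frame processing on the executors: decode each frame,
    // convert it to grayscale, and report its dimensions.
    val sizes = frameRdd.mapPartitions { it =>
      System.loadLibrary(Core.NATIVE_LIBRARY_NAME) // load native OpenCV once per task
      it.map { bytes =>
        val img = Imgcodecs.imdecode(new MatOfByte(bytes: _*), Imgcodecs.IMREAD_COLOR)
        val gray = new Mat()
        Imgproc.cvtColor(img, gray, Imgproc.COLOR_BGR2GRAY)
        (gray.rows(), gray.cols())
      }
    }.collect()

    sizes.foreach(println)
    spark.stop()
  }
}
```

In a real pipeline the grayscale step would be replaced by the feature-extraction and recognition stages described above; the serialized frames are what let those stages run as ordinary Spark transformations.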