With the rapid growth of internet technology, massive data streams are being produced at unprecedented speed in many application fields, such as e-commerce, social networks, intelligent transportation, and sensor networks. Compared with traditional static offline data mining, data mining over streams places new demands on the processing model, state management, load balancing, and so on. These requirements include, but are not limited to, the following:

1. Scalability: the algorithm should adapt to fluctuations in data traffic and scale out automatically.
2. State management: stream information requires efficient methods for maintenance and update.
3. Load balance: the algorithm needs a load balancing strategy to keep its performance stable.

In this paper, we study frequent pattern mining over massive data streams, an application field in which existing approaches have several shortcomings. We propose a new algorithm, Balanced Parallel Frequent Pattern Mining over Massive Data Stream (BPFPMS), to address these problems, and we conduct experiments to demonstrate its improvement. Our main contributions are as follows:

1. To address the problem of state management, we propose the DPTS-Tree model to represent a massive data stream, on which maintenance and update can be performed efficiently. Compared with traditional methods, this model makes full use of historical information to reduce the cost of maintenance and update. Experiments show that it has clear advantages in memory usage, state update, and related aspects.

2. To address the load imbalance that often occurs in traditional parallel frequent pattern mining algorithms, we propose a new dynamic load balancing strategy. Compared with traditional static threshold-setting methods, the dynamic strategy performs better over massive data streams because it makes full use of the load information within sub-tasks. It also copes with the dynamic uncertainty brought by real-time changes in the state of the data flow. Experiments show that it achieves load balance to a certain extent and keeps the performance of BPFPMS stable in large-scale data flow scenarios.

We implement BPFPMS on the popular big data processing platform Spark and on Kafka, a distributed messaging system. We evaluate its performance in terms of throughput, speedup, load balancing, and latency. Experiments show that BPFPMS achieves good results in state management and load balancing.
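To make the contrast between static and dynamic load balancing concrete, the following is a minimal sketch of a load-aware assignment step. It is not the strategy used by BPFPMS, whose details are not given here. It assumes that each sub-task reports a measured load for the most recent window, and it reassigns sub-tasks to workers greedily, heaviest first. The names `balance`, `worker_totals`, and `window_loads`, as well as the load values, are hypothetical.

```python
import heapq


def balance(loads, num_workers):
    """Greedy assignment: give the heaviest remaining sub-task to the
    currently least-loaded worker. `loads` maps sub-task id -> measured load."""
    # Min-heap of (accumulated load, worker id).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by sub-task id for determinism.
    for task, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, worker = heapq.heappop(heap)
        assignment[task] = worker
        heapq.heappush(heap, (total + load, worker))
    return assignment


def worker_totals(loads, assignment, num_workers):
    """Sum the load assigned to each worker."""
    totals = [0] * num_workers
    for task, worker in assignment.items():
        totals[worker] += loads[task]
    return totals


# Hypothetical per-sub-task loads observed over the latest window.
window_loads = {"a": 90, "b": 60, "c": 50, "d": 40, "e": 30, "f": 10}
assignment = balance(window_loads, 3)
totals = worker_totals(window_loads, assignment, 3)
print(totals)  # [100, 90, 90]
```

In a streaming setting, a step like this would be re-run whenever the measured loads drift, whereas a static scheme fixes the partition (or its threshold) once and cannot react to changes in the stream.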