A Study Of Distributed Sequential Pattern Mining On Massive Traffic Data Streams

Posted on: 2012-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: S H Zheng
Full Text: PDF
GTID: 2248330395962373
Subject: Computer software and theory
Abstract/Summary:
In recent years, with rapid economic development and rising living standards in China, the number of motor vehicles, one of the most important means of transportation, has kept growing rapidly in urban areas. At the same time, vehicle-related criminal activity has increased, including license plate fraud, motorcades, vehicle theft and vehicle escapes. Because motor vehicles are highly mobile and easy to conceal, public security and traffic police departments find it difficult to identify and track suspect vehicles. With the spread of urban roadway traffic monitoring technology, many monitoring sites have been deployed to collect vehicle passing records. The accumulated traffic data is used for vehicle monitoring, tracking and prediction in criminal cases. Sequential pattern mining, an important research topic in data mining, can be used to mine valuable patterns from time-related traffic data streams and so support better decisions and services for the public and the departments concerned. However, given the massive volume and dynamically changing nature of traffic data, traditional sequential pattern mining algorithms can no longer meet the requirements of fast search and identification. The emergence of distributed computing platforms addresses the storage and computation bottlenecks of massive data processing, which makes sequential pattern mining over huge amounts of traffic data possible. As an easier, faster and more effective distributed platform, Hadoop uses the distributed file system HDFS to provide large-file storage and fault tolerance, and uses the MapReduce programming model for computation.
Because traditional sequential mining algorithms are suited only to analyzing and mining centralized data, designing effective mining algorithms adapted to a distributed computing platform becomes the key problem in mining large datasets. Studying every step of the sequential mining algorithm in light of the platform's characteristics, and increasing the algorithm's parallelism as much as possible, helps improve sequential mining efficiency on massive data streams. The paper integrated distributed sequential pattern mining with traffic data stream applications. By drawing on the storage and computation characteristics of the Hadoop distributed platform, it overcame the shortcomings of traditional sequential mining on massive traffic data streams. First, the paper analyzed the HDFS storage system and its read and write processes in detail. On this basis, it implemented data pre-processing for traffic data streams. Through data cleaning, transformation and reduction, it transferred the massive traffic data streams from a traditional relational database to HDFS and converted them into an efficient data format for the subsequent implementation of distributed sequential pattern mining. Then, the paper described the operating mechanism of MapReduce, gave new definitions of sequential pattern mining on traffic data streams and, based on the Hadoop architecture, designed a distributed sequential mining algorithm for traffic data streams. It presented the general idea and detailed implementation of the algorithm, together with its strengths and weaknesses. In addition, to address the limitations of that algorithm's mining results, the paper parallelized the BIDE algorithm and ported it to the Hadoop platform to meet more general and complete requirements. Finally, a Hadoop cluster experimental environment was built.
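The pre-processing and MapReduce-style mining steps described above can be illustrated with a minimal sketch. It is not the algorithm designed in the paper and it is not BIDE. It assumes a hypothetical record layout of (plate, monitoring site, timestamp), simulates the map and reduce phases in plain Python rather than on Hadoop, and counts only fixed-length subsequences. All names and sample values are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical passing records: (plate, monitoring_site, timestamp_seconds).
RECORDS = [
    ("A1", "S1", 10), ("A1", "S2", 70), ("A1", "S3", 130),
    ("B2", "S1", 15), ("B2", "S2", 80), ("B2", "S3", 140),
    ("C3", "S2", 20), ("C3", "S1", 90),
]


def build_sequences(records):
    """Pre-processing: group records by plate and order each group by time."""
    by_plate = defaultdict(list)
    for plate, site, ts in records:
        by_plate[plate].append((ts, site))
    return {plate: [site for _, site in sorted(visits)]
            for plate, visits in by_plate.items()}


def map_phase(sequence, length):
    """Map: emit (pattern, 1) once per distinct ordered subsequence.

    combinations() preserves input order, so each tuple is a (possibly
    non-contiguous) subsequence of the vehicle's site sequence.
    """
    return [(pattern, 1) for pattern in set(combinations(sequence, length))]


def reduce_phase(pairs, min_support):
    """Reduce: sum the counts per pattern and keep the frequent ones."""
    counts = defaultdict(int)
    for pattern, one in pairs:
        counts[pattern] += one
    return {p: c for p, c in counts.items() if c >= min_support}


def mine(records, length=2, min_support=2):
    """Run pre-processing, then the simulated map and reduce phases."""
    emitted = []
    for sequence in build_sequences(records).values():
        emitted.extend(map_phase(sequence, length))
    return reduce_phase(emitted, min_support)


if __name__ == "__main__":
    for pattern, support in sorted(mine(RECORDS).items()):
        print(pattern, support)
```

Enumerating every subsequence grows combinatorially with sequence length, so this sketch is only practical for short patterns. That cost is one reason dedicated algorithms such as BIDE exist.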
By successfully applying the algorithms to the identification of motorcades in massive traffic data streams, the paper not only verified the validity of the algorithms in theory but also demonstrated their practical value on the test platform. In summary, the distributed sequential mining algorithms proposed in this paper are feasible and meaningful. Their flexibility and extensibility, reflected in how well they adapt to the Hadoop distributed computing platform, show that practical applications need a distributed model to solve massive data mining problems. Furthermore, the results offer useful guidance for future work on applying other sequential mining algorithms to the Hadoop distributed platform.
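As a rough illustration of the motorcade identification task, the sketch below flags pairs of vehicles that pass several monitoring sites close together in time. The abstract does not give the paper's definition of a motorcade, so the criterion used here (a time gap threshold and a minimum number of shared sites) is an assumption, as are the record layout, the parameter names and the sample data. The map and reduce phases are simulated in plain Python.

```python
from collections import defaultdict

# Hypothetical passing records: (plate, monitoring_site, timestamp_seconds).
RECORDS = [
    ("A1", "S1", 10), ("B2", "S1", 15), ("C3", "S1", 300),
    ("A1", "S2", 70), ("B2", "S2", 80), ("C3", "S2", 20),
    ("A1", "S3", 130), ("B2", "S3", 140),
]


def map_site(site, visits, max_gap):
    """Map: for one site, emit (plate pair, site) for vehicles passing close together."""
    visits = sorted(visits)  # (timestamp, plate), ordered by time
    emitted = []
    for i, (t1, p1) in enumerate(visits):
        for t2, p2 in visits[i + 1:]:
            if t2 - t1 > max_gap:
                break  # later visits are even further away in time
            if p1 != p2:
                emitted.append((tuple(sorted((p1, p2))), site))
    return emitted


def reduce_pairs(emitted, min_sites):
    """Reduce: collect the distinct sites per pair and keep pairs seen at enough sites."""
    sites_by_pair = defaultdict(set)
    for pair, site in emitted:
        sites_by_pair[pair].add(site)
    return {pair: sorted(sites)
            for pair, sites in sites_by_pair.items()
            if len(sites) >= min_sites}


def find_motorcade_candidates(records, max_gap=30, min_sites=3):
    """Group records by site, then run the simulated map and reduce phases."""
    by_site = defaultdict(list)
    for plate, site, ts in records:
        by_site[site].append((ts, plate))
    emitted = []
    for site, visits in by_site.items():
        emitted.extend(map_site(site, visits, max_gap))
    return reduce_pairs(emitted, min_sites)


if __name__ == "__main__":
    print(find_motorcade_candidates(RECORDS))
```

Grouping by site before pairing keeps each map task independent of the others, which is the property that lets this kind of computation be spread across a cluster.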
Keywords/Search Tags: Massive traffic data streams, Sequential pattern mining, Distributed computing, Hadoop, Motorcades