Public transit operations generate massive and diverse bus IC card data. A key task for intelligent transportation is to provide fast and accurate analysis of passenger flow at bus stations and of bus travel speed between adjacent stations. Earlier studies of passenger flow were limited to simple statistics, or predicted transfer passenger flow at stations with methods such as station attraction; because they lacked a large body of travel records as data support, their accuracy was low. Research on travel time and speed between bus stations has mainly focused on forecasting from small samples, and the parallel algorithms used do not scale as the data volume grows.

Building on related work and a summary of existing research results, this paper cleans massive bus IC card swipe data. On that basis, and according to the spatio-temporal characteristics of the data, we analyze passenger flow at bus stations and bus speed between stations, and implement the proposed calculation and analysis methods on Hadoop MapReduce. The specific research work is as follows:

(1) Bus data cleaning. A time-based clustering analysis method is proposed for raw bus IC card data, which carry both temporal and spatial attributes. The valid time range of the data is determined according to a time consistency principle. A rule-based filtering strategy, informed by the traffic operation situation of the city where the data were collected (for example, Beijing), is then used to correct or remove abnormal records, providing data support for the subsequent analysis.

(2) Passenger flow at bus stations. This paper mainly analyzes boarding and alighting passenger flow and transfer passenger flow at bus stations. For boarding and alighting flow, we propose a time clustering method for bus card swipes in a big data environment. By clustering the swipe records of each trip, we determine the time to which each boarding or alighting swipe is attributed. Using the cleaned data, two rounds of computation yield the boarding and alighting swipe counts at all stations in different time periods. For transfer passenger flow, spatial and other constraints are used to determine whether a transfer has occurred, which gives the transfer passenger flow at each station in different time periods.

(3) Traffic capacity between adjacent stations. To reflect the traffic capacity between adjacent stations, the paper analyzes bus speed between adjacent stations and proposes a method for calculating and analyzing bus arrival and departure times at stations in a big data environment. Using the cleaned bus data and this method, we calculate the bus travel time over each adjacent-station interval and the inter-station bus speed. Two rounds of computation yield the bus speeds for all pairs of adjacent stations, in each direction and in different time periods.

An experimental environment is set up on the Hadoop platform: HDFS stores the large files, and the MapReduce programming model is used to collect and process the large-scale data. Extensive experiments verify the feasibility, accuracy, and scalability of the above calculation and analysis methods in a big data environment.
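
The time-based clustering of card swipes and the inter-station speed calculation summarized above can be illustrated with a minimal single-machine sketch. It is not the paper's MapReduce implementation. The function names, the 60-second gap threshold, the sample timestamps, the 1.05 km distance, and the use of the last and first swipe of a cluster as stand-ins for departure and arrival times are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def cluster_by_time_gap(timestamps, max_gap=timedelta(seconds=60)):
    """Group swipe times of one bus trip into clusters (hypothetical helper).

    A new cluster starts whenever the gap to the previous swipe exceeds
    max_gap. The 60 s default is an assumed threshold, not the paper's.
    """
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # same stop: close in time to previous swipe
        else:
            clusters.append([t])     # large gap: treat as a new stop
    return clusters


def inter_station_speed_kmh(depart_prev, arrive_next, distance_km):
    """Average speed between two adjacent stations, or None if times are invalid."""
    hours = (arrive_next - depart_prev).total_seconds() / 3600.0
    if hours <= 0:
        return None                  # inconsistent timestamps: no speed
    return distance_km / hours


# Illustrative swipe times for one trip (made-up data).
day = datetime(2024, 1, 1)
swipes = [
    day.replace(hour=8, minute=0, second=5),
    day.replace(hour=8, minute=0, second=20),
    day.replace(hour=8, minute=0, second=40),
    day.replace(hour=8, minute=4, second=10),
    day.replace(hour=8, minute=4, second=30),
]

clusters = cluster_by_time_gap(swipes)
boardings_per_stop = [len(c) for c in clusters]

# Assumption: departure is approximated by the last swipe at the first stop,
# arrival by the first swipe at the next stop; 1.05 km is an assumed distance.
speed = inter_station_speed_kmh(clusters[0][-1], clusters[1][0], distance_km=1.05)

print(boardings_per_stop, speed)
```

In a MapReduce setting, the grouping key (for example, bus and trip) would plausibly be the map output key, with the clustering done per key in the reducer; that mapping is likewise an assumption rather than something stated in the abstract.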