
Research And Implementation Of Data Cleaning Method For Traffic Card

Posted on: 2021-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: X S Zhang
Full Text: PDF
GTID: 2392330611967492
Subject: Control engineering
Abstract/Summary:
With the concept of "green travel, low carbon and environmental protection" gaining popularity, the public transportation industry developing vigorously, and the national traffic card being rapidly promoted, the number of traffic cards in circulation keeps increasing. City traffic cards have by now generated massive amounts of data. Aggregating these scattered data, mining them for valuable information, and using that information to solve bottleneck problems in public transportation is where the technology becomes significant for the field. Owing to a variety of complex factors, however, the collected traffic card data inevitably has data quality problems. If these problems are not given enough attention, they will have an inestimable impact on later data analysis and data mining, greatly reducing the reliability of the data. It is therefore of great significance to study data cleaning methods for traffic card data.

Traffic card data usually suffers from quality problems such as missing data, duplicate records, and erroneous data. This thesis focuses on outliers in traffic card data. Because traditional small-scale techniques for cleaning, storing, and analyzing traffic card data are no longer suitable for massive data volumes, the thesis introduces the Spark distributed computing framework, whose in-memory computing can greatly improve the efficiency of data cleaning in a big data environment. Given the large volume, many attributes, and varied data types of traffic card data, the K-means clustering algorithm and the LOF (local outlier factor) algorithm are combined to detect outliers; the combination is called the CLOF algorithm. The LOF computation is expensive, because the data set must be traversed repeatedly to compute the outlier degree of each data object, yet most data objects in traffic card data are not outliers. The thesis therefore first clusters the data set with K-means and then prunes the data points near each cluster center, which are unlikely to be outliers. This removes a part of the data set that contains no outliers before the local outlier factor is computed for the remaining suspected outliers, which greatly reduces computation time. On this basis, the thesis proposes a data cleaning scheme that runs a parallel CLOF algorithm on a Spark distributed cluster.

Experiments show that, under the same experimental conditions, the CLOF algorithm achieves higher detection accuracy and a lower detection error rate than the classic LOF algorithm, and its running time is far shorter. The experiments also verify that the Spark distributed cluster is superior in processing large data sets and that the parallel CLOF algorithm on the Spark distributed cluster has strong scalability.
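The cluster-then-prune idea behind CLOF can be illustrated with a small single-machine sketch. This is not the thesis's implementation and it does not use Spark. The pruning rule (drop points closer to their cluster center than the cluster's mean distance), the choice to search neighbours across the full data set, the synthetic 2-D data, and all function names (`kmeans`, `prune_near_centers`, `lof_scores`, `clof`, `make_demo_data`) are assumptions made for illustration only.

```python
import math
import random
from functools import lru_cache


def kmeans(points, n_clusters, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def prune_near_centers(points, centers, labels):
    """Assumed pruning rule: keep only points farther from their cluster
    center than that cluster's mean distance; return their indices."""
    dist = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    candidates = []
    for c in range(len(centers)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if idx:
            radius = sum(dist[i] for i in idx) / len(idx)
            candidates.extend(i for i in idx if dist[i] > radius)
    return sorted(candidates)


def lof_scores(points, candidates, k=5):
    """Standard LOF, scored only for `candidates`; neighbours are searched
    in the full data set. Ties in the k-distance are ignored for brevity."""
    n = len(points)

    @lru_cache(maxsize=None)
    def neighbours(i):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        return tuple(others[:k])

    def k_distance(i):
        return math.dist(points[i], points[neighbours(i)[-1]])

    @lru_cache(maxsize=None)
    def lrd(i):  # local reachability density
        reach = [max(k_distance(j), math.dist(points[i], points[j]))
                 for j in neighbours(i)]
        return 1.0 / max(sum(reach) / len(reach), 1e-12)

    return {i: sum(lrd(j) for j in neighbours(i)) / (len(neighbours(i)) * lrd(i))
            for i in candidates}


def clof(points, n_clusters, k=5):
    """Cluster, prune points near the centers, then score the rest with LOF."""
    centers, labels = kmeans(points, n_clusters)
    candidates = prune_near_centers(points, centers, labels)
    return candidates, lof_scores(points, candidates, k)


def make_demo_data(seed=42):
    """Two synthetic square clusters plus one planted outlier (last index)."""
    rng = random.Random(seed)
    pts = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
           for cx, cy in [(0, 0), (10, 10)] for _ in range(30)]
    pts.append((5.0, 5.0))
    return pts


points = make_demo_data()
candidates, scores = clof(points, n_clusters=2, k=5)
top = max(scores, key=scores.get)
print(f"scored {len(candidates)} of {len(points)} points; highest LOF at index {top}")
```

Because LOF is computed only for the candidates that survive pruning, the number of scored points drops while points far from every cluster center are still examined. How aggressively to prune is a tuning choice that this sketch fixes arbitrarily.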
Keywords/Search Tags: Traffic card, Data cleaning, Outliers, Spark, LOF