Research On Passenger Transport Data Quality Detection And Missing Data Imputation

Posted on: 2019-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: M J Sun
Full Text: PDF
GTID: 2428330545974862
Subject: Computer application technology
Abstract/Summary:
With the exponential growth of data volumes in the data era, abnormal data and missing data inevitably appear in data management and analysis. To ensure the validity of data mining and knowledge discovery results, abnormal data detection and missing data imputation are especially important. In this thesis, existing algorithms are used to detect abnormal and missing data, and two imputation algorithms are proposed: an improved biclustering-based algorithm, and an algorithm based on non-negative matrix factorization for continuous (sequence) missing data. Passenger transport data is used as the research object for experimental verification. The main work of the thesis is as follows:

(1) Passenger transport data contains anomalies and is incomplete. Hierarchical clustering is used to detect small negative values and discrete missing values, and a sliding-window-based detection algorithm is used to detect consecutive missing data, which provides the data foundation for the subsequent missing-value imputation. Experimental results show that the detection accuracy is 100% for negative values and 89.7% for missing values. The accuracy of the sliding-window-based detection algorithm for continuous missing data is 93.5%.

(2) To address the low imputation accuracy and high computational complexity of the traditional Biclustering-based Missing Data Imputation Algorithm (BMDI), a Biclustering-based Missing Data Imputation Improved Algorithm (BDMDII) is proposed. When searching for clusters, the improved algorithm introduces row and column protection rules so that a cluster does not grow too large and carry redundant information. A filling weight function is formulated to improve accuracy, and a maximum mean squared residue is set to reduce computational complexity. The results show that the improved algorithm is 45.7% more accurate than the original algorithm, and that its running time is reduced by 10%.

(3) The BDMDII algorithm still has low accuracy when imputing long runs of continuous missing data, so Sequence Missing Data Imputation Based on Nonnegative Matrix Factorization (NMF-SMDI) is proposed. Exploiting the time periodicity of passenger transport data, the NMF-SMDI algorithm introduces non-negative matrix factorization to decompose the missing sequences into discrete missing values by time period, and then uses the BDMDII algorithm to fill the missing data. The results show that NMF-SMDI is 18% more accurate than BDMDII when the continuous missing length is fixed and the missing rate is 30%~50%. When the missing rate is the same and the consecutive missing length is greater than 4, NMF-SMDI improves on BDMDII by 24.6%.
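The abstract mentions a maximum mean squared residue but does not define the measure. As a rough illustration only, the sketch below assumes the standard definition from the biclustering literature (Cheng and Church), in which the residue of each cell is its value minus its row mean, minus its column mean, plus the overall mean of the bicluster. The function names, the `max_msr` threshold, and the `counts` matrix are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def mean_squared_residue(matrix: Sequence[Sequence[float]],
                         rows: Sequence[int],
                         cols: Sequence[int]) -> float:
    """Mean squared residue of the bicluster formed by `rows` and `cols`.

    Assumes the Cheng and Church definition; the thesis may use a variant.
    """
    if not rows or not cols:
        raise ValueError("a bicluster needs at least one row and one column")
    n_rows, n_cols = len(rows), len(cols)
    # Row means, column means and overall mean of the selected submatrix.
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / n_rows
    total = 0.0
    for i in rows:
        for j in cols:
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue * residue
    return total / (n_rows * n_cols)


def is_acceptable(matrix: Sequence[Sequence[float]],
                  rows: Sequence[int],
                  cols: Sequence[int],
                  max_msr: float) -> bool:
    """True if the bicluster's residue stays within the chosen threshold."""
    return mean_squared_residue(matrix, rows, cols) <= max_msr


# Hypothetical passenger counts: rows are stations, columns are time periods.
counts = [
    [10, 20, 30],
    [12, 22, 32],  # same pattern as row 0, shifted by a constant
    [50, 10, 90],  # a different pattern
]
coherent = mean_squared_residue(counts, [0, 1], [0, 1, 2])
mixed = mean_squared_residue(counts, [0, 1, 2], [0, 1, 2])
print(coherent, mixed)
```

Under this definition, rows that differ only by a constant offset give a residue of zero, while adding a row with a different pattern raises it. A search that stops growing a cluster once the residue exceeds the threshold therefore keeps clusters coherent and bounds the work done, which is one plausible reading of how the threshold reduces computation.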
Keywords/Search Tags: Anomaly detection, Missing data imputation, Bicluster, Non-negative matrix factorization