
Research And Application Of Data Quality Based On Unsupervised Anomaly Detection Algorithm

Posted on: 2023-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: F F Wei
Full Text: PDF
GTID: 2568306815959299
Subject: Applied Statistics
Abstract/Summary:
Assessing data quality in a reasonable, multi-dimensional way is meaningful and practical work. This thesis uses unsupervised anomaly detection to identify abnormal data, selecting four algorithms for study: density-based clustering (DBSCAN), isolation forest (iForest), the long short-term memory variational autoencoder (LSTM-VAE), and Gaussian mixture model (GMM) clustering. It then builds on the detection results to study data quality further. The main research results are as follows:

(1) To improve the detection accuracy of the isolation forest algorithm, an improved isolation forest with an adaptive threshold is proposed that removes redundant isolation trees. The algorithm arranges the constructed isolation trees in ascending order of AUC value, from left to right. Based on the quantile distribution of the AUC values in the right sub-forest after sorting, an interval set search algorithm is proposed; an optimal search determines the threshold for dividing the forest and yields a better sub-forest.

(2) Four unsupervised anomaly detection algorithms are integrated: density-based clustering, isolation forest, the long short-term memory variational autoencoder, and Gaussian mixture model clustering. After data preprocessing, the fused algorithm is applied to detect abnormal data.

(3) Data quality indicators are established. To address imperfect measurement indicators, insufficiently intuitive expression, and insufficiently reasonable evaluation in data quality research, multiple unsupervised anomaly detection algorithms are integrated to construct eight data quality evaluation indicators, supporting data quality improvement through comprehensive quantitative assessment from multiple perspectives.

(4) Based on the constructed data quality measurement indicators, a risk index value for data quality is constructed to give early warning of poor-quality data. In addition, a field correlation degree across multiple tables is constructed to automatically identify fields that can serve as the primary key for multi-table joint queries.
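
The sub-forest selection in result (1) can be illustrated with a minimal sketch. The abstract does not give the details of the interval set search, so the code below is only one plausible reading: it assumes each tree's AUC is computed on a small labeled validation set, takes the candidate cut points to be a fixed list of quantiles of the AUC-sorted forest, and scores each right sub-forest by the AUC of its averaged anomaly scores. The names `auc`, `select_subforest`, and `quantiles`, and the toy data, are hypothetical and not taken from the thesis.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of (anomaly, normal) pairs ranked correctly.

    Higher score means more anomalous; ties count as half a win.
    Assumes labels contain at least one 1 (anomaly) and one 0 (normal).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_subforest(tree_scores, labels, quantiles=(0.0, 0.25, 0.5, 0.75)):
    """Pick the right sub-forest with the best ensemble AUC.

    tree_scores[t][i] is the anomaly score that tree t gives sample i.
    Returns (best_auc, best_quantile, kept_trees).
    """
    # Arrange trees in ascending order of their individual AUC.
    ranked = sorted(tree_scores, key=lambda s: auc(s, labels))
    best = None
    for q in quantiles:
        # Keep only the trees to the right of the quantile cut point.
        kept = ranked[int(q * len(ranked)):]
        # Ensemble score per sample: mean score over the kept trees.
        ensemble = [mean(col) for col in zip(*kept)]
        candidate = (auc(ensemble, labels), q, kept)
        # Strict comparison: on ties, the earlier (larger) sub-forest wins.
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


# Toy data (made up): the last sample is the only anomaly.
labels = [0, 0, 0, 1]
trees = [
    [0.90, 0.80, 0.70, 0.10],  # poor tree: ranks the anomaly lowest
    [0.85, 0.90, 0.70, 0.20],  # poor tree
    [0.20, 0.30, 0.10, 0.50],  # good tree: ranks the anomaly highest
    [0.30, 0.20, 0.10, 0.60],  # good tree
]
best_auc, best_q, kept = select_subforest(trees, labels)
```

On this toy forest the search discards the two poor trees and keeps the upper half. A real implementation would obtain per-tree scores from a fitted isolation forest and would likely search a finer set of thresholds than the four quantiles assumed here.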
Keywords/Search Tags: Data quality, Isolation forest, Density clustering, Risk indicator value, Correlation