
Outlier Detection And Robust Estimation Of Linear Regression Models

Posted on: 2021-04-06  Degree: Master  Type: Thesis
Country: China  Candidate: Y N Song  Full Text: PDF
GTID: 2370330626961120  Subject: Applied statistics
Abstract/Summary:
Data-driven statistical models often lack stability under the influence of outliers, which makes outlier detection and robust estimation particularly important in model construction. Outliers are generally divided into two types: anomalies in the response Y, usually called vertical outliers, and anomalies in the predictors X, usually called high leverage points. This thesis studies commonly used outlier detection and robust estimation methods, focusing on outlier detection and robust estimation in linear regression models, and also analyzes normality testing for high-dimensional (multivariate) data.

In the first part, we use the hyper-ellipsoid contour of the residual space to improve existing outlier detection methods and to obtain robust regression parameter estimates. First, the marginal-correlation-based High-dimensional Influence Measure (HIM) and the distance-correlation-based outlier discriminant method (HDC) are used to screen outliers preliminarily, dividing the data set into normal and abnormal points. Then, based on the initial normal set, the robust Least Trimmed Squares (LTS) estimator and the hyper-ellipsoid contour of the residual space are used to construct a method for correcting points misjudged into the initial normal set, and the outlier probability of each point in the initial abnormal set is calculated. By further correcting normal points misjudged into the abnormal set, the accuracy of outlier detection is eventually improved. Simulations of three types of abnormal data under two data structures and analyses of several real examples demonstrate the effectiveness of the method, which yields relatively robust regression parameter estimates while detecting outliers.

The complexity of high-dimensional data degrades computational efficiency as the dimension increases. In the second part, we therefore explore several commonly used dimension reduction methods and use them to analyze the detection performance of HIM, HDC, and MIP (a multiple-influential-point detection method) after dimensionality reduction. We find that reducing the dimension of high-dimensional data before performing outlier detection not only improves computational efficiency but also maintains the original detection accuracy. Based on Principal Component Analysis (PCA), we also construct a normality test for high-dimensional (multivariate) data. Since PCA projects high-dimensional data onto a few low-dimensional orthogonal directions with the strongest explanatory power, and since the joint probability density of independent components is the product of their marginal densities, we combine the Jarque-Bera (JB) statistic with sum and maximum aggregation to construct statistics that integrate the skewness and kurtosis information along each principal component direction. Simulations on normal and non-normal data show empirical type-I errors converging to the given significance level and empirical power approaching 1; the resulting test is further verified on two real data sets.

Finally, we extend the improved outlier detection algorithm and the robust estimation ideas of the first part to high dimensions. First, after preliminary removal of the detected outliers, a marginal-correlation-based variable screening method (SIS) selects features for high-dimensional (ultra-high-dimensional) data. Second, after removing redundant variables, the robust high-dimensional sparse estimator Sparse LTS is used to obtain preliminary variable selection results, robust sparse coefficient estimates, and the scale estimate of the corresponding residuals. Then, based on the updated normal set, the marginal-correlation-based SIS screening method and the Smoothly Clipped Absolute Deviation (SCAD) sparse estimator are used to further select variables and estimate the sparse coefficients of the selected model. While obtaining robust residual scale estimates, the residual ellipsoid contour and the error-correction ideas proposed in the first part improve the accuracy of outlier detection in high-dimensional linear regression models; finally, robust sparse regression coefficients are estimated after removing strongly influential points from the data set. By comparing different dimension reduction and variable selection methods, we identify a relatively optimal high-dimensional outlier detection strategy and robust sparse estimation method, whose effectiveness is verified on high-dimensional simulated data and a real example.
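The Least Trimmed Squares step at the heart of the first part can be sketched as follows. This is a minimal numpy illustration of the generic random-start/concentration (C-step) scheme for LTS, not the thesis's implementation; the function name `lts_fit` and all defaults (coverage `h`, numbers of starts and C-steps) are illustrative assumptions.

```python
import numpy as np

def lts_fit(X, y, h=None, n_starts=20, n_csteps=10, seed=0):
    """Least Trimmed Squares: minimize the sum of the h smallest squared
    residuals, via random elemental starts plus concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])          # add intercept column
    if h is None:
        h = (n + p + 2) // 2                       # common default coverage
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        beta, *_ = np.linalg.lstsq(Xa[idx], y[idx], rcond=None)
        for _ in range(n_csteps):                  # concentration (C-) steps
            r2 = (y - Xa @ beta) ** 2
            keep = np.argsort(r2)[:h]              # h points with smallest residuals
            beta, *_ = np.linalg.lstsq(Xa[keep], y[keep], rcond=None)
        obj = np.sort((y - Xa @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    resid = y - Xa @ best_beta                     # residuals flag large-|r| points
    return best_beta, resid
```

Because only the h best-fitting points enter the objective, vertical outliers receive large residuals rather than dragging the fit, which is what makes the residual-based correction of the initial normal/abnormal sets possible.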
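The PCA-plus-Jarque-Bera test of the second part admits a similar sketch: project the data onto its principal component directions, compute the JB statistic along each, and aggregate by sum and by maximum. This is a plain SVD/numpy illustration under the standard JB formula; in practice critical values would be calibrated by simulation under normality, as the thesis describes, and the names `jb_stat` and `pca_jb_test` are ours.

```python
import numpy as np

def jb_stat(x):
    """Jarque-Bera statistic: n/6 * (skew^2 + (kurtosis - 3)^2 / 4)."""
    n = x.size
    z = x - x.mean()
    m2, m3, m4 = np.mean(z**2), np.mean(z**3), np.mean(z**4)
    skew = m3 / m2**1.5
    kurt = m4 / m2**2
    return n / 6.0 * (skew**2 + (kurt - 3.0)**2 / 4.0)

def pca_jb_test(X, k=None):
    """Project X onto its first k principal component directions and
    combine the per-direction JB statistics by sum and by max."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = PC directions
    if k is None:
        k = Vt.shape[0]
    scores = Xc @ Vt[:k].T                  # PC scores, one column per direction
    jbs = np.array([jb_stat(scores[:, j]) for j in range(k)])
    return jbs.sum(), jbs.max()
```

Each per-direction JB statistic is asymptotically chi-squared with 2 degrees of freedom under normality, so both aggregated statistics stay small for normal data and blow up when skewness or excess kurtosis appears along any retained direction.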
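The SIS screening step that opens the third part's pipeline is very short to sketch: rank predictors by absolute marginal correlation with the response and keep the top d. Only the screening is reproduced here; the subsequent Sparse LTS and SCAD estimation steps are not. The default screened-set size n/log(n) is the common SIS choice, an assumption rather than the thesis's setting.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Sure Independence Screening: keep the d predictors with the largest
    absolute marginal correlation with y."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                     # common SIS default size
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize columns
    ys = (y - y.mean()) / y.std()
    corr = np.abs(Xs.T @ ys) / n                   # marginal correlations
    return np.sort(np.argsort(corr)[::-1][:d])     # indices of retained predictors
```

Screening first shrinks an ultra-high-dimensional problem to a size where robust sparse estimators such as Sparse LTS or SCAD-penalized regression are computationally feasible, which is exactly the role it plays in the pipeline above.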
Keywords/Search Tags:Linear regression, Outlier detection, Robust estimation, Data reduction, Variable selection, Normality testing