
The KNN Interpolation Method In The Perspective Of Statistical Distribution Information Based On Panel Data Of Listed Companies

Posted on: 2024-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
GTID: 2530307115979699
Subject: Applied Mathematics
Abstract/Summary:
The completeness of data affects analysis and decision making, yet real-world data are often incomplete. Missing data cause information loss, which reduces the validity and credibility of analytical results and may even lead to misleading conclusions. How to handle missing data so that the data can play their full role is therefore an unavoidable challenge in statistics and economics. Panel data of listed companies usually span multiple time points and multiple variables, so their structure is complex. Most existing interpolation methods fill in missing values one variable at a time, which makes it difficult to account for the interactions among variables. In this thesis, the panel data are therefore treated as points on a statistical manifold, and the geodesic distance on the manifold is used to measure the proximity between sample points. Because analytical expressions for geodesic distances are complex, approximate measures of the geodesic distance are selected from three perspectives, namely the multinomial manifold, parametric hypothesis testing, and information content: the Cosine distance, the Hotelling T² statistic, and the Jensen-Shannon divergence.

Treating a company's financial data as random vectors that follow a certain distribution, three KNN interpolation methods based on statistical distribution information are proposed: (1) KNN interpolation based on the Cosine distance. The distributions of financial data are regarded as points on a multinomial manifold, and the Cosine distance is chosen to measure the similarity between samples. (2) KNN interpolation based on the Hotelling T² statistic. It is assumed that the financial data of companies in the same category follow distributions with the same mean, and the Hotelling T² statistic is chosen to measure the similarity between samples. (3) KNN interpolation based on the Jensen-Shannon divergence. The Jensen-Shannon divergence, an information measure based on the amount of information, is chosen to measure the similarity between samples from the perspective of the marginal distribution of each component of the random vector.

To validate the performance of the KNN interpolation methods under the three metrics, the thesis carries out a simulation study and an empirical analysis. In the simulation study, datasets with different missing rates are generated by simulating random missingness, and the KNN interpolation methods under the three metrics are compared with five classical, mainstream interpolation methods: median interpolation, mean interpolation, missing forest, the bagging method, and traditional KNN interpolation. The experiments show that missing forest interpolation is always optimal. Among the KNN methods under the three metrics, overall, the Cosine distance-based interpolation is slightly less effective than missing forest when the missing rate is low; the Hotelling T² statistic-based interpolation is slightly less effective than missing forest when the missing rate is high; and the Jensen-Shannon divergence-based interpolation is less stable but performs well at high missing rates.

In the empirical analysis, real financial data with missing values are selected for 121 listed companies in the financial sector from the Wind database, covering 15 quarters from the first quarter of 2019 to the third quarter of 2022. The single interpolation methods under the three metrics are compared with the corresponding multiple interpolation methods that combine them with cubic spline interpolation. The experiments show that all three multiple interpolation methods are more accurate than the single interpolation methods under the three metrics, and that among the multiple interpolation methods, the one combining cubic splines with the Jensen-Shannon divergence is the most effective.
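As a rough illustration of the idea summarized above, KNN interpolation with a pluggable similarity measure might be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names (`cosine_distance`, `js_divergence`, `knn_impute`) are hypothetical, neighbours are drawn only from fully observed rows, missing entries are filled with the unweighted mean of the k nearest neighbours, and the Jensen-Shannon divergence is applied to each row's observed values normalized to sum to 1 (a simplification of the marginal-distribution view described in the abstract). The Hotelling T² variant is omitted because it compares groups of samples and needs covariance estimates.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two non-negative vectors.

    Assumption: each vector has a positive sum and is normalized here
    so that it can be read as a discrete distribution.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def knn_impute(X, k=3, dist=cosine_distance):
    """Fill NaNs in X (rows are samples) from the k nearest complete rows.

    Assumptions: at least k rows are fully observed, and every
    incomplete row has at least one observed entry. Distances are
    computed only on the columns observed in the incomplete row.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        d = np.array([dist(row[observed], c[observed]) for c in complete])
        nearest = complete[np.argsort(d)[:k]]
        # Unweighted mean of the neighbours' values in the missing columns.
        filled[i, missing] = nearest[:, missing].mean(axis=0)
    return filled


# Toy data: the last row is missing its third value.
X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 1.0, 0.5],
     [1.0, 2.0, np.nan]]
print(knn_impute(X, k=2, dist=cosine_distance))
print(knn_impute(X, k=2, dist=js_divergence))
```

Both measures compare the shape of a row rather than its scale, so in the toy data the first two rows are equally close to the incomplete row even though one is twice the size of the other. Whether that behaviour is desirable for financial indicators depends on how the variables are scaled beforehand.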
Keywords/Search Tags: missing data, interpolation methods, listed companies, panel data, Cosine distance, Hotelling T² statistic, Jensen-Shannon divergence