
The Online Imputation Method Of Missing Value Based On KNN And Its Application In Credit Evaluation

Posted on: 2021-08-04  Degree: Master  Type: Thesis
Country: China  Candidate: S L Qiu  Full Text: PDF
GTID: 2518306122469734  Subject: Management Science and Engineering
Abstract/Summary:
With the development of cloud computing, the Internet of Things and other technologies, China's credit business has developed rapidly. How to assess a customer's credit risk comprehensively and accurately, and how to develop personalized credit financial services, is not only the core of risk control for traditional financial institutions such as commercial banks and microfinance companies, but also a difficulty for emerging Internet businesses. The non-performing loan ratio has forced financial institutions to continuously improve their risk management. In the past, most credit evaluation research focused on the design of credit evaluation models while neglecting the preprocessing of incomplete data sets. In fact, as the theory and practice of customer credit evaluation have deepened, researchers have found that customer data often contains many missing values, which to a large extent affects the effectiveness of customer credit evaluation. In addition, much of today's credit business data is streaming data that arrives dynamically and can be accessed only once. How to use limited computing resources to quickly process missing data in a real-time data stream is therefore a topic worth studying.
The K-nearest neighbor (KNN) algorithm is suited to online use because it can make full use of the information carried by newly arrived data, but with a large amount of data its neighbor search has high time complexity. Moreover, Euclidean distance is usually applied to measure the similarity between samples. It treats all attributes as equally important, so irrelevant attributes can easily mislead the imputation. With these two considerations in mind, and in order to provide a new way of filling in missing values in credit evaluation data, this paper proposes a K-nearest neighbor imputation algorithm that combines the maximal information coefficient with an online hierarchical clustering tree. The framework of this method consists of two key components: the construction of a hierarchical clustering tree and the application of K-nearest neighbor imputation. Both depend on the similarity measure between samples. This paper uses an approximate Euclidean distance: the minimum bounding rectangle represents the spatial range of all data points contained in an internal node of the clustering tree, which keeps data processing efficient while preserving a certain level of accuracy. In addition, streaming data points are received and processed synchronously, and the imputation results are fed back in real time. It is not necessary to wait for the data in the online stream to be stored in memory before imputing the missing values. During imputation, the nonlinear correlation between attributes is also taken into account: the maximal information coefficient measures the relative weight between attributes, so that the information in data highly correlated with the value to be estimated is fully exploited. This reduces the impact of irrelevant data on the imputation and improves its accuracy.
To verify the accuracy, efficiency and dynamic behavior of the proposed imputation method on customer credit data with missing values, this paper conducted comparative experiments on four credit data sets. The experiments show that the method outperforms many traditional imputation methods in accuracy and efficiency, especially in scenarios where the proportion of missing data is high and the relationship between variables may change over time.
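The two similarity ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`weighted_distance`, `mbr_min_distance`, `knn_impute`), the use of `None` to mark a missing value, the averaging of neighbor values, and the toy data are all assumptions made for illustration. The attribute weights are passed in as plain numbers. In the method described above they would be derived from the maximal information coefficient, which this sketch does not compute, and the clustering tree itself is omitted (only the bounding-rectangle distance that such a tree could use for pruning is shown).

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over attributes observed in both samples.

    Missing values are marked with None (an assumption of this sketch)
    and are simply skipped.
    """
    total = 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None:
            continue
        total += w * (x - y) ** 2
    return math.sqrt(total)


def mbr_min_distance(point, lower, upper, weights):
    """Smallest possible weighted distance from `point` to any sample
    inside the axis-aligned rectangle [lower, upper].

    A tree node whose rectangle is farther away than the current k-th
    neighbor cannot contain a closer sample, so it could be skipped.
    """
    total = 0.0
    for x, lo, hi, w in zip(point, lower, upper, weights):
        if x is None:
            continue
        gap = max(lo - x, 0.0, x - hi)  # 0 when x lies inside [lo, hi]
        total += w * gap ** 2
    return math.sqrt(total)


def knn_impute(query, complete_rows, target, weights, k=3):
    """Estimate query[target] as the mean of that attribute over the
    k complete rows nearest to the query (brute-force search)."""
    ranked = sorted(
        complete_rows, key=lambda r: weighted_distance(query, r, weights)
    )
    neighbours = ranked[:k]
    return sum(r[target] for r in neighbours) / len(neighbours)


# Toy data: attribute 1 is unrelated to attribute 2, the one to impute.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 9.0, 50.0],
    [0.5, 50.0, 30.0],
]
query = [1.0, 40.0, None]

# Down-weighting the irrelevant attribute picks rows 0 and 1.
print(knn_impute(query, rows, target=2, weights=[1.0, 0.0, 0.5], k=2))  # 11.0
# Equal weights let the irrelevant attribute pull in rows 3 and 2.
print(knn_impute(query, rows, target=2, weights=[1.0, 1.0, 1.0], k=2))  # 40.0
```

With equal weights the second attribute dominates the distance and the estimate moves from 11.0 to 40.0, which is the kind of distortion by irrelevant attributes that the attribute weighting is meant to reduce.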
Keywords/Search Tags: KNN algorithm, Hierarchical clustering, Maximal information coefficient, Missing data imputation