
Research On KNN Algorithm Of Heterogeneous Data Under Non-independent And Identical Distribution

Posted on: 2022-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: C H Sun
GTID: 2518306323960439
Subject: Software engineering
Abstract/Summary:
Data mining, as the name implies, extracts information that is useful to enterprises or individuals from massive data. Because data mining can create substantial benefits for enterprises, related algorithms and technologies have emerged continuously and are widely used in many fields, making data mining an indispensable driving force in the development of big data. Classification is one of the classic tasks in data mining, and the KNN algorithm is widely used for it because its principle is simple and it is easy to understand and implement. KNN also has shortcomings: unbalanced training samples and redundant features, for example, degrade its classification results, and many scholars have proposed improvements. However, the traditional KNN algorithm and many of its improved variants assume independent and identically distributed (IID) data, whereas most real-world data is non-IID: there are interconnections between data objects, between the attributes of data objects, and between attribute values. Ignoring these interconnections discards important information and leads to inaccurate classification results. This thesis therefore improves the KNN algorithm based on the idea of non-independent and identical distribution. The main research work comprises the following three points.

First, for numerical data, a CFW-KNN algorithm based on class membership and feature weights is proposed to address two causes of inaccurate classification in the traditional KNN algorithm: unbalanced training samples and susceptibility to a single attribute. The algorithm uses data density to determine the center and radius of a minimum bounding ball, determines the class membership of each training sample from its location relative to that ball, and computes feature weights following the idea of the ReliefF algorithm. Finally, it updates the category decision rule with the class memberships and feature weights of the training samples to determine the category of the sample to be classified. The experimental results show that the CFW-KNN algorithm improves classification accuracy.

Second, for numerical data, the idea of non-independent and identical distribution is applied to the improved CFW-KNN algorithm, and a NIID-CFW-KNN algorithm is proposed that mines the various implicit relationships in the data set. The algorithm first uses an improved Pearson correlation coefficient formula to integrate the coupling similarity matrix of the data objects, transforms the original data set into a new data set that carries the coupling relationships, and then applies the CFW-KNN algorithm to the new data set for classification. The experimental results show that the NIID-CFW-KNN algorithm further improves classification accuracy.

Third, for heterogeneous data, the NIID-CFW-KNN algorithm is extended by using the idea of non-independent and identical distribution to analyze the global coupling relationship between categorical data and numerical data, and a NIID-MCFW-KNN algorithm for heterogeneous data is proposed. The algorithm mines the coupling relationships within the categorical data, within the numerical data, and between the categorical and numerical data, and then applies the CFW-KNN algorithm to the new data set that carries these coupling relationships. The experimental results show that the NIID-MCFW-KNN algorithm classifies heterogeneous data well.
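The abstract does not give the CFW-KNN decision rule itself, so the following Python sketch is only an illustration of the general idea of combining feature weights with class membership in a KNN vote. It assumes a feature-weighted Euclidean distance and a vote in which each neighbor contributes its membership value. The function name, the toy data, and the membership and weight values are all hypothetical, and the thesis's own steps for obtaining them (the minimum bounding ball and the ReliefF-style weighting) are not reproduced here.

```python
import math
from collections import defaultdict


def weighted_membership_knn(train_x, train_y, query, k, feature_w, membership):
    """Illustrative KNN vote using feature weights and class membership.

    feature_w  : one non-negative weight per feature (assumed given)
    membership : one value in [0, 1] per training sample (assumed given)
    """
    # Feature-weighted Euclidean distance from the query to each training sample.
    dists = []
    for i, row in enumerate(train_x):
        d = math.sqrt(sum(w * (a - b) ** 2
                          for w, a, b in zip(feature_w, row, query)))
        dists.append((d, i))
    dists.sort()  # ascending by distance, ties broken by sample index

    # Each of the k nearest neighbors votes with strength equal to its membership.
    votes = defaultdict(float)
    for _, i in dists[:k]:
        votes[train_y[i]] += membership[i]
    return max(votes, key=votes.get)


# Toy, made-up data: class 0 has three samples, class 1 has two.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
train_y = [0, 0, 0, 1, 1]
query = [2.4, 2.0]

# Uniform weights and memberships reduce to ordinary majority-vote KNN.
plain = weighted_membership_knn(train_x, train_y, query, 3,
                                [1.0, 1.0], [1.0] * 5)

# Hypothetical memberships that down-weight the majority-class samples.
weighted = weighted_membership_knn(train_x, train_y, query, 3,
                                   [1.0, 1.0], [0.3, 0.3, 0.3, 1.0, 1.0])
```

With uniform memberships the two class-0 neighbors outvote the single class-1 neighbor. With the hypothetical lower memberships for class 0, the class-1 neighbor carries the vote, which is the kind of correction for unbalanced training samples that the abstract describes.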
Keywords/Search Tags:Non-independent and identically distributed, KNN algorithm, Class membership, Feature weight, Heterogeneous data