
Research And Application On Three-Decision KNN Algorithm Based On Incremental Learning

Posted on: 2019-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Cao
GTID: 2428330566467901
Subject: Computer technology
Abstract/Summary:
In recent years the Internet has grown rapidly, and the volume of online text data has increased significantly. Automatic and effective categorization of these large, unstructured text resources is therefore necessary. K Nearest Neighbor (KNN) is a simple, non-parametric classification method that performs well, so it is widely used in text classification, including for more difficult problems. Based on a thorough study of the traditional K-nearest-neighbor algorithm and its recent improved variants, this thesis proposes a new model that addresses the problems of the existing ones. The main research contents of the thesis are:

(1) The K-nearest-neighbor algorithm has two shortcomings: computing sample similarity is very expensive, and samples whose class cannot be determined receive no special handling. To address them, a three-decision KNN algorithm based on incremental learning is proposed. First, a large training data set is divided into data chunks. Then a classification model based on incremental clustering is constructed: each time, the data points in one data chunk are clustered, and the current clusters are then input together with the next data chunk. Because the clusters retain what was learned from earlier data chunks, only the newly input data chunk needs to be clustered. On this basis, the KNN classifier is trained based on three-decision, and boundary-region samples are processed in a special way to improve accuracy when processing large amounts of text. Finally, data are processed with this classification model. Experimental results show that the three-decision KNN algorithm based on incremental learning proposed in this thesis outperforms the traditional K-nearest-neighbor algorithm and two other improved K-nearest-neighbor algorithms.

(2) Because of time and memory constraints, the K-nearest-neighbor algorithm is not widely used in the field of big data. To solve this problem, an optimization of the KNN classifier based on Spark is proposed. In the MapReduce process, the Map phase splits the training data into chunks and, for each chunk, computes the distances and corresponding categories of the k nearest neighbors of each test sample. The Reduce phase aggregates the k-nearest-neighbor distances from each Map and determines the final k nearest neighbors. When the test set is very large, the test data set is split as well and the MapReduce process defined above is executed over multiple iterations; the whole procedure is implemented on Spark. The experimental results show that the Spark-based optimization of the KNN classifier improves performance and reduces time consumption.

(3) Based on the above work, the improved K-nearest-neighbor algorithm is applied to intrusion detection to actively discover attack events. Because the improved algorithm proposed in this thesis is incremental, it is better suited to applications involving massive data. Experiments on the KDD CUP99 data, together with the results obtained above, indicate that this method is feasible.
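
The chunked neighbor search in (2) and the three-decision step in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It runs in plain Python rather than on Spark, it omits the incremental clustering stage, and it assumes that "three-decision" means the usual accept / reject / defer scheme applied to the share of neighbors in a target class. The function names, the thresholds `alpha` and `beta`, and the toy data are all hypothetical.

```python
import heapq
import math


def local_knn(chunk, query, k):
    """Map step: k nearest (distance, label) pairs inside one training chunk."""
    scored = ((math.dist(point, query), label) for point, label in chunk)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def merge_knn(partials, k):
    """Reduce step: merge the per-chunk candidates into the global k nearest."""
    candidates = (pair for part in partials for pair in part)
    return heapq.nsmallest(k, candidates, key=lambda pair: pair[0])


def three_way_decide(neighbors, target, alpha=0.7, beta=0.3):
    """Accept, reject, or defer based on the share of neighbors in `target`.

    alpha and beta are assumed thresholds; the abstract does not give values.
    """
    share = sum(1 for _, label in neighbors if label == target) / len(neighbors)
    if share >= alpha:
        return "positive"   # confidently in the target class
    if share <= beta:
        return "negative"   # confidently not in the target class
    return "boundary"       # deferred for special handling


def classify(chunks, query, k, target):
    """Run the map step on every chunk, reduce, then apply the decision."""
    partials = [local_knn(chunk, query, k) for chunk in chunks]
    return three_way_decide(merge_knn(partials, k), target)


# Hypothetical toy data: two training chunks of (point, label) pairs.
chunks = [
    [((0.0, 0.0), "normal"), ((0.1, 0.0), "normal")],
    [((1.0, 1.0), "attack"), ((0.9, 1.0), "attack"), ((0.2, 0.1), "normal")],
]

print(classify(chunks, (0.0, 0.1), k=3, target="attack"))  # negative
print(classify(chunks, (0.6, 0.6), k=3, target="attack"))  # boundary
print(classify(chunks, (1.0, 0.9), k=2, target="attack"))  # positive
```

Merging per-chunk candidates is safe because each of the global k nearest neighbors is also among the k nearest within its own chunk, so the reduce step never misses one. The abstract does not say how boundary-region samples are handled afterwards, so the sketch only flags them.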
Keywords/Search Tags:K Nearest Neighbor, Incremental Clustering, Three-Decision, Spark Parallel Computing, Intrusion Detection