The Performance Optimization And Application For The Classifier On Classification Noise Detection

Posted on:2019-11-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Duan

GTID:2428330590465773

Subject:Computer technology

Abstract/Summary:

With the continuous development of Internet technology, daily life has become increasingly information-driven, and large amounts of data are being produced. How to extract useful information from massive data efficiently and reasonably has become an unavoidable problem and a major challenge for data mining. Current data mining algorithms, backed by high-performance computers and parallel computing, can process large amounts of data quickly. However, when a fixed algorithm is applied to ever-changing data sets, differences in the data make its results unstable, which in turn degrades its performance. To address this problem, this thesis proposes a noise detection algorithm based on a completely random forest, which filters out classification noise data to improve the accuracy of the classifier.

To improve classifier performance, the usual approach is to improve the classification algorithm itself. But when the data contain a large amount of classification noise, even an excellent classifier is affected and makes wrong predictions. There is currently little research on classification noise data: traditional data preprocessing methods generally only eliminate outliers and do not give a clear definition of classification noise data, and existing classifiers seldom take into account the effect of such noise on their performance. In this thesis, the main characteristics of classification noise data are analyzed based on the decision tree building process, and a classification noise detection algorithm based on a completely random forest is proposed. Experiments show that the algorithm improves the performance of most classifiers. The classification algorithms considered include E-kNNs (Exact k-Nearest Neighbor Algorithms), BPNN (Back Propagation Neural Networks), SVM (Support Vector Machine), K-means tree (k-means priority search trees), LR (Logistic Regression), and DT (Decision Tree).

The noise detection algorithm based on a completely random forest consists of two steps. The first step builds multiple decision trees by randomly selecting feature attributes; together these trees make up a forest. The second step traverses every decision tree to identify the classification noise data. Whether a sample is noise depends on the proportion of trees in the forest that judge it to be noise; that is, the whole forest decides by voting. The algorithm hinges on how to set the noise intensity threshold (NI_threshold) used to judge whether a sample is noise. The existing adaptive parameter optimization strategy can reach the best classification accuracy, but its time overhead is very large, so a major problem is how to optimize the algorithm further.

Because the traditional k-means algorithm selects its initial centers at random, its clustering results are often unstable. This thesis therefore proposes a method for initializing the center points based on maximum density and largest distance. The method relies on two characteristics of cluster centers: they are far away from each other, and each lies at the center of its own class. First, a set of samples with higher density is selected as the candidate set for the initial cluster centers; then the K points (K is the number of clusters) that are farthest apart are selected from the candidate set as the initial centers. The experimental results show that the method obtains more stable clustering results and reduces the number of center-point updates during clustering, which improves clustering efficiency.
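The voting step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how a single tree decides that a sample is noise, so the per-tree decisions are taken here as a given input (`tree_flags`, a hypothetical name), and the comparison `ratio >= ni_threshold` is an assumed convention.

```python
from typing import List, Sequence


def forest_noise_vote(
    tree_flags: Sequence[Sequence[bool]], ni_threshold: float
) -> List[bool]:
    """Combine per-tree noise decisions by voting.

    tree_flags[t][i] is True when tree t judges sample i to be noise.
    How each tree reaches that judgment is NOT specified in the abstract;
    the flags are treated as an already-computed input (assumption).
    """
    if not tree_flags:
        raise ValueError("need at least one tree")
    n_trees = len(tree_flags)
    n_samples = len(tree_flags[0])
    is_noise = []
    for i in range(n_samples):
        votes = sum(1 for t in range(n_trees) if tree_flags[t][i])
        ratio = votes / n_trees  # share of trees that call sample i noise
        # Assumed convention: a sample is noise when the ratio reaches
        # the noise intensity threshold (NI_threshold).
        is_noise.append(ratio >= ni_threshold)
    return is_noise


# Four trees voting on three samples (made-up flags for illustration).
flags = [
    [True, False, False],
    [True, False, True],
    [True, False, False],
    [False, False, True],
]
print(forest_noise_vote(flags, ni_threshold=0.5))
```

Raising `ni_threshold` makes the filter more conservative, since more trees must agree before a sample is discarded. That trade-off is why the choice of threshold matters so much to the algorithm.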
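The two-stage center initialization can be illustrated with a minimal sketch. Several details are assumptions, because the abstract does not define them: density is taken as the number of neighbours within a fixed `radius`, the candidate set is the densest `top_fraction` of the samples, and the "farthest apart" selection is done greedily, starting from the densest candidate. The function name and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def density_distance_init(
    points: Sequence[Point], k: int, radius: float, top_fraction: float = 0.5
) -> List[Point]:
    """Pick k initial centers: dense candidates first, then far-apart ones."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")

    # Stage 1 (assumed density measure): count neighbours within `radius`.
    density = [
        sum(1 for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius)
        for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: density[i], reverse=True)
    n_candidates = max(k, math.ceil(top_fraction * n))
    candidates = order[:n_candidates]

    # Stage 2 (assumed greedy rule): start from the densest candidate, then
    # repeatedly add the candidate farthest from the centers chosen so far.
    chosen = [candidates[0]]
    while len(chosen) < k:
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: min(math.dist(points[c], points[s]) for s in chosen),
        )
        chosen.append(best)
    return [points[i] for i in chosen]


# Two tight groups plus one isolated point (made-up data for illustration).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
centers = density_distance_init(pts, k=2, radius=1.5)
print(centers)
```

Restricting the distance-based selection to dense candidates keeps an isolated point such as `(50, 50)` from being chosen as a center, which a purely distance-based rule would tend to do.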

Keywords/Search Tags:

detection of classification noise data, completely random forest, initial centers in k-means clustering, data mining

Related items

1	The Selection And Improvement Of K-means’s Initial Clustering Centers
2	Research And Application Of K-means Clustering Algorithm
3	Research On The Selection Of Initial Cluster Centers In K-means Algorithm
4	Research On Advertisement Recommendation System Based On Data Mining
5	Improved K-means Algorithm Based On Optimizing Initial Cluster Centers
6	Research On Clustering Methods For High Dimensional Data And Their Application
7	Research On Initial Cluster Centers Choice Algorithm And Clustering For Imbalanced Data
8	PSO-based Spatial Data Clustering Model And Its Application
9	Improvements And Implementation Of K-means Clustering Algorithm
10	The Research Of The K-means Clustering Algorithm Based On Nearest Neighbors