A Large-scale Experimental Evaluation And Analysis Of Data Classification Algorithms

Posted on: 2017-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: C C Liu
Full Text: PDF
GTID: 2348330488451185
Subject: Computer system architecture
Abstract/Summary:
As data classification techniques have developed, researchers have proposed more and more algorithms for multiclass classification problems. These algorithms have different characteristics, and each has its own strengths and weaknesses, so choosing one or several appropriate algorithms for a given classification problem often takes a great deal of time. To save that time, practitioners tend to fall back on classical or widely used algorithms, which may not be the most appropriate for their datasets, and newer algorithms with better performance may be overlooked. So far, no single algorithm performs best on all classification problems. How to choose the best classification algorithms and methods efficiently has therefore become a focus for researchers and domain experts. Through a large-scale experimental comparison and analysis of data classification algorithms, this paper aims to give researchers and developers practical guidance for selecting classification algorithms quickly.

The large-scale experiment consists of two parts: a comparison and analysis of multiclass classification algorithms, and a comparison and analysis of ensemble methods for binary classifiers in multiclass classification. In the first part, we tested 3 of the latest classification algorithms and 8 of the most well-known classification algorithms in data mining on 81 public datasets, and obtained several useful conclusions. GBDT (Gradient Boosting Decision Tree), Random Forests, ELM (Extreme Learning Machine), LibSVM and C4.5 are the top-5 algorithms in terms of classification accuracy. C4.5 is one of the earliest classification algorithms, whereas the first 3 (GBDT, Random Forests and ELM) were proposed in recent years. In addition, the SRC (Sparse Representation based Classification) algorithm is slightly inferior to C4.5 in classification accuracy, and its low efficiency is a serious disadvantage. This paper compares and analyzes the top-5 algorithms in detail in relation to the number of classes and the number of features of the datasets, to provide a comprehensive comparison.

In the second part, this paper compares and analyzes the performance of 3 types of ensemble methods for binary classifiers (strategies for decomposing a multiclass classification problem) on 31 public datasets: OVA (One-vs-All), OVO (One-vs-One) and ECOC (Error-Correcting Output Codes). For the OVA and OVO decomposition strategies, we test 9 different base classifiers, 3 OVA aggregation rules and 8 OVO aggregation rules. According to the results, the OVA decomposition strategy obtained the best accuracy on more datasets than the OVO decomposition strategy. The choice of base classifier has a certain impact on the performance of the different OVA and OVO aggregation rules. Furthermore, the OVA and OVO decomposition strategies do not improve classification accuracy for every base classifier. Using 10 different base classifiers, this paper also compares and analyzes the results of large-scale experiments on the ECOC framework with 3 coding methods and 6 decoding methods. The results show that different combinations of coding and decoding methods perform differently on different base classifiers. With a suitable combination of coding and decoding methods, the ECOC framework effectively improves classification accuracy on the datasets.

These results offer a useful reference and guidance for data mining, data analysis and many other practical applications. They will help researchers and engineers choose the algorithms with the highest accuracy for their specific datasets and applications.
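
To make the three decomposition strategies named in the abstract concrete, the following is a minimal Python sketch of how the outputs of binary classifiers might be aggregated into a multiclass prediction. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not name its aggregation rules or its coding and decoding methods, so the sketch uses three common textbook choices (maximum confidence for OVA, majority voting for OVO, Hamming decoding for ECOC). All function names, class labels and the toy codebook are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ova_aggregate(scores: Dict[str, float]) -> str:
    """OVA: one binary classifier per class; pick the class with the highest score.

    Assumption: max-confidence aggregation, one common OVA rule.
    """
    return max(scores, key=scores.get)


def ovo_pairs(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """OVO: one binary classifier per unordered pair of classes, k*(k-1)/2 in total."""
    return list(combinations(classes, 2))


def ovo_vote(winners: Dict[Tuple[str, str], str]) -> str:
    """OVO majority voting: each pairwise classifier casts one vote for its winner.

    Assumption: plain voting; ties go to the class that received a vote first.
    """
    votes: Dict[str, int] = {}
    for winner in winners.values():
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)


def ecoc_decode(codebook: Dict[str, Sequence[int]], outputs: Sequence[int]) -> str:
    """ECOC: pick the class whose codeword is closest to the binary outputs.

    Assumption: Hamming-distance decoding over a 0/1 codebook.
    """
    def hamming(code: Sequence[int]) -> int:
        return sum(a != b for a, b in zip(code, outputs))

    return min(codebook, key=lambda label: hamming(codebook[label]))


if __name__ == "__main__":
    classes = ["a", "b", "c"]  # hypothetical class labels
    print(ova_aggregate({"a": 0.1, "b": 0.7, "c": 0.2}))
    print(ovo_pairs(classes))
    print(ovo_vote({("a", "b"): "b", ("a", "c"): "c", ("b", "c"): "b"}))
    # Toy codebook with minimum Hamming distance 3, so one flipped bit is tolerated.
    toy_codebook = {"a": [0, 0, 0, 0, 0], "b": [1, 1, 1, 0, 0], "c": [0, 0, 1, 1, 1]}
    print(ecoc_decode(toy_codebook, [1, 0, 1, 0, 0]))
```

In this sketch OVA needs one binary classifier per class and OVO one per pair of classes, while ECOC uses one per codeword bit. The toy codebook shows why error-correcting codes can help: the output vector differs from the codeword of class "b" in one bit, yet it is still decoded to "b".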
Keywords/Search Tags: Data mining, Classification algorithms, Multiclass classification, Aggregation methods for binary classifiers, Ensemble methods