Decision Tree Classification Algorithm and Its Application in the Land Tax Collection and Management System

Posted on: 2011-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Ren
Full Text: PDF
GTID: 2208330332473027
Subject: Computer software and theory
Abstract/Summary:
China has pursued the informatization of tax management for more than 20 years. Over that time, tax management information systems have accumulated a large volume of tax data. How to select, distinguish, and analyze these numerous and complicated data, and how to extract the valuable knowledge and rules hidden behind them in support of tax management decision-making, has become an urgent task for tax administration.
Classification is an effective way to deal with disordered and complicated data. The classification techniques in wide use today include neural networks, genetic algorithms, Bayesian classification, and decision trees. Compared with these other popular techniques, the decision tree has distinct advantages: they become more apparent as the number of records in the database grows, and the accuracy of decision tree classification is comparable to, and sometimes higher than, that of other classification algorithms.
Given these merits and the massive volume of data held by the tax collection and management system, mining that data with decision tree classification algorithms is expected to help solve the practical problems that tax collection and management departments face in their work, and to improve the efficiency of their management and decision-making.
The main purpose of this research is to study decision tree classification algorithms in depth and, on that theoretical basis, to apply them to identifying the key sources of revenue of cities and counties in the tax collection and management system of Jilin Province.
The main research activities are as follows.
First, the thesis examines the traditional decision tree classification algorithms ID3, C4.5, and CART. Set against the characteristics of real tax data, it finds that these traditional algorithms are designed mainly for small data sets and that most of them require the training set to remain resident in memory, which restricts their scalability, accuracy, and efficiency. The traditional decision tree classification algorithms are therefore not suitable for analyzing massive data.
Second, based on the characteristics of the tax data, the thesis analyzes SLIQ, SPRINT, RainForest, and other decision tree classification algorithms suited to massive data processing, and describes in detail their data structures and their methods for choosing splitting attributes.
Finally, the thesis applies the decision tree classification algorithms for massive data processing to the tax collection and management system. The analysis finds that the key sources of revenue of cities and counties are concentrated among taxpayers with larger amounts of tax paid into the treasury and among large and medium-sized enterprises. On this basis, the thesis summarizes the specific data on the key sources of tax revenue of cities and counties as of the end of 2009 and analyzes the reasons in detail.
In addition, with these algorithms put into practical use, the thesis analyzes their actual performance in three respects: serial execution, parallel execution, and scalability. The following conclusions are obtained.
In the serial case, when the training set is small enough for the class list to reside in memory, SLIQ takes less time than serial SPRINT on the same amount of data.
When the size of the training set approaches 1 million records, SLIQ can no longer run and the system thrashes. Meanwhile, as the training set grows, the time taken by serial SPRINT continues to grow linearly.
In the parallel case, the thesis compares the two parallel SLIQ variants, SLIQ/D and SLIQ/R, with parallel SPRINT on three measures: the time to generate the decision tree, the time to find the best split point, and the time to perform the split. On all three, parallel SPRINT outperforms both SLIQ/D and SLIQ/R.
Scalability is critical to the validity of an algorithm when the amount of data it must process keeps growing. A classification algorithm with good scalability can handle large training sets while maintaining high classification accuracy. The thesis therefore also studies the scalability of parallel SPRINT, and the results show that SPRINT scales well in a parallel environment.
The experimental results show that, compared with the traditional decision tree classification algorithms, the decision tree classification algorithms designed for massive data processing speed up the mining of tax data, improve the efficiency of data processing, and show superior scalability.
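To make the "best split point" step concrete, the following is a minimal Python sketch of how a numeric attribute can be evaluated with the Gini index over a pre-sorted attribute list, the general approach associated with SLIQ and SPRINT. It is an illustration only, not the thesis's implementation: the function names, the midpoint threshold rule, and the toy data are assumptions introduced here, and the sketch keeps everything in memory rather than using disk-resident attribute lists.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class histogram: 1 - sum(p_j ** 2)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) for the best test `value <= threshold`.

    Hypothetical helper: the attribute list is sorted once, then scanned a
    single time while two class histograms (records below and above the
    candidate split) are updated incrementally.
    """
    attribute_list = sorted(zip(values, labels))  # pre-sorted (value, class) pairs
    n = len(attribute_list)
    below = Counter()                             # classes seen so far in the scan
    above = Counter(label for _, label in attribute_list)
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label = attribute_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue                              # no split between equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below * gini(below, n_below)
                 + n_above * gini(above, n_above)) / n
        if score < best[1]:
            # Assumption: use the midpoint of adjacent values as the threshold.
            best = ((value + next_value) / 2, score)
    return best


# Toy data (invented for illustration): tax paid and a made-up class label.
print(best_numeric_split([10, 20, 30, 40], ["low", "low", "high", "high"]))
# prints (25.0, 0.0)
```

Because the list is sorted once and both histograms are updated in place, every candidate split of the attribute is scored in a single pass, which is what makes the pre-sorting worthwhile on large training sets.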
Keywords/Search Tags: Decision tree classification algorithm, SLIQ, SPRINT, Parallel, Scalability