Research On The Automatic Classification Algorithm Of Archive Text Based On Decision Tree

Posted on: 2016-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: S F Huang
GTID: 2298330470454086
Subject: Systems analysis and integration
Abstract/Summary:
In the era of data explosion, extracting the data we need from massive data sets is a major problem. Data mining technology addresses this problem and has become a hot research focus for experts and scholars. Text classification, which further narrows the scope of data mining, has become a very important research field within it. A good classification model and an optimal modeling method can both reduce the time cost of text classification and improve its accuracy. How to establish a classification model quickly, how to reduce the time cost of text classification, and how to improve its accuracy are therefore the research foci of this thesis.

Building on the C4.5 algorithm, this thesis introduces the concept of equivalent infinitesimals from higher mathematics to improve the formula in C4.5 that requires complex logarithmic computation. The improved formula replaces the logarithm with simple mixed arithmetic using the four basic operations (addition, subtraction, multiplication, and division). This removes the need to call a library function to compute logarithms, reduces the time C4.5 takes to generate a decision tree, and thereby reduces the time cost of the text classification process.

When requirements change, the original decision tree no longer meets the need, and the decision attribute has to change. At the same time, no training data set is available for the new requirements. To address this problem, the thesis presents a method of generating a decision tree from classification rules, in three steps. First, classification rules are written by hand according to the requirements and human experience. Second, a decision tree is generated from these production rules. Finally, the decision tree is adjusted with machine learning methods so that it meets the current requirements.

In short, the thesis pursues two goals throughout: reducing the time cost of generating decision trees, and building a decision tree quickly when no training data set is available. It achieves them by optimizing the calculation formula of the C4.5 algorithm and by presenting a method that converts classification rules directly into a decision tree, and it uses example analysis and experimental results to verify the effectiveness of the improved method. Finally, the improved method is applied to classifying the archive text data of the Yunnan cigarette factory, with good results.
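The abstract does not state the improved formula itself. As a minimal sketch of the general idea, assume the equivalent infinitesimal used is ln(1 - x) ~ -x as x approaches 0, so that ln(p) = ln(1 - (1 - p)) is replaced by -(1 - p) and each entropy term -p ln(p) becomes p(1 - p). The function names below are hypothetical, and the thesis's actual formula may differ.

```python
import math


def entropy_exact(counts):
    """Shannon entropy in nats, computed with the library logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def entropy_approx(counts):
    """Log-free surrogate (assumed form, not taken from the thesis).

    Uses ln(p) ~ -(1 - p), so each term -p * ln(p) becomes p * (1 - p).
    Only addition, subtraction, multiplication and division are needed.
    """
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)


if __name__ == "__main__":
    # Hypothetical class counts at three candidate tree nodes.
    for counts in ([9, 5], [7, 7], [14, 0]):
        print(counts, entropy_exact(counts), entropy_approx(counts))
```

The two functions return different values, since the approximation is only first order. The sketch shows only that the surrogate avoids the logarithm call. Whether it selects the same splits as the exact formula depends on the data and on the formula actually used in the thesis.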
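The rule-to-tree step can be illustrated in a similar spirit. The abstract does not describe the conversion procedure, so the following is a sketch under stated assumptions: each hand-written rule is an ordered list of (attribute, value) conditions with a class label, the rules test attributes in a consistent order, and they do not contradict one another. The attribute names, labels, and function names are hypothetical examples, not taken from the thesis.

```python
def rules_to_tree(rules):
    """Build a nested-dict decision tree from (conditions, label) rules.

    Each internal node is {attribute: {value: subtree_or_label}}.
    Assumes rules test attributes in a consistent order and do not conflict.
    """
    tree = {}
    for conditions, label in rules:
        node = tree
        for i, (attr, value) in enumerate(conditions):
            branches = node.setdefault(attr, {})
            if i == len(conditions) - 1:
                branches[value] = label  # last condition: attach the class label
            else:
                node = branches.setdefault(value, {})  # descend or create subtree
    return tree


def classify(tree, record):
    """Walk the tree for one record; returns None if no branch matches."""
    node = tree
    while isinstance(node, dict):
        attr = next(iter(node))  # the single attribute tested at this node
        node = node[attr].get(record.get(attr))
    return node


# Hypothetical hand-written rules for archive documents.
RULES = [
    ([("doc_type", "report"), ("department", "finance")], "financial"),
    ([("doc_type", "report"), ("department", "production")], "production"),
    ([("doc_type", "contract")], "legal"),
]

if __name__ == "__main__":
    tree = rules_to_tree(RULES)
    print(tree)
    print(classify(tree, {"doc_type": "report", "department": "finance"}))
```

A tree built this way needs no training data. The later adjustment with machine learning methods mentioned in the abstract is not covered by this sketch.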
Keywords/Search Tags: Text classification, C4.5 algorithm, Production rules, Classification rule, Algorithm optimization