The Decision Tree Algorithm Based On Large Databases And Implementation

Posted on: 2008-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Chang
GTID: 2208360215966878
Subject: Computer application technology
Abstract/Summary:
Classification is one of the important basic tasks in data mining and machine learning. It is used to analyze large volumes of related data and to build classification models for problems in many areas. Classification techniques are widely applied in scientific research, communication, finance and other fields. The decision-tree classifier is a very important model in the knowledge discovery process. Its good interpretability, fast classification speed and strong classification performance have gradually made it a research focus in data mining and machine learning.
The most classical decision-tree learning system is ID3, which uses a divide-and-conquer approach to induce a decision tree from the root to the leaves and chooses the splitting attributes by information gain. This method tends to yield a simple tree. However, ID3 can handle only nominal attributes, not numeric ones, and it usually overfits the training data. The C4.5 algorithm is an extension of ID3. It extends the classification ability of ID3 from nominal attributes to numeric attributes, and it addresses the overfitting problem by pruning the decision tree. It is now widely regarded as a better decision-tree classifier.
In real applications, decision trees are built on large databases holding massive amounts of data. How to integrate decision-tree construction with database technology is a problem worth researching, so many earlier algorithms have been revisited and extended. This thesis focuses on a scalable classification algorithm that tightly integrates decision-tree construction with database technology. We use SQL to carry out data preprocessing and the computation of the attribute selection measure, and we store the decision tree in a relational database.
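As an illustration of the information-gain criterion that ID3 uses to choose splitting attributes, here is a minimal Python sketch. The function names (`entropy`, `information_gain`) and the toy rows are assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on a nominal attribute."""
    base = entropy([r[target] for r in rows])
    # Group the class labels by the value of the candidate attribute.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical toy data, used only to exercise the functions.
toy_rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# ID3 picks the attribute with the largest information gain.
best = max(["outlook", "windy"], key=lambda a: information_gain(toy_rows, a, "play"))
```

On this toy data `outlook` separates the classes perfectly, so it is the attribute an ID3-style learner would choose at the root.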
In this thesis, both the training set used to build the decision tree and its subsets are defined as views. During tree construction, the main computation is carried out in SQL, the standard database language. The view-based classification algorithm exploits the processing capacity of a large database and is easy to implement. Finally, an experiment was designed using the KDD CUP 2004 data. The data were loaded into a relational database and preprocessed with SQL, and a decision tree was built and stored in the database. The experiment shows that building decision trees with the processing capacity of a large database is feasible and efficient.
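The view-based idea described above can be sketched roughly as follows, using Python's standard `sqlite3` module as a stand-in for a large relational database. The table, view and column names (`training`, `node_rainy`, `outlook`, `windy`, `play`) and the helper functions are hypothetical, and the thesis's actual schema and queries may differ. The sketch defines a node's subset of the training set as a view and lets SQL compute the class counts needed by the attribute selection measure.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training (outlook TEXT, windy TEXT, play TEXT);
INSERT INTO training VALUES
  ('sunny','no','no'), ('sunny','yes','no'),
  ('rainy','no','yes'), ('rainy','yes','no'),
  ('overcast','no','yes'), ('overcast','yes','yes');
-- The subset of the training set at one tree node, defined as a view.
CREATE VIEW node_rainy AS SELECT * FROM training WHERE outlook = 'rainy';
""")


def class_counts(conn, relation, attribute, target):
    """Return {attribute value: {class: count}} using one GROUP BY query."""
    # Identifiers are interpolated directly; acceptable only in a trusted sketch.
    sql = (
        f"SELECT {attribute}, {target}, COUNT(*) "
        f"FROM {relation} GROUP BY {attribute}, {target}"
    )
    counts = {}
    for value, cls, n in conn.execute(sql):
        counts.setdefault(value, {})[cls] = n
    return counts


def split_entropy(counts):
    """Weighted entropy (in bits) of the partitions described by counts."""
    total = sum(sum(c.values()) for c in counts.values())
    result = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum(n / size * math.log2(n / size) for n in c.values())
        result += size / total * h
    return result


# Counts for the root node (base table) and for a child node (view).
root_counts = class_counts(conn, "training", "outlook", "play")
node_counts = class_counts(conn, "node_rainy", "windy", "play")
```

Because the database does the scanning and aggregation, the client only receives one small row per (attribute value, class) pair, which is the property that lets this style of algorithm scale with the database rather than with client memory.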
Keywords/Search Tags: data mining, decision tree, view, SQL