Research On The Decision-Tree-Based Prediction System Of Massive Time-Serial Unbalanced Data

Posted on: 2006-04-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Shao
Full Text: PDF
GTID: 1118360155458156
Subject: Computer application technology
Abstract/Summary:
Since the end of the last century, as data mining technology has gradually matured, applying data mining to fraud detection has become an important research field. The data in such applications share the same basic characteristics: they are massive, time-serial, and unbalanced. Addressing these characteristics, this thesis studies the massive, time-serial, unbalanced prediction problem from four aspects: attribute construction technology, splitting measure theory, splitting measure experiment method, and the application method of data mining prediction models. It proposes a decision-tree-based prediction system for massive time-serial unbalanced data. The major contributions are as follows.

Attribute Construction Technology in the Data Pre-Processing Process

(1) Three rules for keeping the relationships between attributes consistent during attribute construction are proposed. These rules standardize the conditions under which attribute construction may be used in data mining applications. The thesis also identifies the attribute relationship inconsistency problem caused by using count operators without restriction, which can produce false data and ultimately render a model built on count operators invalid in practical application.

(2) A time-serial count operator and its incremental algorithm are proposed. Because the time-serial count operator is based on those rules, it avoids the attribute relationship inconsistency problem. To reduce the operator's computing cost, an incremental algorithm is also proposed. In application systems that support incremental data, the transaction period satisfies the requirement of the incremental algorithm, so the algorithm needs to process only a small quantity of incremental data, which makes it highly valuable in applications.

Splitting Measure Theory

(3) The linearity distance rule and the generalized distance rule of splitting measures are proposed.
First, the purposes of decision tree research and of applying splitting measures are examined. Second, the equivalence relationship between decision trees is defined, the transformability of splitting measures is proposed, and the primary form of the splitting measure parameter, the simple parameter matrix, is presented together with the important parameter problem of impurity theory. Then an ideal splitting measure rule, the linearity distance measure rule, is proposed. Building on this analysis, an integrated view of the effects of splitting measures and their interests is proposed, together with the generalized distance measure rule, which expresses the mathematical commonality of all splitting measures. Current measure theories and families are proven to comply with the generalized distance measure rule, and the problems of measures for continuous attributes are presented.

Splitting Measure Experiment Method

(4) A walkthrough experiment method, which shows the properties of splitting measures more fully and profoundly, is proposed. Through these experiments the generalized distance theory is partly verified and the best splitting measure is identified. For experiments covering a wide range of splitting measures, two data construction algorithms, based on the simple parameter matrix and the contingency table, are proposed. Then, by comparing splitting measure values under various distributions, the thesis analyzes the measure value surface, validates whether each measure satisfies the minimum and maximum sub-rule of the generalized distance measure rule, evaluates the computational complexity of the measures, tests the measures for multi-split bias, tests the core function for concavity, and tests the measures for majority-class bias.
The experimental results show that, for massive unbalanced data, chi-square performs better than the other measures, and that all measures satisfy the minimum and maximum sub-rule of the generalized distance measure rule.

Application Method of Data Mining Prediction Model

(5) A multi-strategy framework for a massive time-serial unbalanced prediction system is proposed. Targeting the generic massive time-serial unbalanced prediction problem, and aiming to improve the effectiveness of data mining applications, the framework builds on the aforementioned research results. It uses a hybrid algorithm of decision tree and neural network and supports the whole two-stage data mining process, multi-level users, process visualization, on-line fraud detection with a planning audit strategy, a balance strategy between audit gain and cost, an expert-accrediting multi-classifier prediction model, distributed multi-mission management, and so on.

Finally, drawing on the Customs declaration fraud detection project and the aforementioned research results, the author designs and implements a decision-tree-based massive time-serial unbalanced prediction system. The data manipulate...
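
As an illustration of the chi-square splitting measure named in item (4), the following minimal sketch computes the standard Pearson chi-square statistic on a contingency table whose rows are the branches of a candidate split and whose columns are the classes. It is not the thesis's implementation: the function name `chi_square` and the example counts are hypothetical, and the table is assumed to be non-empty.

```python
from typing import Sequence


def chi_square(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic for a branches-by-classes contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if branch and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical unbalanced data: columns are (normal, fraud) counts per branch.
split_a = [[900, 10], [80, 10]]   # second branch concentrates the rare class
split_b = [[490, 10], [490, 10]]  # both branches mirror the overall class ratio

best = max([split_a, split_b], key=chi_square)
```

A split whose branches all reproduce the overall class distribution scores zero, so in this sketch `split_a` is preferred over `split_b`.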
Keywords/Search Tags:Data Mining, KDD, Decision Tree, Fraud Detection, Attribute Construction, Count Operator, Prediction System of Massive Time-Serial Unbalanced Data
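
The abstract does not define the time-serial count operator of item (2), so the sketch below shows only one plausible reading: for each new transaction, count the earlier transactions with the same key that fall inside a fixed look-back window, and update that count incrementally instead of rescanning history. The class name `WindowedCounter`, the `window` parameter, and the assumption that transactions arrive in non-decreasing timestamp order are all hypothetical and not taken from the thesis.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable


class WindowedCounter:
    """Incrementally counts earlier events per key inside a look-back window."""

    def __init__(self, window: float) -> None:
        self.window = window
        # For each key, timestamps of events still inside the window, oldest first.
        self._recent: Dict[Hashable, Deque[float]] = defaultdict(deque)

    def update(self, key: Hashable, timestamp: float) -> int:
        """Return the count of earlier events for key, then record this event.

        Only events recorded before this call with a timestamp greater than
        (timestamp - window) are counted, so the value never depends on
        transactions that arrive later.
        """
        recent = self._recent[key]
        # Evict events that have fallen out of the look-back window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        count = len(recent)
        recent.append(timestamp)
        return count


# Hypothetical stream of (key, timestamp) pairs in arrival order.
counter = WindowedCounter(window=10)
stream = [("a", 0), ("a", 5), ("b", 6), ("a", 9), ("a", 12), ("a", 30)]
counts = [counter.update(key, ts) for key, ts in stream]
```

Each call touches only the timestamps for one key that are still inside the window, which is the kind of saving an incremental algorithm is meant to provide when new data arrives in batches.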