A comparison of data mining methods for binary response variables in direct marketing

Posted on: 2010-09-25
Degree: Ph.D.
Type: Dissertation
University: University of Illinois at Chicago
Candidate: Sparks, John James
Full Text: PDF
GTID: 1448390002988001
Subject: Business Administration
Abstract/Summary:
The past 20 years have seen an expansion in the number of data mining tools available to predict binary response. A number of studies comparing model performance have appeared in marketing and management academic journals. Most of these studies, however, have compared a small number of algorithms and used only one or two data sets, making generalization difficult. This study seeks to advance the state of the research by using a comparatively large number (nine) of academic data sets from the direct marketing field and a set of five algorithms that are commonly available in commercial software packages. Five data preparation schemes are also used to test for the effect of pre-processing.

Results showed that linear methods had the highest average performance among the algorithms, as did the data preparation scheme that reduced the number of variables to the 15 strongest. Interaction effects, however, were present, and the single best combination was a neural network with the top 15 variables. Performance of the algorithms on the holdout sample was related to the "global R-square" of the analysis sample and to the response rate. The relative performance of the neural network versus logistic regression was related to a number of characteristics of the data file, most of which reflected the strength of the relationship between the dependent and independent variables and the response rate. Two newer algorithms, MARS and random forests, were analyzed in a limited comparison to determine whether their performance is superior to that of the older methods. Results showed that the implementation of MARS in the R package provided no benefit when the default settings were left unchanged. The random forest algorithm produced a remarkable improvement in performance for one of the five files that it processed.

This work provides a basis for future research to expand the number of algorithms compared, to build a larger number of data sets through simulation in order to further increase generalizability, and to study the comparative performance of algorithms for both response and dollars spent, as well as the effect of drift.
Keywords/Search Tags: Response, Data, Performance, Algorithms, Variables, Comparison, Methods