
Improved variable and value ranking techniques for mining categorical data

Posted on: 2006-12-04
Degree: Ph.D
Type: Dissertation
University: The University of Alabama
Candidate: Wang, Huanjing
Full Text: PDF
GTID: 1458390008472677
Subject: Computer Science
Abstract/Summary:
The ever-increasing size of datasets used for data mining and machine learning applications has placed a renewed emphasis on algorithm performance and processing strategies. This dissertation research addresses value ranking, variable selection, and processing strategies for mining categorical data. Our research formalizes a value ranking technique, MG (Max Gain), that, when compared to current approaches, produces similar results with faster execution time. We build upon these results to propose a new variable ranking and selection technique named SMGR (Sum Max Gain Ratio). SMGR is shown to provide results similar to established approaches with significantly better runtime performance and near-equivalent theoretical complexity.

The empirical performance of MG and SMGR is also compared to existing techniques based on the storage and processing structure of the underlying datasets. Our results confirm previous research which concludes that, for certain statistical operations, column-major storage and processing outperforms the more common row-major approach.

The new value and variable ranking methods are incorporated into the Critical Analysis Reporting Environment (CARE), an award-winning statistical analysis software package developed at The University of Alabama for application to highway safety data. Using a case study example, this dissertation research addresses how to solve highway safety problems using the proposed methods.
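
The storage-layout comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's MG or SMGR code, which the abstract does not define; it only shows one statistical operation (counting the values of a single categorical variable) over a row-major and a column-major layout. The toy records and the function names `counts_row_major` and `counts_column_major` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: each row is one record, each position one categorical variable.
rows = [
    ("rain", "night", "rural"),
    ("dry", "day", "urban"),
    ("rain", "day", "urban"),
    ("dry", "night", "urban"),
]

# Column-major layout: one list per variable instead of one tuple per record.
columns = [list(col) for col in zip(*rows)]


def counts_row_major(rows, var_index):
    """Count the values of one variable by visiting every record."""
    return Counter(row[var_index] for row in rows)


def counts_column_major(columns, var_index):
    """Count the values of one variable by scanning a single column."""
    return Counter(columns[var_index])


# Both layouts yield the same frequencies; only the access pattern differs.
print(counts_row_major(rows, 0))
print(counts_column_major(columns, 0))
```

In the row-major version every record is touched even though only one field is needed, whereas the column-major version reads just the values of the variable of interest. Whether this translates into a measurable speedup depends on the data size and storage medium, which this sketch does not attempt to measure.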
Keywords/Search Tags:Data, Value ranking, Mining, Variable