A scalable hybrid model for health care insurance fraud detection using association rules and random forest

Posted on: 2016-07-16
Degree: M.Eng
Type: Thesis
University: State University of New York at Binghamton
Candidate: Alqudah, Mohammad Khaled
Full Text: PDF
GTID: 2479390017476966
Subject: Industrial Engineering
Abstract/Summary:
Fraud detection is a growing area of focus in the health care industry because fraud has a major effect on health care expenses and quality of service. This research therefore proposes a novel approach, the Hybrid Association Rules and Random Forest (HARRF), to detect fraud in health care insurance claims. In HARRF, frequent patterns extracted through Frequent Pattern Growth (FP-growth) are used to construct Association Rules, which then serve as a new feature space for the data. The extracted Association Rules are filtered and used to transform the training data into this feature space, producing the Transformed Feature Matrix (TFM). The TFM process unifies the feature space for the claims, condenses the information, and reduces the dataset size. Next, the TFM is used as the input to train the Random Forest (RF) classifier. The testing data is likewise transformed into a separate TFM using the same feature space. A public Medicare insurance claim dataset (DE-SynPUF), which contains 160 million claims for 2.4 million beneficiaries, is used to train and validate the proposed methodology. HARRF is validated through several experiments and 5-fold cross-validation. In addition, design of experiments is used to identify the parameters critical to prediction accuracy, and parameter tuning strategies are derived from the results. The average model accuracy achieved through cross-validation is 90%. Because of the size of the data, distributed computing (Hadoop) is used to train and test the proposed methodology. Finally, this research studies the effect of the number of Hadoop nodes on RF performance time.
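The transformation of claims into a TFM can be illustrated with a minimal sketch. The abstract does not specify how the matrix is encoded, so this example assumes a binary encoding: each filtered rule becomes one column, and a cell is 1 when the claim contains every item of that rule. The function name `to_tfm`, the representation of a rule as a single itemset, and all item labels are hypothetical and chosen only for illustration; they are not taken from the thesis or from DE-SynPUF.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a rule is represented by the set of items it involves.
Rule = FrozenSet[str]
Claim = FrozenSet[str]


def to_tfm(claims: Sequence[Claim], rules: Sequence[Rule]) -> List[List[int]]:
    """Map each claim to a binary row with one column per rule.

    A cell is 1 when every item of the rule appears in the claim
    (assumed encoding; the thesis may use a different one).
    """
    return [[1 if rule <= claim else 0 for rule in rules] for claim in claims]


# Hypothetical filtered rules, e.g. the output of FP-growth plus rule filtering.
rules: List[Rule] = [
    frozenset({"dx=D1", "proc=P1"}),
    frozenset({"dx=D2"}),
    frozenset({"proc=P2", "provider=X"}),
]

# Hypothetical training claims, each a set of attribute=value items.
train_claims: List[Claim] = [
    frozenset({"dx=D1", "proc=P1", "provider=X"}),
    frozenset({"dx=D2", "proc=P2", "provider=X"}),
    frozenset({"dx=D3"}),
]

# Testing claims are transformed with the same rules, so both matrices
# share one feature space.
test_claims: List[Claim] = [frozenset({"dx=D2", "proc=P1"})]

train_tfm = to_tfm(train_claims, rules)
test_tfm = to_tfm(test_claims, rules)

for row in train_tfm:
    print(row)
```

Under these assumptions every claim maps to a row of the same width regardless of how many attributes it carries, which is one way to read the statement that the TFM unifies the feature space. The resulting rows, paired with fraud labels, would then be passed to a random forest implementation of the reader's choice.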
Keywords/Search Tags:Health care, Association rules, Random, Insurance, Model, Feature space, TFM, HARRF