On an incomplete data problem in modeling: Evidence from Web usage mining and a general purpose solution

Posted on: 2004-11-26
Degree: Ph.D.
Type: Thesis
University: University of Pennsylvania
Candidate: Zheng, Zhiqiang
Full Text: PDF
GTID: 2468390011470399
Subject: Business Administration
Abstract/Summary:
In business domains, firms often have only incomplete information on their customers. Acquiring complete information for all customers can prove prohibitively expensive. This dissertation shows how selective information acquisition can reduce the amount of additional information a firm must acquire to supplement its incomplete customer information.

One example of incomplete customer information stems from the web usage domain. As revealed in this thesis, the data collected locally by a single firm on its customers' accesses to its web site (site-centric data) is inherently incomplete, because it does not capture user behavior across sites. While most users search multiple sites in a session, site-centric data captures only a single tree in the forest. By looking at only that one tree, can a site accurately capture consumers' online behavior and subsequently build correct customer models? The first half of this thesis investigates this problem and empirically demonstrates that incomplete data not only hurts model performance but, more importantly, can lead to erroneous managerial decisions.

The naïve solution to the above incomplete data problem, acquiring the complete data for all customers, is often impractical due to cost. A natural alternative is to acquire complete data for some customers and to use it to improve the models built. We define selective data acquisition as the task of determining how many, and which, customers to acquire additional data from. Our solution to the problem employs a utility function to assess the value of a specific customer's data to the model. In the second half of this thesis we develop two specific utility functions, one for logistic regression and one for decision trees. We empirically test the methods on web usage data provided by Jupiter Media Metrix and on common UCI datasets. The results show that the methods perform well and indicate that selective data acquisition is a promising area for research.
Keywords/Search Tags:Data, Incomplete, Web usage, Information, Customers, Problem