
Probability estimation in multi-relational domains

Posted on: 2006-05-25
Degree: Ph.D
Type: Dissertation
University: New York University, Graduate School of Business Administration
Candidate: Perlich, Claudia
Full Text: PDF
GTID: 1458390008957544
Subject: Computer Science
Abstract/Summary:
Modern businesses collect and store vast amounts of data on business transactions, customers, and accounting activities. The relational data format provides the necessary flexibility to represent different entities and the relations between them to capture the reality of complex business domains. The analysis of data in a relational format with multiple entity types in different tables traditionally required the manual construction of variables using summary statistics and aggregates such as recency, frequency, and average price. This manual process not only has become increasingly time consuming as the complexity of domains has increased, but also involves a significant loss of information and prevents the discovery of interesting novel relationships.

The objective of this dissertation is to develop a reliable relational modeling approach that provides good generalization performance on noisy domains. For this purpose we study, theoretically and empirically, the properties of relational classification methods in general and of aggregation methods for automated feature construction in particular.

Our theoretical contributions to the statistical relational learning literature include: (1) The presentation of a hierarchy of relational concepts of increasing complexity, using relational schema characteristics such as cardinality, and the derivation of classes of aggregation operators that are needed to express and learn these concepts; (2) The analysis of desirable properties of aggregation operators for predictive modeling with the ultimate goal of limiting the loss of predictive information while providing a suitable representation for model induction; (3) The formalization of a "relational fixed-effect" model that extends the traditional Naive Bayes classifier to relational domains and makes three simplifying assumptions: class-conditional independence between attributes and bags, independent identically distributed (iid) samples of related objects, and the existence of only two distributions from which related objects could have been drawn; (4) The derivation of a new class of aggregation operators that take advantage of conditioning on the class label to produce predictive features that improve generalization performance; (5) The exploration of opportunities arising from the aggregation of object identifiers, including the ability to learn from unobserved object characteristics and the learning of concepts that violate the above-mentioned assumptions.

This new aggregation and feature construction approach is implemented in a prototype for "Automated Construction of Relational Features" (henceforth ACoRA). A large-scale empirical evaluation of ACoRA's generalization performance on classification and probability estimation tasks shows superior performance over alternative aggregation methods across a variety of complex and noisy application domains, including direct marketing for online retailing, citation-based document classification, medical diagnostics, prediction of bank-loan default, customer classification for life insurance, and terrorist identification.
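To make the contrast between manual summary aggregates and class-conditional aggregation more concrete, the following is a minimal Python sketch. It is not ACoRA and is not taken from the dissertation: the toy transaction table, the labels, and the function names (`summary_aggregates`, `class_conditional_counts`, `log_likelihood_ratio`) are all hypothetical. The second feature is only one plausible reading of an aggregate that conditions on the class label under the three assumptions listed above (two distributions, iid related objects, class-conditional independence); the Laplace smoothing is an added assumption.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy data: (customer_id, day, price, category).
transactions = [
    ("c1", 1, 10.0, "books"),
    ("c1", 5, 30.0, "books"),
    ("c2", 2, 20.0, "games"),
    ("c2", 3, 40.0, "games"),
    ("c2", 9, 60.0, "books"),
    ("c3", 4, 50.0, "games"),
]
labels = {"c1": 1, "c2": 0, "c3": 0}  # hypothetical binary class labels
TODAY = 10  # hypothetical reference day for recency


def summary_aggregates(rows, today):
    """Manual-style features: recency, frequency, and average price per customer."""
    bags = defaultdict(list)
    for cid, day, price, _category in rows:
        bags[cid].append((day, price))
    features = {}
    for cid, items in bags.items():
        days = [d for d, _ in items]
        prices = [p for _, p in items]
        features[cid] = {
            "recency": today - max(days),
            "frequency": len(items),
            "avg_price": sum(prices) / len(prices),
        }
    return features


def class_conditional_counts(rows, labels):
    """Count category occurrences separately for each class (two distributions)."""
    counts = {0: Counter(), 1: Counter()}
    for cid, _day, _price, category in rows:
        counts[labels[cid]][category] += 1
    return counts


def log_likelihood_ratio(bag, counts, alpha=1.0):
    """Score a bag of categories as the sum of log P(cat | 1) / P(cat | 0).

    Treats bag members as iid draws; alpha is Laplace smoothing (an assumption).
    """
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (0, 1)}
    score = 0.0
    for category in bag:
        p1 = (counts[1][category] + alpha) / totals[1]
        p0 = (counts[0][category] + alpha) / totals[0]
        score += math.log(p1 / p0)
    return score


aggregates = summary_aggregates(transactions, TODAY)
counts = class_conditional_counts(transactions, labels)
print(aggregates["c1"])
print(log_likelihood_ratio(["books", "books"], counts))
```

In this sketch the summary aggregates discard which categories a customer bought, while the log-likelihood-ratio feature retains that information in a single number. In any real use, the class-conditional counts would need to be estimated on data separate from the cases being scored, since otherwise the label leaks into the feature.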
Keywords/Search Tags:Relational, Domains, Classification