Massive search for detecting group differences

Posted on: 2002-07-03
Degree: Ph.D.
Type: Dissertation
University: University of California, Irvine
Candidate: Bay, Stephen Dongjun
Full Text: PDF
GTID: 1468390011490415
Subject: Computer Science
Abstract/Summary:
Comparing objects is a natural method for understanding their properties, especially when one object is well known and serves as a reference. With the availability of large databases of information, many analysts want to compare various groups in their data to understand the differences between them. For example, an admissions officer at UCI may be interested in comparing student applicants who accept UCI's admission offer to those who decline. A demographer may be interested in comparing the decennial Census databases to track how the Los Angeles-Long Beach population has been changing over the past few decades.

Because electronic data collection is easy, many data sets are very large and have many variables and examples, making automated computer analysis mandatory. However, a straightforward approach in which the computer considers every combination of measurement variables as a potential difference is infeasible, because the number of candidates grows exponentially and quickly outstrips the processing power of modern computers. The huge number of candidates raises three major research questions. First, how do we deal with the computational cost of searching for differences in this extremely large space of candidates? Second, how do we keep false positives (errors) from accumulating during the search and dominating the results? Finally, there may be a substantial number of differences between the groups; how can the results be presented so that they are easily understood by human analysts?

In my dissertation, I address these questions and develop a computer tool that finds differences between groups from observational multivariate data. I demonstrate that this tool can analyze data in an exploratory manner, and I then show how it can serve as an important component in other novel knowledge discovery algorithms, such as multivariate discretization of continuous variables and characterizing classification models.
Keywords/Search Tags:Comparing, Variables