Statistical models for data mining: General inferences and class discovery in large databases

Posted on: 2004-03-27
Degree: Ph.D
Type: Thesis
University: The Pennsylvania State University
Candidate: Browning, John Duncan
Full Text: PDF
GTID: 2468390011459075
Subject: Engineering
Abstract/Summary:
This thesis applies statistical models to data mining, the search for patterns in large data sets. With cheaper high-capacity storage, faster communication, and increasing computing power, large databases can be searched, or ‘mined’, for correlations in the data. These databases arise in business, biology, astronomy, weather forecasting, natural language applications, speech recognition, and many other areas. They are typically much larger than traditional pattern recognition databases, so algorithms applied to them must scale with the data. A second identifying trait of data mining applications is missing and erroneous data: errors can occur during data entry, and data can be missing either randomly or deterministically. One advantage of statistical models is that they are grounded in mathematical theory, which enables a principled approach to missing or erroneous data. We investigate the application of statistical models to two data mining tasks that involve a large amount of missing data. The first is collaborative filtering, which is inference when most of the data is missing. The second is a new problem in which some of the data comes from unknown classes that must be discovered; this problem is related to data clustering.
Keywords/Search Tags: Data mining, Statistical models, Databases