Statistical models for data mining: General inferences and class discovery in large databases

Posted on:2004-03-27

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:Browning, John Duncan

GTID:2468390011459075

Subject:Engineering

Abstract/Summary:

This thesis applies statistical models to data mining, the search for patterns in large data sets. Cheaper high-capacity storage, faster communication, and increasing computing power have made it possible to search, or ‘mine’, large databases for correlations in the data. Such databases arise in business, biology, astronomy, weather forecasting, natural language applications, speech recognition, and many other areas. They are typically much larger than traditional pattern recognition databases, so the algorithms applied to them must scale with the data. A second identifying trait of data mining applications is missing and erroneous data: errors can occur during data entry, and values can be missing, either randomly or deterministically. One advantage of statistical models is that they are grounded in mathematical theory, which enables a principled approach to missing or erroneous data. We investigate the application of statistical models to two data mining tasks in which much of the data is missing. The first is collaborative filtering, which is inference when most of the data is missing. The second is a new problem in which some of the data comes from unknown classes that must be discovered; this problem is related to data clustering.

Keywords/Search Tags:

Data mining, Statistical models, Databases

Related items

1	The Analysis Of Statistical Models In Data Mining
2	Statistical learning and data mining in biological databases
3	Integration of multiple prediction models for centralized and distributed knowledge discovery in databases
4	Research On Context-Based Statistical Relational Learning
5	Intelligent Intrusion Detection, Data Mining
6	Latent factor models for statistical relational learning
7	Statistical learning from relational databases
8	Shape mining in three dimensional models databases
9	Statistical data mining and its applications in health claims
10	Statistical mining in data streams