Comparative statistical analyses of automated Booleanization methods for data mining programs

Posted on:2000-07-29

Degree:Ph.D

Type:Dissertation

University:City University of New York

Candidate:Imberman, Susan Phyllis

GTID:1468390014961984

Subject:Computer Science

Abstract/Summary:

KDD (Knowledge Discovery in Databases) is the automated discovery of patterns and relationships in large databases. Data mining is one step in the KDD process. Many data mining algorithms and methods find data patterns using techniques such as neural networks, decision trees, statistical analysis, and deviation detection. The Boolean Analyzer is a data mining method that finds dependency rules of the form X ⇒ Y. Data is Booleanized with regard to values in a threshold set; that is, each data transaction/observation is transformed into a vector of 0's and 1's. Each vector defines a state for that transaction/observation. Vector states can be organized into a state occurrence matrix. From this matrix we can compute a measure of event dependency. A new matrix, called a state linkage matrix, can then be formed; it consists of the measures of the event dependency represented by a row and a column in the state occurrence matrix. The values of the state linkage matrix can be used to find the measures of more complex relationships. In turn, complex relationships can define rule systems, where the measure of a rule system is the value of the complex rule from which the rule system was derived. Rule systems can be more easily implemented than the complex rule.

A significant step in the above process is the Booleanization step. It is this step that most directly affects the statistical significance and strength of the rules generated by the algorithm. In the past, threshold sets were determined by an expert in the domain where the data analysis was being done. There is a basis for automating the formation of the threshold set. Rules generated by the Boolean Analyzer algorithm using an expert threshold set were compared with rules generated with threshold sets formed using the mean, mode, median, and clustering of variable values. This comparison yields the conclusion that threshold automation using mean and median values was a valid alternative to threshold formation by an expert. Subsequent analysis using mean and median thresholds on synthetic data with known relationships produced excellent results.
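The Booleanization step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`threshold_set`, `booleanize`, `state_counts`), the sample data, and the strict "greater than" comparison are all assumptions made for illustration. The abstract does not specify how the state occurrence matrix is laid out or how the dependency measure is computed, so the sketch stops at counting how often each state vector occurs.

```python
from collections import Counter
from statistics import mean, median


def threshold_set(rows, method="mean"):
    """Return one threshold per variable (column), using the mean or median.

    Hypothetical helper: the choice of one threshold per column is an assumption.
    """
    stat = {"mean": mean, "median": median}[method]
    return [stat(column) for column in zip(*rows)]


def booleanize(rows, thresholds):
    """Turn each observation into a tuple of 0's and 1's.

    Assumption: a value maps to 1 when it is strictly above its threshold.
    """
    return [
        tuple(int(value > t) for value, t in zip(row, thresholds))
        for row in rows
    ]


def state_counts(states):
    """Count how many observations fall into each Boolean state vector."""
    return Counter(states)


# Made-up observations with three numeric variables each.
rows = [
    (1.0, 10.0, 5.0),
    (2.0, 20.0, 1.0),
    (3.0, 30.0, 3.0),
    (6.0, 40.0, 7.0),
]

mean_states = booleanize(rows, threshold_set(rows, "mean"))
median_states = booleanize(rows, threshold_set(rows, "median"))
counts = state_counts(mean_states)
```

Note that the third observation lands in a different state under the mean thresholds than under the median thresholds, which is the kind of sensitivity to the threshold set that the abstract attributes to the Booleanization step.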

Keywords/Search Tags:

Data, Relationships, Threshold, Using, Statistical

Related items

1	Inferences about threshold effects in macroeconomic relationships
2	Research On Image Denoising Algorithm Based On Non-subsampled Contourlet Transform And Statistical Modeling
3	Adapting masking techniques for estimation problems involving non-monotonic relationships in privacy-preserving data mining
4	Design And Implementation Of Passenrgers Statistical System For Bus
5	The Research Of Fast Constructing Topological Relationships Of Regional Spatial Data
6	Study On Discovering The Relationships Among Data Resources In DataSpace
7	The Design And Implementation Of Statistical Data Collection And Process System Of Dujiangyan Statistical Bureau
8	The Research On Epidemic Dynamics Based On Social Relationships
9	Mining Dynamic Relationships From Spatio-temporal Datasets: An Application to Brain fMRI Data
10	A Method For Statistical Static Timing Analysis At Near-threshold Voltage