
Comparative statistical analyses of automated Booleanization methods for data mining programs

Posted on: 2000-07-29
Degree: Ph.D.
Type: Dissertation
University: City University of New York
Candidate: Imberman, Susan Phyllis
Full Text: PDF
GTID: 1468390014961984
Subject: Computer Science
Abstract/Summary:
KDD (Knowledge Discovery in Databases) is the automated discovery of patterns and relationships in large databases. Data mining is one step in the KDD process. Many data mining algorithms and methods find data patterns using techniques such as neural networks, decision trees, statistical analysis, and deviation detection. The Boolean Analyzer is a data mining method that finds dependency rules of the form X → Y. Data is Booleanized with regard to values in a threshold set; that is, each data transaction/observation is transformed into a vector of 0's and 1's. Each vector defines a state for that transaction/observation. Vector states can be organized into a state occurrence matrix, from which a measure of event dependency can be computed. A new matrix, called a state linkage matrix, can then be formed; it consists of the measures of the event dependency represented by a row and a column in the state occurrence matrix. The values of the state linkage matrix can be used to find the measures of more complex relationships. In turn, complex relationships can define rule systems, where the measure of a rule system is the value of the complex rule from which the rule system was derived. Rule systems can be more easily implemented than the complex rule.

A significant step in the above process is the Booleanization step. It is this step that most directly affects the statistical significance and strength of the rules generated by the algorithm. In the past, threshold sets were determined by an expert in the domain in which the data analysis was being done. There is a basis for automating the formation of the threshold set. Rules generated by the Boolean Analyzer algorithm using an expert threshold set were compared with rules generated using threshold sets formed from the mean, mode, median, and clustering of variable values. The comparison led to the conclusion that threshold automation using mean and median values is a valid alternative to threshold formation by an expert. Subsequent analysis using mean and median thresholds on synthetic data with known relationships produced excellent results.
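The Booleanization step with automated mean and median thresholds can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the sample data, and the convention that a value strictly above its threshold maps to 1 are all assumptions made for illustration. The sketch stops at tallying state occurrences, because the abstract does not define the event dependency measure.

```python
from collections import Counter
from statistics import mean, median


def automated_thresholds(rows, statistic):
    """Build a threshold set by applying `statistic` to each variable (column)."""
    columns = zip(*rows)
    return [statistic(col) for col in columns]


def booleanize(rows, thresholds):
    """Turn each observation into a vector of 0's and 1's.

    Assumed convention: 1 if the value is strictly above its threshold, else 0.
    """
    return [tuple(int(v > t) for v, t in zip(row, thresholds)) for row in rows]


def state_occurrences(states):
    """Count how often each Boolean state (vector) occurs."""
    return Counter(states)


# Hypothetical observations of two variables.
rows = [
    (1.0, 10.0),
    (2.0, 20.0),
    (3.0, 30.0),
    (10.0, 40.0),
]

mean_thresholds = automated_thresholds(rows, mean)      # [4.0, 25.0]
median_thresholds = automated_thresholds(rows, median)  # [2.5, 25.0]

mean_states = booleanize(rows, mean_thresholds)
median_states = booleanize(rows, median_thresholds)

print(state_occurrences(mean_states))
print(state_occurrences(median_states))
```

In this made-up sample the outlier 10.0 pulls the mean of the first variable up to 4.0, so the third observation gets the state (0, 1) under the mean threshold but (1, 1) under the median threshold. The choice of threshold statistic therefore changes the state counts on which any later rule measure is based.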
Keywords/Search Tags: Data, Relationships, Threshold, Using, Statistical