Knowledge discovery systems for large-scale spatial, time, sequence, and unstructured data

Posted on: 2002-09-25   Degree: Ph.D.   Type: Dissertation
University: Washington State University   Candidate: Vucetic, Slobodan   Full Text: PDF
GTID: 1468390011496683   Subject: Engineering
Abstract/Summary:
This dissertation is composed of eight manuscripts, published or under review at journals and conferences, that propose knowledge discovery systems for large-scale spatial, time, sequence, and unstructured data. Chapters 2 and 3 propose general knowledge discovery methods that address efficient learning on large data sets and learning on data with a biased class distribution. Experiments in Chapter 2 show that data size can be reduced by several orders of magnitude at the cost of a small loss in accuracy. Experiments in Chapter 3 show that the proposed methodology can significantly improve classification on unlabeled data. Chapters 4–6 address knowledge discovery in spatial data. Chapter 4 compares aggregate soil sampling with traditional point sampling with respect to the quality of spatial estimation; the analysis of point and block sampling techniques shows that, for the same sampling density, block sampling provides better estimation. Chapter 5 shows that the proposed simple spatial data partitioning scheme leads to faster learning and better generalization. Chapter 6 proposes a supervised machine learning algorithm for the analysis of heterogeneous spatial data that partitions the data set into more homogeneous regions through competition among regression models. The results provide strong evidence that homogeneous regions can be identified with high accuracy using this approach. The algorithm from Chapter 6 is modified in Chapter 7 for regime discovery in nonstationary time series and in Chapter 8 for the discovery of disorder flavors in protein sequences. Chapter 7 shows that a number of regimes with characteristic behavior existed in the price time series of the deregulated California market. The results in Chapter 8 provide strong evidence that at least three flavors exist among the disordered regions of the 145 proteins examined.
Finally, Chapter 9 proposes a novel regression-based approach to collaborative filtering that achieves improved accuracy and is orders of magnitude faster than the popular neighbor-based alternative.
Keywords/Search Tags: Data, Knowledge discovery, Spatial, Proposed, Time, Chapter