Data mining techniques for structured and semistructured data

Posted on: 2001-12-27
Degree: Ph.D.
Type: Thesis
University: Stanford University
Candidate: Nestorov, Svetlozar Evtimov
Full Text: PDF
GTID: 2468390014458288
Subject: Computer Science
Abstract/Summary:
Data mining is the application of sophisticated analysis to large amounts of data in order to discover new knowledge in the form of patterns, trends, and associations. With the advent of the World Wide Web, the amount of data stored and accessible electronically has grown tremendously, and the process of knowledge discovery (data mining) from this data has become very important for the business and scientific-research communities alike.

This doctoral thesis introduces Query Flocks, a general framework over relational data that enables the declarative formulation, systematic optimization, and efficient processing of a large class of mining queries. In Query Flocks, each mining problem is expressed as a datalog query with parameters and a filter condition. In the optimization phase, a query flock is transformed into a sequence of simpler queries that can be executed efficiently. As a proof of concept, Query Flocks have been integrated with a conventional database system, and the thesis reports on the architectural issues and performance results.

While the Query Flocks framework is well suited for relational data, it has limited use for semistructured data, i.e., nested data with implicit and/or irregular structure, such as web pages. The lack of an explicit fixed schema makes semistructured data easy to generate or extract but hard to browse and query. This thesis presents methods for structure discovery in semistructured data that alleviate this problem. The discovered structure can be of varying precision and complexity. The thesis introduces an algorithm for deriving a schema-by-example and an algorithm for extracting an approximate schema in the form of a datalog program.
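
To make the Query Flocks idea in the abstract concrete, the sketch below shows a mining problem as a parameterized query plus a filter condition, and then as a sequence of simpler queries. It is a minimal illustration under stated assumptions, not the thesis's implementation: the market-basket example, the relation `baskets(basket_id, item)`, the sample data, and the function names `naive_flock` and `staged_flock` are all hypothetical. The flock it assumes is `answer(B) :- baskets(B, $1), baskets(B, $2)` with the filter `COUNT(answer.B) >= min_support`, and the staged version uses an a-priori-style pre-filter as one possible form of the rewriting the abstract describes.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical relation baskets(basket_id, item); the rows are made up.
BASKETS = [
    (1, "beer"), (1, "diapers"), (1, "milk"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
    (4, "diapers"), (4, "milk"),
    (5, "beer"), (5, "diapers"), (5, "bread"),
]


def naive_flock(baskets, min_support):
    """Evaluate the flock directly: count baskets for every pair ($1, $2)."""
    items_by_basket = defaultdict(set)
    for basket_id, item in baskets:
        items_by_basket[basket_id].add(item)

    support = defaultdict(int)
    for items in items_by_basket.values():
        for pair in combinations(sorted(items), 2):
            support[pair] += 1

    # Filter condition: keep parameter pairs with enough supporting baskets.
    return {pair for pair, count in support.items() if count >= min_support}


def staged_flock(baskets, min_support):
    """Evaluate the same flock as a sequence of simpler queries."""
    baskets_by_item = defaultdict(set)
    for basket_id, item in baskets:
        baskets_by_item[item].add(basket_id)

    # Simpler query 1: single items that pass the filter on their own.
    frequent = sorted(
        item for item, ids in baskets_by_item.items() if len(ids) >= min_support
    )

    # Simpler query 2: test only pairs built from the surviving items.
    return {
        (a, b)
        for a, b in combinations(frequent, 2)
        if len(baskets_by_item[a] & baskets_by_item[b]) >= min_support
    }


if __name__ == "__main__":
    print(naive_flock(BASKETS, 3))
    print(staged_flock(BASKETS, 3))
```

Both functions return the same set of item pairs. The staged version can discard an infrequent item before any pair containing it is examined, because a pair cannot appear in more baskets than either of its items does.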
Keywords/Search Tags:Data, Mining