Real-Time Query Systems for Complex Data Sources

Posted on: 2012-10-10
Degree: Ph.D.
Type: Thesis
University: Harvard University
Candidate: Rose, Ian Thomas
Full Text: PDF
GTID: 2468390011458448
Subject: Computer Science
Abstract/Summary:
This dissertation presents techniques for building scalable systems that allow real-time querying of complex data sources. In recent years, networking and sensing advances have dramatically increased the volume of information available to data consumers. However, coping with large scales and high data rates often requires processing data in real time, as it arrives, rather than storing it for later analysis. Our thesis is that by including the data acquisition process in the overall system design, it is possible to build scalable, real-time stream processing systems for complex data sources.

We have built two systems to demonstrate a number of unique design features required for scalable operation in our chosen domains. Cobra is a system that taps online RSS feeds (such as blogs, news articles and websites' user comments) as its data source. Cobra repeatedly crawls a set of RSS feeds, matching the contents to keyword-based user queries, similar to those used in Web search engines. As RSS-based content can change frequently, the design ensures that the latency between crawls is low, while still scaling to a large number of RSS feeds and many concurrent user queries.

Secondly, Argos is a system for widely-distributed, outdoor wireless network monitoring. Capturing 802.11 WiFi traffic across a large urban area, Argos enables a wide range of user queries, such as mobile node tracking, malware detection, and traffic characterization. Use of a wireless mesh network to connect the deployed sniffer nodes introduces additional challenges due to its limited bandwidth capacity. To address this restriction, we designed a novel in-network packet merging process and demonstrate its bandwidth savings. Additionally, Argos provides a variety of channel management schemes; 802.11 defines up to 14 radio channels but each sniffer can only capture from one channel at a time, necessitating policies for when to capture from which channel.

These systems are built around three design principles that aid in the real-time querying of complex data sources: query interfaces tailored to the application's specific data types, optimized data collection processes, and allowing queries to provide feedback to the collection process.
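
The keyword-based matching described for Cobra can be illustrated with a small sketch. This is not Cobra's actual implementation, which the abstract does not detail; it is a minimal example assuming that a query is a set of keywords and that it matches a feed item when every keyword appears in the item's text. The class name `KeywordMatcher`, its methods, and the query ids are hypothetical. An inverted index from keyword to query ids means that each incoming item is checked only against queries that share at least one word with it, rather than against every registered query.

```python
import re
from collections import defaultdict


class KeywordMatcher:
    """Illustrative matcher (hypothetical, not Cobra's design):
    a query matches an item when all of its keywords occur in the item."""

    def __init__(self):
        self.queries = {}               # query id -> set of lowercase keywords
        self.index = defaultdict(set)   # keyword -> ids of queries containing it

    def add_query(self, query_id, keywords):
        words = {w.lower() for w in keywords}
        self.queries[query_id] = words
        for w in words:
            self.index[w].add(query_id)

    def match(self, text):
        # Tokenize the item into a set of lowercase words.
        tokens = set(re.findall(r"\w+", text.lower()))
        # Collect only the queries that share at least one keyword with the item.
        candidates = set()
        for t in tokens:
            candidates |= self.index.get(t, set())
        # Keep the candidates whose keywords are all present in the item.
        return sorted(q for q in candidates if self.queries[q] <= tokens)


matcher = KeywordMatcher()
matcher.add_query("q1", ["wireless", "mesh"])
matcher.add_query("q2", ["rss"])
print(matcher.match("New wireless mesh deployment announced"))
```

In a streaming setting, `match` would be called once per newly crawled item, so the per-item cost depends on the item's length and the number of candidate queries, not on the total number of registered queries.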
Keywords/Search Tags: Data, Systems, Real-time, RSS feeds, Queries