Summarizing massive information for querying web sources and data streams

Posted on:2015-09-14

Degree:Ph.D

Type:Dissertation

University:University of California, Los Angeles

Candidate:Mousavi, Hamid

GTID:1478390020450659

Subject:Computer Science

Abstract/Summary:

Largely as a result of advances brought by the Web and related technologies, we are now experiencing tremendous growth in the volume of data streaming between, and stored at, many nodes of the Internet. This "Big Data" revolution underscores the importance of summarization in general, and in particular in two new application areas that are rich in practical significance and interesting research challenges. Indeed, while summarization techniques, including sampling, histograms, and quantiles, remain critical in analyzing large data sets and optimizing queries in traditional databases, new techniques are needed to address the following two problems. The first is that, in addition to summarization techniques for stored data, we now need online (continuous) summaries for streaming data, e.g., real-time online histograms. When dealing with massive data streams and fast-changing distributions, summaries should be updated quickly as new data arrive, so that they reflect the most recent portion (window) of the data stream. The second problem is that the Web stores large corpora of structured, semi-structured, and unstructured (free-text) documents, and these documents are subject to the ambiguities of natural language and the challenges those ambiguities pose to machine processing. This situation has so far severely limited the ability of smart applications to use the information contained in Web pages, as needed to realize the Semantic Web vision. It is clear, however, that many of these limitations can be overcome, and advanced search and analysis applications can be supported, if the knowledge in each Web page can be summarized into a standard machine-friendly structure. In this dissertation, we attack these two difficult problems by proposing fast summarization techniques for (i) scalar information in data streams and (ii) textual information in Web pages.
For scalar data, we present light and fast synopses, namely histograms, combined with various sampling approaches in order to implement more practical summarization techniques over massive data sets and data streams. To our knowledge, this approach provides the most accurate online histograms for data streams with sliding windows. For textual documents, we introduce several techniques and systems for extracting structured summaries from unstructured text, and we use these extracted summaries to complete existing structured summaries as well as to improve their consistency.
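
To make the idea of an online histogram over a sliding window concrete, the following is a minimal sketch and not the dissertation's method. It assumes a fixed-size, count-based window and equi-width buckets over a known value range; the class name `SlidingWindowHistogram` and its parameters are hypothetical. It also keeps every value of the window in memory, whereas the abstract describes combining histograms with sampling to obtain lighter synopses.

```python
from collections import deque


class SlidingWindowHistogram:
    """Equi-width histogram over the most recent `window` stream values.

    Illustrative only: exact counts, whole window kept in memory.
    """

    def __init__(self, lo, hi, buckets, window):
        self.lo, self.hi = lo, hi
        self.buckets = buckets
        self.window = window
        self.counts = [0] * buckets
        self.recent = deque()  # bucket index of each value still in the window

    def _bucket(self, x):
        # Map x to a bucket; out-of-range values go to the first or last bucket.
        width = (self.hi - self.lo) / self.buckets
        i = int((x - self.lo) / width)
        return min(max(i, 0), self.buckets - 1)

    def add(self, x):
        # Count the new arrival, then expire the oldest value if the
        # window has grown past its fixed size.
        b = self._bucket(x)
        self.recent.append(b)
        self.counts[b] += 1
        if len(self.recent) > self.window:
            expired = self.recent.popleft()
            self.counts[expired] -= 1


hist = SlidingWindowHistogram(lo=0.0, hi=10.0, buckets=5, window=4)
for value in [1.0, 3.0, 9.5, 2.5, 7.0, 8.0]:
    hist.add(value)
print(hist.counts)  # [0, 1, 0, 1, 2]
```

After six arrivals only the last four values (9.5, 2.5, 7.0, 8.0) remain in the window, so the counts reflect the most recent portion of the stream rather than its full history. Each update costs constant time, but memory grows with the window size, which is the cost that sampling-based synopses aim to reduce.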

Keywords/Search Tags:

Data, Web, Information, Summarization techniques, Massive, Summaries

Related items

1	A Nearest-Neighbor Approach to Indicative Web Summarization
2	Multimedia summarization and personalization of structured video
3	Research On Key Technologies Of Automatic Summarization Of Chinese News Documents
4	Extractive speech summarization using structural modeling
5	Research And Implementation Of Massive Data Disaster Recovery Based On Data Summarisation Technology
6	Automatic Summarization Of Multimedia Information And Related Technology Research
7	Machine learning in automatic text summarization: From extracting to abstracting
8	Research On Techniques Of Graph-Based Abstractive Text Summarization
9	Semantic Analysis for Improved Multi-document Summarization of Text
10	Research On Big Data Graphs Summarization Theory And Applications In The Cloud Environment