
An Exploration Of Evolutionary Summarization For Web Information

Posted on: 2014-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Yan
GTID: 1228330392462191
Subject: Computer system architecture
Abstract/Summary:
Multi-Document Summarization has been an exciting and challenging field of joint Natural Language Processing (NLP) and Information Retrieval (IR) research for decades. Faced with the rapid information explosion on the World Wide Web, information seekers can hardly keep pace with the flood of new updates. News spreads throughout the Internet, and readers drown in a "sea" of overwhelming information, wondering where to start. As a result, news digestion becomes increasingly essential in Web content analysis.

For an evolving news topic, people may be interested in the beginning, the evolution, or the most up-to-date situation. However, traditional techniques are to some extent insufficient. General search engines simply rank news documents according to their understanding of query relevance, but they are not quite capable of handling queries with ambiguous intent. In many cases, even if the ranked documents are in a satisfying order, readers tire of navigating all the data in the massive collection: they would like to monitor the evolution trajectory of hot news by brief browsing. Summarization is an ideal way out of this dilemma, providing a condensed, informative reorganization of documents for a faster and better representation of news evolution. Our proposed Timeline temporally summarizes evolving news as a series of individual but correlated component summaries along the temporal dimension, and hence offers a way to understand the big picture of a developing situation.

To summarize, the contributions of this work include:

1. We propose a novel text segmentation method designed for news understanding. As we face a much larger corpus than in traditional summarization tasks, we conduct news pre-processing before summarization. We extract text snippets representing atomic "events" from news documents. News articles are not indivisible: they usually contain more than one event, where each event denotes an aspect of a news topic. Events within the same news document are sometimes independent of each other, so not all of them are equally relevant to the particular news topic. After the fine-grained event distilling procedure, we compress the corpus by discarding non-event descriptions and filtering out snippets that are not relevant to any of the topic words. The challenge of snippet extraction lies in the complicated discourse structure of natural language and in the use of rich event-oriented features, such as semantic (similarity, named entities, temporal distance), syntactic (conjunctions, sentence offsets), and layout elements, to identify segment boundaries.

2. We introduce a novel framework for the web mining service Evolutionary Timeline Summarization (ETS). Taking a news collection as input, the system automatically outputs a timeline whose items are component summaries representing evolutionary trajectories on specific dates. We propose two ways to solve the problem. The first is based on an optimized combination of a global biased ranking framework and a local biased ranking framework, with inter-date dependencies and intra-date dependencies respectively. In particular, the inter-date dependency calculation includes temporal decay to project sentences from all dates onto a time horizon. The second is a balanced optimization framework that works through iterative substitution from a set of sentences to a subset of sentences under constraints: the component summaries are not assumed to be completely isolated, because neighboring summaries are generated inter-dependently owing to the characteristics of news over time. We evaluate the quality of component summaries by two criteria: locally, i.e., based on temporally adjacent neighbors, and globally, i.e., based on the whole collection.

3. We propose an iterative reinforcement approach for the problem of Visual Timeline Summarization (VTS), and for the first time we introduce the concept of visual timelines. Given a massive collection of time-stamped web documents related to a general news subject, the system automatically outputs a visual timeline whose items are component summaries with texts and images as mutual descriptions. Component summaries, iteratively refined by global information, represent evolutionary trajectories across dates. Images, as hints for selecting summary sentences, alter the traditional way of textual summarization and are hence beneficial. For the VTS problem, we utilize two heterogeneous streams of content, whereas images have long been overlooked in summarization work. Moreover, as texts and images are heterogeneous sources, it is challenging to bridge the semantic gap between the two modalities. As component summaries have two parts, texts and images, the choice of images influences the text selection and vice versa. We propose an effective approach that uses mutual reinforcement to ensure that the texts and images within the generated timeline are appropriately matched, and we formulate the problem as a global-to-local scenario, i.e., using the global timeline summary to refine local component summaries iteratively.

4. We provide two possible extensions to incorporate into evolutionary timeline summarization. The first combines the general timeline with user personalization. Since users may be biased in what they prefer to read owing to their individual interests, a universal summary for all users is clearly not satisfactory. We therefore introduce a mechanism of Interactive Personalized Summarization (IPS), based on "click" and "examine" interactions between readers and contents. The human-system interaction supports clicking into sentences and examining source contexts as real-time pseudo feedback. The implicit clickthrough data indicates what users are interested in. User click data is often sparse, but we amplify these tiny hints of user interest by "click smoothing". The second direction is to incorporate the mass media focus from the online social network service Twitter, capturing the general interests of society as auxiliary information. The Twitter system is not simply a set of tweets: there are latent networks, including the following relationships among users and the retweeting linkage. The information on the mass focus should be popular and avoid redundancy. We utilize a unified co-ranking framework, i.e., ranking the vertices of tweets and Twitter users on the heterogeneous graph, which fuses popularity and diversity simultaneously in the random walk paradigm.
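To make the co-ranking idea in the last contribution more concrete, the following is a minimal sketch of a coupled random walk over tweets and users. It is not the formulation used in the dissertation, which the abstract does not spell out. The function names, the `coupling` and `damping` parameters, and the toy data are all assumptions made for illustration, and the sketch covers only the popularity side (the diversity component is omitted).

```python
def normalize(vec):
    """Rescale scores to sum to 1 (uniform if all scores are zero)."""
    total = sum(vec)
    if total <= 0:
        return [1.0 / len(vec)] * len(vec)
    return [v / total for v in vec]


def propagate(weights, scores, n_out):
    """Push each source's score along its outgoing edges, split by edge weight."""
    out = [0.0] * n_out
    for src, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0:
            continue  # no outgoing edges; lost mass is restored by normalize()
        for dst, w in enumerate(row):
            out[dst] += scores[src] * w / row_sum
    return out


def co_rank(tweet_links, user_links, authorship,
            coupling=0.5, damping=0.85, iters=100):
    """Hypothetical co-ranking sketch (illustrative, not the thesis method).

    tweet_links[i][j]: weight of the edge tweet i -> tweet j (e.g. i retweets j)
    user_links[i][j]:  weight of the edge user i -> user j (e.g. i follows j)
    authorship[t][u]:  1 if user u wrote tweet t, else 0
    """
    n_t, n_u = len(tweet_links), len(user_links)
    t = [1.0 / n_t] * n_t
    u = [1.0 / n_u] * n_u
    # Transpose authorship so users can pass score to the tweets they wrote.
    user_to_tweet = [[authorship[ti][ui] for ti in range(n_t)]
                     for ui in range(n_u)]
    for _ in range(iters):
        t_intra = propagate(tweet_links, t, n_t)    # tweet -> tweet walk
        u_intra = propagate(user_links, u, n_u)     # user -> user walk
        t_inter = propagate(user_to_tweet, u, n_t)  # user -> tweet coupling
        u_inter = propagate(authorship, t, n_u)     # tweet -> user coupling
        t_new = [(1 - damping) / n_t
                 + damping * ((1 - coupling) * a + coupling * b)
                 for a, b in zip(t_intra, t_inter)]
        u_new = [(1 - damping) / n_u
                 + damping * ((1 - coupling) * a + coupling * b)
                 for a, b in zip(u_intra, u_inter)]
        t, u = normalize(t_new), normalize(u_new)
    return t, u


# Toy data (made up): tweet 2 retweets tweet 0; user 1 follows user 0;
# user 0 wrote tweets 0 and 1, user 1 wrote tweet 2.
tweet_links = [[0, 0, 0], [0, 0, 0], [1, 0, 0]]
user_links = [[0, 0], [1, 0]]
authorship = [[1, 0], [1, 0], [0, 1]]
tweet_scores, user_scores = co_rank(tweet_links, user_links, authorship)
```

In this toy graph the retweeted tweet and the followed user end up with the highest scores, which is the mutual reinforcement a co-ranking scheme is meant to capture. A diversity term, for example a penalty on tweets similar to ones already ranked highly, would have to be added separately.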
Keywords/Search Tags: Evolutionary summarization, timeline, balanced optimization, mutual reinforcement, news digestion, snippet extraction