
Temporal Analysis of Topics in Time-Stamped Document Sets

Posted on: 2012-01-14
Degree: Ph.D
Type: Dissertation
University: University of Nebraska at Omaha
Candidate: Chen, Wei
Full Text: PDF
GTID: 1468390011461918
Subject: Information Technology
Abstract/Summary:
Extracting interesting information from large unstructured document sets is a challenging task. In this dissertation, I propose a systematic approach to extracting temporal features/patterns from large time-stamped document sets. These temporal features are single or multiple time intervals that contain an unusual or high amount of information related to the user's requests. The user's requests are represented by topics. A topic can be a basic topic, consisting of a simple list of weighted keywords, or a complex topic, built from basic topics using logical relationships such as and, or, and not. A presence measure of a topic, based on fuzzy set theory, is introduced to compute the amount of information related to the topic in the document set. I also introduce the notion of a topic DAG to facilitate efficient computation of the presence measures of complex topics. Segmentation is applied in this research to simplify and/or change the representation of the time intervals for further analysis. The discrepancy score, which originates from Scan Statistics, is used to measure the goodness of partitioning a time interval into two or more segments. A hot spot of a given topic is defined as the time interval with the highest discrepancy score when the whole time period is partitioned into two parts. Given a topic and a time-stamped document set, Min-different k segmentation is designed to capture a segmentation with up to k segments and the maximum discrepancy score. This segmentation helps to extract multiple significant time intervals that contain an unusually high amount of information related to a given topic from the time-stamped document set. I define the top h hot areas for a given topic as intervals of this kind and describe efficient methods for extracting them. The proposed methods are illustrated by experiments using the DBLP data and the TDT-Pilot corpus data. The experiments show that the proposed methods are very effective in extracting the temporal features of topics and highlight the meaningfulness of the contexts surrounding the topic in the extracted time intervals.
Keywords: Text Mining, Temporal Mining, Scan Statistics, Hot Spots, Segmentation, etc.
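The hot spot definition in the abstract can be illustrated with a minimal sketch. The abstract does not give the dissertation's exact formulas, so everything below is an assumption: the fuzzy operators are the standard min/max/complement ones, the discrepancy score is a Kullback-Leibler style form commonly used with scan statistics, and the function names (`fuzzy_and`, `fuzzy_or`, `fuzzy_not`, `discrepancy`, `hot_spot`) are hypothetical. The timeline is assumed to be pre-binned, with `presence[i]` holding the topic's presence in bin `i` and `baseline[i]` holding the number of documents in that bin.

```python
import math

# Assumption: standard min/max/complement operators for fuzzy presence values in [0, 1].
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

def discrepancy(p, q):
    """Score a two-part split, where p is the share of topic presence and q the
    share of baseline documents inside the interval (assumed KL-style form)."""
    if q <= 0.0 or q >= 1.0:
        return 0.0  # interval holds none or all of the baseline: no real split
    score = 0.0
    if p > 0.0:
        score += p * math.log(p / q)
    if p < 1.0:
        score += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return score

def hot_spot(presence, baseline):
    """Return (start, end, score) for the contiguous bin range [start, end]
    whose split from the rest of the timeline scores highest."""
    total_p, total_b = sum(presence), sum(baseline)
    if total_p <= 0 or total_b <= 0:
        raise ValueError("presence and baseline must have positive totals")
    n = len(presence)
    best = (0, 0, float("-inf"))
    for start in range(n):
        in_p = in_b = 0.0
        for end in range(start, n):
            in_p += presence[end]
            in_b += baseline[end]
            if start == 0 and end == n - 1:
                continue  # the whole timeline is not a partition into two parts
            score = discrepancy(in_p / total_p, in_b / total_b)
            if score > best[2]:
                best = (start, end, score)
    return best

# Presence of a complex topic "A and not B" in a single document (made-up values).
doc_presence = fuzzy_and(0.75, fuzzy_not(0.25))

# Six time bins with equal document counts and a burst of presence in bins 2 and 3.
start, end, score = hot_spot([1, 1, 8, 9, 1, 1], [10] * 6)
```

On this made-up input the search selects bins 2 through 3. The exhaustive search examines every contiguous interval, so it is quadratic in the number of bins; the efficient methods mentioned in the abstract are not reproduced here.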
Keywords/Search Tags:Topic, Time, Document set, Temporal, Segmentation, Information