
Study On Feature Extraction And Text Representation Technology In Topic Tracking

Posted on: 2006-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Wang
Full Text: PDF
GTID: 2168360155458193
Subject: Computer application technology
Abstract/Summary:
With the emergence and spread of the Internet, the amount of available information has grown explosively. Under these circumstances, people find it hard to obtain the information they are interested in quickly and accurately. Moreover, information relevant to a topic is usually scattered across different times and places, so present technology does not give a comprehensive view of an event. Topic detection and tracking (TDT) technology is meant to meet this need. The initial motivation for research in TDT was to provide a core technology for an envisioned system that would monitor broadcast news and alert an analyst to new and interesting events happening in the world. Topic tracking is a subtask of TDT. It aims to monitor a stream of news stories to find additional stories on a topic that is identified by several sample stories.
Based on the characteristics of the topic tracking task, we study feature extraction and text representation technology for it. We study feature extraction methods at different levels and present two of them: word pairs and word clusters. In most research on topic tracking, texts are represented as a "bag of words". In this paper, we take part of speech into consideration and propose a representation method that uses word pairs as features (BOP). We use a unigram model and a vector space model to perform topic tracking, with the TDT3 corpus as the test corpus. Experimental results show that, in the tracking system we selected, using word pairs as text features does not improve performance. We also introduce the k-means clustering technique and use word clusters as text features (BOC). Experimental results show that using word clusters as text features greatly reduces the feature dimension and thus greatly improves the efficiency of the tracking system.
Through observation of stories, we propose a double-vector model. A text is represented by two vectors with the help of named entity recognition technology. When tracking stories, we compute the similarity for each vector and obtain the final score as a weighted sum of the two similarities. The tracking system makes its judgment according to this score. In order to better remove noisy data, we choose the TDT4 corpus as the test corpus. Experimental results show that the double-vector model improves the performance of topic tracking, and that the use of a stop part-of-speech set also greatly improves system performance.
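The scoring step of the double-vector model can be illustrated with a minimal sketch. The abstract does not say how the two vectors are built, which similarity measure is used, or how the weight and decision threshold are set, so everything below is an assumption made for illustration: one vector holds named-entity terms and the other holds the remaining terms, similarity is cosine similarity over raw term counts, and the names `to_double_vector`, `double_vector_score`, `alpha`, and `threshold` are hypothetical. The `entities` set stands in for the output of a named entity recognizer.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (norm_a * norm_b)


def to_double_vector(tokens, entities):
    """Split a token list into (named-entity vector, other-terms vector).

    `entities` is a set of tokens assumed to have been marked as named
    entities by an external recognizer (not implemented here).
    """
    entity_vec = Counter(t for t in tokens if t in entities)
    other_vec = Counter(t for t in tokens if t not in entities)
    return entity_vec, other_vec


def double_vector_score(topic, story, alpha=0.5):
    """Weighted sum of the two per-vector similarities.

    `alpha` (assumed value) weights the named-entity similarity;
    1 - alpha weights the similarity of the remaining terms.
    """
    entity_sim = cosine(topic[0], story[0])
    other_sim = cosine(topic[1], story[1])
    return alpha * entity_sim + (1 - alpha) * other_sim


def on_topic(topic, story, alpha=0.5, threshold=0.3):
    """Decide whether a story is on topic; the threshold is an assumed value."""
    return double_vector_score(topic, story, alpha) >= threshold
```

In this sketch a story that shares a topic's named entities but none of its other terms still receives a score of `alpha`, which is one way the two vectors can contribute separately to the final judgment.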
Keywords/Search Tags: topic tracking, word pair, word cluster, double-vector model, stop part of speech set