Study On Feature Extraction And Text Representation Technology In Topic Tracking

Posted on:2006-03-15

Degree:Master

Type:Thesis

Country:China

Candidate:H Z Wang

Full Text:PDF

GTID:2168360155458193

Subject:Computer application technology

Abstract/Summary:

With the emergence and spread of the Internet, the amount of available information has grown explosively. Under these circumstances, it is hard for people to find the information they are interested in quickly and accurately. Moreover, information relevant to a topic is usually scattered across different times and places, so existing technology does not give a comprehensive view of an event. Topic detection and tracking (TDT) technology is intended to meet this need. The initial motivation for TDT research was to provide a core technology for an envisioned system that would monitor broadcast news and alert an analyst to new and interesting events happening in the world. Topic tracking is a subtask of TDT. It aims to monitor a stream of news stories and find additional stories on a topic that is identified by several sample stories.

Based on the characteristics of the topic tracking task, we study feature extraction and text representation for it. We study feature extraction at different levels and present two feature extraction methods: word pairs and word clusters. In most research on topic tracking, texts are represented as a "bag of words". In this thesis, we take part of speech into consideration and propose a representation that uses word pairs as features (BOP). We use a unigram model and a vector space model to perform topic tracking, with the TDT3 corpus as the test corpus. Experimental results show that, in the tracking system we selected, using word pairs as text features does not improve performance. We also introduce the k-means clustering technique and use word clusters as text features (BOC). Experimental results show that using word clusters as text features greatly reduces the feature dimension and thus greatly improves the efficiency of the tracking system.

From observation of the stories, we propose a double-vector model. Text is represented by two vectors using named entity recognition technology. While tracking stories, we compute the similarity for each vector and obtain the final score as a weighted sum of the two similarities. The tracking system makes its judgment according to this score. In order to better remove noisy data, we choose the TDT4 corpus as the test corpus. Experimental results show that the double-vector model can improve the performance of topic tracking, and that the use of a stop part-of-speech set also helps to improve system performance greatly.
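
The scoring step of the double-vector model can be illustrated with a minimal sketch. The abstract does not specify how the two vectors are built, which similarity measure is used, or what the weight and decision threshold are. The version below therefore assumes one vector of named-entity terms and one of the remaining terms, cosine similarity, and the hypothetical parameters `weight` and `threshold`. It is not the thesis's actual implementation.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


def double_vector_score(topic_ne: Counter, topic_other: Counter,
                        story_ne: Counter, story_other: Counter,
                        weight: float = 0.5) -> float:
    """Weighted sum of the two per-vector similarities.

    Assumption: `*_ne` holds named-entity terms and `*_other` holds the
    remaining terms; `weight` (hypothetical) is the share given to the
    named-entity similarity.
    """
    sim_ne = cosine(topic_ne, story_ne)
    sim_other = cosine(topic_other, story_other)
    return weight * sim_ne + (1.0 - weight) * sim_other


def is_on_topic(score: float, threshold: float = 0.3) -> bool:
    """Tracking decision: accept the story if the score reaches the threshold.

    The threshold value is a placeholder, not a figure from the thesis.
    """
    return score >= threshold
```

For example, a story whose named-entity vector matches the topic exactly but whose other terms are disjoint scores exactly `weight` under these assumptions, so the weight directly controls how much named entities alone can carry a tracking decision.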

Keywords/Search Tags:

topic tracking, word pair, word cluster, double-vector model, stop part of speech set

Related items

1	Improving Word Vector Model With Part-of-Speech And Dependency Grammar Information
2	Research On Chinese Part-of-speech Tagging Based On Semi Hidden Markov Model
3	Chinese Word Found Its Part Of Speech Tagging
4	Research On Enhanced Word Embedding Learning Model With Fusion Of Part-of-Speech And Position Information
5	Automatic Topic Labelling Based On Word Vectors
6	The Effect Of Part Of Speech On Chinese Word Segmentation
7	Research On Sentence Alignment Based On Word Pair And Word Dictionary
8	Study On Disambiguation Algorithm For Chinese Word Segmentation
9	Topic Model For Short Texts Based On Word Triangles
10	Research On Uyghur Recognition Technology Based On Word Part