Research on Topic Detection

Posted on: 2010-02-07  Degree: Master  Type: Thesis
Country: China  Candidate: K X Le  Full Text: PDF
GTID: 2178360278452473  Subject: Computer Science and Technology
Abstract/Summary:
Topic detection aims to find new topics, and news reports relevant to existing topics, in a large and varied stream of reports. Most detection strategies are based on text clustering: they represent news reports and topics with the vector space model, calculate the similarities between them, and cluster the reports with a chosen algorithm.

A coding index converts between keywords and codes by means of an index algorithm. The single-value index is the simplest kind: each keyword maps to exactly one value. Sequential array indexes, search tree indexes, and hash indexes are three types of single-value index that perform well at searching.

This thesis proposes three coding index algorithms to optimize topic detection: the quasi-dynamic array index, the internal chaining hash bucket index, and the bidirectional chaining coalesced hash index.

The quasi-dynamic array index combines the search performance of an array index with the update performance of a binary tree. Following a quasi-dynamic strategy, it periodically merges a binary tree update index into the primary array index, which addresses the low update performance of the sequential array index.

The internal chaining hash bucket index implements separate chaining by building the chains inside the hash buckets, which avoids frequent memory allocation. A pointer is encoded as the combination of a bucket number and a cell number, which compresses the pointer to three bytes.

The bidirectional chaining coalesced hash index builds on the coalesced hash index by creating bidirectional chains and maintaining a chain of empty cells, which allows keywords to be moved quickly when a collision occurs. It keeps update cost at constant complexity under an open-addressing storage strategy, even as the load factor approaches one.

A topic detection system using these three coding index algorithms was built for the experiments. In topic detection tests on the Chinese corpus of TDT2004, the internal chaining hash bucket index achieved the best performance, outperforming topic detection without a coding index by a factor of nearly 20.
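
The quasi-dynamic array index described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class name `QuasiDynamicIndex`, the `merge_threshold` parameter, the use of a plain unbalanced binary search tree as the update index, and the rule of merging after a fixed number of insertions are all assumptions made for illustration. The sketch only shows the general idea of searching a sorted primary array, buffering new keywords in a tree, and periodically merging the tree into the array.

```python
import bisect
import heapq


class QuasiDynamicIndex:
    """Keyword -> integer code index: sorted primary array plus a small update tree."""

    def __init__(self, merge_threshold=4):
        self.keys = []        # primary array index: keywords in sorted order
        self.codes = []       # codes[i] is the code assigned to keys[i]
        self.root = None      # update tree; a node is [key, code, left, right]
        self.pending = 0      # number of keywords currently held in the tree
        self.next_code = 0
        self.merge_threshold = merge_threshold  # hypothetical tuning knob

    def lookup(self, word):
        """Return the code for word, or None if the word is unknown."""
        i = bisect.bisect_left(self.keys, word)   # binary search in the array
        if i < len(self.keys) and self.keys[i] == word:
            return self.codes[i]
        node = self.root                          # fall back to the update tree
        while node is not None:
            if word == node[0]:
                return node[1]
            node = node[2] if word < node[0] else node[3]
        return None

    def encode(self, word):
        """Return the existing code for word, assigning a new one if needed."""
        code = self.lookup(word)
        if code is not None:
            return code
        code = self.next_code
        self.next_code += 1
        self._tree_insert(word, code)
        self.pending += 1
        if self.pending >= self.merge_threshold:
            self._merge()
        return code

    def _tree_insert(self, word, code):
        new = [word, code, None, None]
        if self.root is None:
            self.root = new
            return
        node = self.root
        while True:
            slot = 2 if word < node[0] else 3     # 2 = left child, 3 = right child
            if node[slot] is None:
                node[slot] = new
                return
            node = node[slot]

    def _merge(self):
        """Fold the update tree into the primary array and empty the tree."""
        pairs, stack, node = [], [], self.root
        while stack or node is not None:          # iterative in-order traversal
            while node is not None:
                stack.append(node)
                node = node[2]
            node = stack.pop()
            pairs.append((node[0], node[1]))
            node = node[3]
        # Both inputs are sorted by keyword, so a linear merge is enough.
        merged = list(heapq.merge(zip(self.keys, self.codes), pairs))
        self.keys = [k for k, _ in merged]
        self.codes = [c for _, c in merged]
        self.root = None
        self.pending = 0


idx = QuasiDynamicIndex(merge_threshold=3)
result = [idx.encode(w) for w in ["topic", "news", "cluster", "topic", "hash"]]
```

In this sketch the third new keyword triggers a merge, so the repeated `"topic"` is then found by binary search in the primary array, while `"hash"` stays in the update tree until the next merge. A larger `merge_threshold` would merge less often at the cost of a deeper tree.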
Keywords/Search Tags: Topic Detection, Vector Space Model, Text Clustering, Coding Index, Hash Function, Quasi-Dynamic Array Index, Internal Chaining Hash Bucket Index, Bidirectional Chaining Coalesced Hash Index