Automatic term extraction and document similarity in special text corpora

Posted on: 2003-02-07
Degree: M.C.Sc
Type: Thesis
University: Dalhousie University (Canada)
Candidate: Dong, Li
Full Text: PDF
GTID: 2468390011486570
Subject: Computer Science
Abstract/Summary:
The first objective of this thesis is to evaluate the performance of the C-value/NC-value method, a state-of-the-art method for automatic term extraction in special text corpora, on a corpus of computer science articles, and to compare the results with the method's published performance on a medical corpus. The C-value/NC-value method automatically extracts multi-word terms from special text corpora and can handle nested terms. Published experiments on a medical corpus show that it outperforms earlier automatic term extraction methods. The second objective of the thesis is to use the extracted terms as features for estimating the similarity of papers in the computer science corpus, using the standard Vector Space Model with TF-IDF weighting. The precision of this term-based method is evaluated and compared with that of the standard bag-of-words approach and of a link-based method, which estimates the similarity of two papers from the overlap of their local neighborhoods in the citation graph.
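The term-based similarity step can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes each paper has already been reduced to a list of extracted multi-word terms, uses raw term frequency with a plain log(N/df) inverse document frequency (one common TF-IDF variant), and compares papers by cosine similarity. The function names `tfidf_vectors` and `cosine` and the sample term lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from lists of extracted terms.

    docs: list of term lists, one per paper (assumed input format).
    Weight = raw term frequency * log(N / document frequency).
    """
    n = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in docs:
        tf = Counter(terms)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative data only: three "papers" as lists of extracted terms.
papers = [
    ["vector space model", "term extraction", "term extraction"],
    ["term extraction", "nested term", "citation graph"],
    ["citation graph", "link analysis"],
]
vecs = tfidf_vectors(papers)
sim_01 = cosine(vecs[0], vecs[1])  # share "term extraction"
sim_02 = cosine(vecs[0], vecs[2])  # share no terms
```

In this sketch, papers 0 and 1 receive a positive similarity because they share a term, while papers 0 and 2 score zero. A bag-of-words baseline would use the same code with single words in place of multi-word terms.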
Keywords/Search Tags:Automatic term extraction, Special text, Similarity, Method