Automatic term extraction and document similarity in special text corpora

Posted on:2003-02-07

Degree:M.C.Sc

Type:Thesis

University:Dalhousie University (Canada)

Candidate:Dong, Li

Full Text:PDF

GTID:2468390011486570

Subject:Computer Science

Abstract/Summary:

The first objective of this thesis is to evaluate the C-value/NC-value method, a state-of-the-art method for automatic term extraction from special text corpora, on a corpus of computer science articles and to compare the results with the method's published performance on a medical corpus. The C-value/NC-value method automatically extracts multi-word terms from special text corpora and can handle nested terms. It has been shown experimentally to outperform previously published automatic term extraction methods on a medical corpus. The second objective is to use the extracted terms as features for estimating the similarity of papers in the computer science corpus, using the standard Vector Space Model with TF-IDF weighting. The precision of the term-based method is evaluated and compared with the standard bag-of-words approach and with a link-based method, which estimates the similarity of papers from the overlap of their local neighborhoods in the citation graph.
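The term-based similarity step described in the abstract can be illustrated with a minimal sketch. It assumes each paper has already been reduced to a list of extracted multi-word terms, and it uses the plain tf × log(N/df) weighting with cosine similarity; the abstract does not state which TF-IDF variant the thesis uses, so that choice, the function names, and the toy term lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    docs is a list of documents, each a list of term strings.
    Weighting is tf * log(N / df), an assumed variant of TF-IDF.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical extracted terms for three papers (not data from the thesis).
papers = [
    ["vector space model", "term extraction", "term extraction"],
    ["term extraction", "nested term", "citation graph"],
    ["citation graph", "link analysis"],
]

vecs = tfidf_vectors(papers)
sim_01 = cosine(vecs[0], vecs[1])  # papers 0 and 1 share "term extraction"
sim_02 = cosine(vecs[0], vecs[2])  # papers 0 and 2 share no terms
```

With these toy inputs, the first two papers receive a positive similarity because they share a term, while the first and third receive zero. A bag-of-words baseline would follow the same code path with single words in place of multi-word terms.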

Keywords/Search Tags:

Automatic term extraction, Special text, Similarity, Method

Related items

1	An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measure
2	Design And Implementation Of A Automatic Scoring System Based On Text Similarity For Subjective Questions
3	Research On Text Similarity Calculation Method And Its Application In Financial Field
4	Design Of Automatic Term Extraction System And Study Of Key Techniques
5	The Study Of Automatic Chinese Term Extraction
6	Research On The Generation Of Automatic Summarization In Chinese From Web
7	Research On Automatic Extraction Of Chinese Terms
8	A Method For Text Similarity Measurement With TF-IDF And Word Semantic Information
9	Research On Automatic Scoring Method Of Short Text Subjective Questions Based On Text Similarity
10	The Research And Implementation Of Text Similarity Calculation Based On Feature Extraction