Research On The Automatic Term Extraction In The Area Of Information Science

Posted on: 2012-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: C Gu
Full Text: PDF
GTID: 2178330335463319
Subject: Information Science
Abstract/Summary:
Few studies on term extraction take the abstracts of papers as the corpus. Yet abstracts, as summaries of papers, contain many terms from the subject field, so they should be used as a corpus in the study of term extraction. This paper therefore studies term extraction from abstracts in Library and Information Science using Conditional Random Fields (CRFs) and the Mutual Information (MI) method. The paper first introduces the research background, significance, and foundations, together with the structure of the paper, and then briefly summarizes the state of research on term extraction. Chapter two introduces concepts related to terms and some of their characteristics, including the field feature, the structure feature, and so on. Chapter three analyzes, based on statistical data, the representational features of terms, synonymous terms, and the words that precede and follow terms. The representational features include term frequency, part-of-speech sequence, and part-of-speech frequency. Synonymous terms are analyzed using the edit distance method. The preceding and following words are found by counting the words that occur immediately before or after terms. These statistics help reveal the internal makeup of terms, and they also supply linguistic knowledge for research on term extraction. The paper then studies term extraction using the Mutual Information method, introducing the theory of MI and the preprocessing of the corpus. The study mainly examines two-letter and three-letter words: it uses the MI formula to calculate the internal association of these words, sets different thresholds, and counts the results. Because the results of the first experiment are not very good, the paper adjusts the corpus.
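The MI step described above can be illustrated with a minimal sketch. The abstract does not give the exact formula or preprocessing, so this assumes the common pointwise mutual information definition, log2(p(xy) / (p(x) p(y))), and reads "two-letter word" as a pair of adjacent characters. The function names `bigram_pmi` and `extract_candidates` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {bigram: PMI} for every pair of adjacent characters.

    Assumption: PMI(x, y) = log2(p(xy) / (p(x) * p(y))), with probabilities
    estimated from raw counts. The thesis may use a different variant.
    """
    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        unigrams.update(s)  # count single characters
        bigrams.update(s[i:i + 2] for i in range(len(s) - 1))  # adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        scores[bg] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(sentences, threshold):
    """Keep the bigrams whose PMI is at least `threshold`, sorted by text."""
    scores = bigram_pmi(sentences)
    return sorted(bg for bg, score in scores.items() if score >= threshold)
```

Calling `extract_candidates(corpus, t)` for several values of `t` and comparing the output against a manually annotated term list would mirror the "different thresholds" procedure; three-letter words would need an extended association measure that is not shown here.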
After the corpus is adjusted, the accuracy rates increase by a large margin: the highest rates for two-letter and three-letter words reach 58.555% and 58.814%, respectively. Although accuracy has improved, the results are still not very good, which is attributed to the inherent limits of the MI method. Finally, the paper discusses term extraction using Conditional Random Fields. First, it introduces the theory of CRFs, the preprocessing of the corpus, the identification of features, and the feature models. Second, it runs separate tests on the letter-based corpus and the word-based corpus with the simple-feature model, the model with part of speech, and the model with linguistic features. The average F-scores in the four tests reach 91.927%, 90.311%, 90.681%, and 90.6818%, respectively. These results indicate that CRFs perform better than the MI model in term extraction.
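For readers unfamiliar with the F-score reported above, the following sketch shows one common way to compute it. The abstract does not state how the scores were obtained, so this assumes the balanced F1 measure over sets of extracted term strings; a CRF evaluation may instead count matches at the tag or span level. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and balanced F1 over two sets of terms.

    Assumption: a predicted term counts as correct only if the identical
    string appears in the gold (manually annotated) term set.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, so it is not directly comparable with the accuracy rates quoted for the MI experiments unless both are computed on the same basis.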
Keywords/Search Tags: Term Extraction, CRFs, MI Model, Abstract of Papers