Research On The Automatic Term Extraction In The Area Of Information Science

Posted on:2012-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:C Gu

GTID:2178330335463319

Subject:Information Science

Abstract/Summary:

Few studies of term extraction use the abstracts of papers as the corpus, even though abstracts, as summaries of papers, contain many terms from the subject field and are therefore a natural corpus for term extraction research. This paper studies term extraction from abstracts in Library and Information Science using Conditional Random Fields (CRFs) and the Mutual Information (MI) method.

The paper first introduces the research background, significance, foundations, and the structure of the paper, and then briefly summarizes the state of research on term extraction. Chapter two introduces concepts related to terms and some characteristics of terms, including domain features and structural features.

Chapter three analyzes, on the basis of statistical data, the representational features of terms, synonymous terms, and the words that precede and follow terms. The representational features include term frequency, part-of-speech sequences, and part-of-speech frequency. Synonymous terms are analyzed with the edit distance method. Preceding and following words are found by counting the words that occur before or after the terms. These statistics serve two purposes: they reveal the internal structure of terms, and they supply linguistic knowledge for research on term extraction.

The paper then studies term extraction with the Mutual Information method. It introduces the theory of MI and the preprocessing of the corpus. The study focuses on two-character and three-character words: it uses the MI formula to measure the internal association of these words, sets different thresholds, and tallies the results. Because the results of the first experiment were poor, the corpus was adjusted. After the adjustment, accuracy rose by a large margin, with the highest accuracy for two-character and three-character words reaching 58.555% and 58.814%, respectively. Although accuracy improved, the results are still unsatisfactory, which the paper attributes to the inherent limitations of the MI method.

Finally, the paper discusses term extraction with Conditional Random Fields. It first introduces the theory of CRFs, the preprocessing of the corpus, and the identification of features and the feature models. It then runs separate tests on a character-based corpus and a word-based corpus using a simple-feature model, a model with part-of-speech features, and a model with linguistic features. The average F-scores in the four tests reach 91.927%, 90.311%, 90.681%, and 90.6818%, respectively. These results indicate that CRFs outperform the MI model for term extraction.
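The abstract does not give the MI formula or the preprocessing used in the thesis, so the following is only a minimal sketch of the general idea of MI-based candidate filtering. It assumes the standard pointwise form, MI(x, y) = log2(P(x, y) / (P(x) P(y))), applied to adjacent two-unit sequences (single characters here, though tokens would work the same way). The function names `bigram_pmi` and `extract_candidates`, the `min_count` parameter, and the toy corpus are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=1):
    """Score every adjacent pair of units by pointwise mutual information.

    `sentences` is an iterable of strings (units are characters).
    Assumed formula: log2(P(x, y) / (P(x) * P(y))).
    """
    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        unigrams.update(s)               # count single units
        bigrams.update(zip(s, s[1:]))    # count adjacent pairs

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue                     # skip pairs that are too rare
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(sentences, threshold, min_count=1):
    """Keep pairs whose MI score is at or above the threshold."""
    scores = bigram_pmi(sentences, min_count)
    return sorted(pair for pair, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    toy_corpus = ["abab", "abcd"]        # illustrative data, not from the thesis
    for pair, score in sorted(bigram_pmi(toy_corpus).items()):
        print(pair, round(score, 3))
    print(extract_candidates(toy_corpus, threshold=1.0))
```

Raising `threshold` trades recall for precision, which is the kind of threshold sweep the abstract describes. Plain MI also tends to overrate rare pairs, which is one commonly cited limitation of the method; the `min_count` filter in this sketch is one simple way to dampen that effect and is not taken from the thesis.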

Keywords/Search Tags:

Term Extraction, CRFs, MI Model, Abstract of Papers

Related items

1	Based On The Same Field CRFs And Interdisciplinary Under Brand Word Extraction
2	Research And Implementation On The Technique Of Citation Labeling Based On CRFs Model
3	Abstract Sentence Classification And Frequent Pattern Mining For Scientific Papers Oriented To An English Writing Assistant System
4	Design Of Automatic Term Extraction System And Study Of Key Techniques
5	The Semantic Annotations Based On T-CRFs Model In The Application Of Intelligent Question Answering System Research
6	CCD-based Terminology Extraction Study
7	Domain Knowledge Acquisition
8	Research On Smart Search And Abstract Extraction Technologies Based On Large Database
9	The Study Of Automatic Chinese Term Extraction
10	A Study On The Chinese Term Extraction