A Research On Out-of-vocabulary Chinese Specific Term Identification

Posted on:2016-03-21

Degree:Master

Type:Thesis

Country:China

Candidate:Q Chen

Full Text:PDF

GTID:2308330461969667

Subject:Information Science

Abstract/Summary:

A term is a linguistic unit that represents a basic concept of a specific subject area and reflects the core knowledge of that field. As technology develops and the volume of literature grows, the terminology of every subject area keeps evolving. Collecting terms manually is no longer feasible, and automatic term acquisition by computer has become the trend.

Automatic term recognition must consider unithood (whether a string forms a stable unit) and termhood (how strongly it belongs to the domain) simultaneously, which makes it harder than unknown-word recognition in the general domain. In addition, low-frequency and bursty words may herald new research directions and focuses, so they are particularly important for tracking how a discipline develops and changes; yet they have not been studied enough because the task is difficult.

This thesis takes systems biology as an example and designs a method that combines rules and statistics to recognize unknown terms. After analysing the characteristics of terms in the corpus, we select six candidate features that describe the statistical and linguistic properties of terms and design two groups of CRF models: one takes the Chinese character as the basic feature and the other takes the Chinese word. These models are applied to the domain corpus. The study incrementally explores the effect of each feature and of feature combinations on term recognition, and proposes a more suitable method for discretizing feature values. The best-performing combination consists of the word feature, the POS feature, the word-length feature, the correlation feature, and the information entropy feature. Over five random tests, the model reaches a recall rate of 87.22%, an accuracy rate of 97.53%, and a discovery rate of 80.58%.

From the identification errors of the CRF model we derive two post-processing rules. After applying these rules, all evaluation indicators rise: the recall rate and accuracy rate reach 95.59% and 99.59%, and the discovery rate reaches 83.65%. A separate analysis of the results for low-frequency terms shows that the model also has some ability to find low-frequency terms.
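
The abstract names an information entropy feature and a discretization step but does not define either. As a rough illustration only, the sketch below assumes one common formulation: the Shannon entropy of the tokens adjacent to a candidate term (a diverse set of neighbors suggests the candidate is a complete unit), followed by fixed-threshold binning so the value can be used as a symbolic CRF feature. The function names, the threshold values, and the bin labels are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def neighbor_entropy(tokens, candidate, side="right"):
    """Shannon entropy (in bits) of the tokens adjacent to each occurrence
    of `candidate` in `tokens`. Both arguments are lists of tokens
    (characters or words). Assumed formulation, not the thesis's own."""
    n = len(candidate)
    neighbors = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == candidate:
            # Index of the neighbor on the requested side of this occurrence.
            j = i + n if side == "right" else i - 1
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def discretize(value, thresholds=(0.5, 1.0, 2.0)):
    """Map a continuous feature value to a bin label such as 'E0'..'E3'.
    The thresholds are illustrative placeholders."""
    for k, t in enumerate(thresholds):
        if value < t:
            return f"E{k}"
    return f"E{len(thresholds)}"


if __name__ == "__main__":
    tokens = list("abxabyabz")   # toy character sequence
    candidate = list("ab")
    h_right = neighbor_entropy(tokens, candidate, side="right")
    h_left = neighbor_entropy(tokens, candidate, side="left")
    print(discretize(h_right), discretize(h_left))
```

In a character-based model each character would carry the bin label of the candidate it belongs to, while a word-based model would attach it per word; how the thesis actually assigns and bins these values is not stated in the abstract.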

Keywords/Search Tags:

Out-of-vocabulary word identification, Conditional random fields model, Post-processing rule, Low-frequency word

Related items

1	Research Of Chinese Word Segmentation With Conditional Random Fields
2	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
3	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
4	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
5	Application Of Conditional Random Fields In Mongolian Word Segmentation
6	Research Of Named Entity Recognition Based On Conditional Random Fields
7	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
8	A Random Conditional Fields Based Method To Chinese Word Sense Disambiguation Research
9	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
10	Research Of Chinese Word Segmentation With Conditional Random Fields And Implementation