A Research On Out-of-vocabulary Chinese Specific Term Identification

Posted on: 2016-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: Q Chen
GTID: 2308330461969667
Subject: Information Science
Abstract/Summary:
A term is a linguistic unit that represents a basic concept of a specific subject area and reflects the core knowledge of that field. As technology develops and the volume of literature grows, the terminology of every subject area keeps evolving. Collecting terms manually is no longer feasible, and automatic term acquisition by computer has become the trend.

Automatic term recognition must consider unithood and termhood simultaneously, which makes it harder than unknown-word recognition in the general domain. At the same time, low-frequency words and bursty words may herald new research directions and research focuses, so they are particularly important for tracking how a discipline develops and changes. They have nevertheless received too little study because the task is difficult.

Taking the systems biology area as an example, this thesis designs a method that combines rules and statistics to recognize unknown terms. After analysing the characteristics of terms in the corpus, we select 6 candidate features that describe the statistical and linguistic properties of terms, and design two groups of CRF models: one takes Chinese characters as the basic feature and the other takes Chinese words as the basic feature. These models are applied to the domain corpus. The study incrementally explores the effect of each feature, and of combinations of features, on term recognition, and proposes a better-suited method for discretizing feature values. The feature combination with the best effect consists of the word feature, the POS feature, the word-length feature, the correlation feature and the information entropy feature. Over 5 random tests, the model's evaluation indicators are: a recall rate of 87.22%, an accuracy rate of 97.53%, and a discovery rate of 80.58%. Based on the recognition errors of the CRF model, we summarize two post-processing rules. After applying these rules, all evaluation indicators rise: the final recall rate and accuracy rate are 95.59% and 99.59%, and the discovery rate is 83.65%. Analysing the recognition results for low-frequency terms separately, we find that the model also has a certain ability to discover low-frequency terms.
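As a rough illustration of how an information entropy feature might be computed and discretized before being passed to a CRF, the Python sketch below calculates the entropy of the tokens adjacent to a candidate word and maps the result to a bin label. This is not the thesis's actual implementation: the abstract does not give the entropy definition or the discretization scheme, so the left/right neighbour entropy, the bin edges, the function names and the feature names are all assumptions made for this example.

```python
import math
from collections import Counter


def branching_entropy(tokens, target, side="right"):
    """Shannon entropy (in bits) of the tokens adjacent to each occurrence of target.

    Assumption: the "information entropy feature" is a left/right
    neighbour entropy; the abstract does not define it.
    """
    neighbors = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        j = i + 1 if side == "right" else i - 1
        if 0 <= j < len(tokens):
            neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def discretize(value, edges):
    """Map a continuous value to a bin label, given ascending bin edges.

    The edges are hypothetical; the thesis proposes its own scheme.
    """
    bin_index = sum(value >= e for e in edges)
    return f"B{bin_index}"


def token_features(tokens, pos_tags, i, edges=(0.5, 1.0, 2.0)):
    """Build a feature dict for token i (feature names are illustrative)."""
    word = tokens[i]
    return {
        "word": word,
        "pos": pos_tags[i],
        "length": len(word),
        "left_entropy": discretize(branching_entropy(tokens, word, "left"), edges),
        "right_entropy": discretize(branching_entropy(tokens, word, "right"), edges),
    }


if __name__ == "__main__":
    # Toy segmented sentence with invented POS tags, for illustration only.
    toks = ["系统", "生物学", "研究", "基因", "调控", "网络", "和", "代谢", "网络"]
    tags = ["n", "n", "v", "n", "vn", "n", "c", "n", "n"]
    for idx in range(len(toks)):
        print(token_features(toks, tags, idx))
```

Continuous statistics such as entropy are binned here because CRF feature templates usually work with categorical values. The correlation feature named in the abstract could be discretized in the same way.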
Keywords/Search Tags: Out-of-vocabulary word identification, Conditional random fields model, Post-processing rule, Low-frequency word