
Chinese Term Extraction In Specific Domain

Posted on: 2012-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: D Li
Full Text: PDF
GTID: 2218330368988090
Subject: Computer application technology
Abstract/Summary:
A term is a linguistic unit that denotes a basic concept of a specific subject area. Terms reflect the core knowledge of a field and help people acquire professional knowledge quickly. As technology develops, new knowledge keeps emerging and terminology evolves with it. Given the large volume of digital material produced in the era of information explosion, collecting terms by hand is no longer feasible, so automatic term acquisition has naturally become an active research topic. Automatic term extraction is an important task in information processing, with applications in dictionary compilation, domain ontology construction, machine translation, and other areas.

Current term extraction methods are commonly rule-based, statistical, or a combination of the two. Statistical methods can be further divided, according to whether a labeled corpus is available, into supervised statistical machine learning methods and unsupervised statistical methods. Because labeled corpora are scarce, little previous research has addressed term extraction based on statistical machine learning. This thesis studies term extraction in a specific domain, analyzes the characteristics of domain terms, and compares them with named entities. For the automotive domain, we developed labeling rules and annotated a corpus. Term extraction based on Conditional Random Fields (CRFs) achieves a precision, recall, and F-measure of 86.41%, 80.50%, and 82.50%, respectively.

Because labeling a domain corpus by hand is laborious, this thesis introduces active learning into CRF-based term extraction. It uses the uncertainty-based sample selection strategy of active learning, with a confidence score computed from the conditional probabilities produced by the CRF model. Experimental results show that adding samples chosen by active learning gives better results than adding samples at random, so the desired performance is reached with less tagged corpus.

Supervised statistical machine learning methods achieve better results but depend heavily on the scale and quality of the tagged corpus, so this thesis also studies an unsupervised, statistics-based method for domain term extraction. It analyzes the performance of information entropy, mutual information, and C-value in domain term extraction, and combines them with part-of-speech formation rules for terms as a filter to improve extraction accuracy. The final F-score is 15.41%.

In summary, this thesis presents several approaches to domain term extraction. The statistical method requires the fewest resources, but its results are poor. The CRF method achieves the best results, while the method combining active learning with CRFs achieves similar results and needs less tagged corpus than the approach without active learning.
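
The uncertainty-based selection step described in the abstract can be illustrated with a minimal sketch. It assumes that the confidence of a sentence is the conditional probability the CRF assigns to its best label sequence, and that the least confident sentences are sent for manual annotation first. The abstract does not give the exact confidence formula, so this is one common reading (least-confidence sampling), not necessarily the thesis's own procedure. The function name `select_least_confident` and the confidence values are hypothetical.

```python
from typing import List, Sequence


def select_least_confident(confidences: Sequence[float], batch_size: int) -> List[int]:
    """Return the indices of the `batch_size` samples with the lowest confidence.

    `confidences[i]` is assumed to be P(best label sequence | sentence i),
    as produced by a CRF trained on the currently labeled data.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Rank unlabeled samples from least to most confident.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:batch_size]


# Hypothetical confidences for five unlabeled sentences (illustrative values only).
pool_confidences = [0.95, 0.42, 0.88, 0.17, 0.63]

# The two sentences the model is least sure about are chosen for annotation.
to_annotate = select_least_confident(pool_confidences, batch_size=2)
print(to_annotate)  # [3, 1]
```

In an active learning loop, the selected sentences would be labeled by hand, added to the training set, and the CRF retrained before the next round of selection. Random sampling, the baseline mentioned in the abstract, would pick indices without consulting the confidences.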
Keywords/Search Tags: Domain Term, Term Extraction, Conditional Random Fields, Active Learning, Statistical Method