A Study On The Chinese Term Extraction

Posted on:2011-08-04

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Zhou

Full Text:PDF

GTID:1118360302998779

Subject:Computer application technology

Abstract/Summary:

Terms are the carriers of domain knowledge, and their creation, spread and disappearance reflect the dynamic development and evolution of a subject. As knowledge sources, domain term databases offer a convenient and quick way to acquire professional knowledge. Automatic term extraction is not only one of the key technologies for building domain term databases but also a basic topic in natural language processing, and it supports many other research areas, such as machine translation, information retrieval, automatic abstracting, text classification and dictionary compilation.

In this dissertation, the author moves beyond the restriction to noun phrases, admits a wider variety of structures, and broadens the linguistic rules. Combining empirical analysis with machine learning strategies, the research focuses on term structural integrity, domain relevance and collocation, and achieves the following results:

Firstly, a computer term database containing more than 40,000 items is constructed, taking the word as the minimal linguistic unit. Based on the distribution features of terms of different lengths, morphological rules of term structure are induced by machine learning methods. Enriching the linguistic rules enlarges their coverage and improves recall.

Secondly, a single-word term recognition approach based on fuzzy clustering is proposed, exploiting the simple structure and unambiguous boundaries of single-word terms. The recognition process is recast as a classification task. Without requiring specialized dictionaries or many additional corpora, single-word terms can be tagged automatically by the clustering algorithm.

Thirdly, a substring reduction algorithm based on an independence statistic is proposed to estimate the structural integrity of candidates. Unlike existing methods, which map a single parent string to its many substrings, this algorithm tries to capture the links between a string and its parent strings. In the experiments, 29.44% of the candidates are filtered out in time. In addition to ordinary fragmentary substrings, common-substring noise can also be recognized.

Fourthly, the concept of word active degree (WAD) is proposed to evaluate the collocation ability of non-noun words. Combined with the cohesion between words, this parameter measures how appropriately the words in a phrase collocate, and it removes ill-formed collocations and phrases containing an excessively active segment. In the experiments, WAD shows a strong ability to identify errors caused by verb-object phrases and prepositional phrases, and the precision reaches 99.97%.

Finally, based on the distributional differences between terms and non-terms, a domain relevance measure is proposed that combines a local distribution variety feature with a global coverage feature. In the experiments, this method efficiently improves the ranking of low-frequency terms and basic terms with low computational complexity.
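The substring reduction step can be illustrated with a minimal sketch. The abstract does not define the independence statistic, so the score below is an assumption made purely for illustration: the share of a candidate's occurrences that are not covered by its most frequent longer parent string. The function names, the threshold of 0.2 and the toy frequency counts are all hypothetical and are not taken from the dissertation.

```python
def contains(parent, child):
    """Return True if token tuple `child` occurs contiguously inside `parent`."""
    n = len(child)
    return any(parent[i:i + n] == child for i in range(len(parent) - n + 1))


def independence(candidate, freq):
    """Assumed score: fraction of the candidate's occurrences that fall
    outside its most frequent parent string (1.0 if it has no parent)."""
    parent_freqs = [
        f for other, f in freq.items()
        if len(other) > len(candidate) and contains(other, candidate)
    ]
    if not parent_freqs:
        return 1.0
    return max(0.0, 1.0 - max(parent_freqs) / freq[candidate])


def reduce_substrings(freq, threshold=0.2):
    """Keep candidates whose assumed independence score reaches the threshold."""
    return [c for c in freq if independence(c, freq) >= threshold]


# Hypothetical candidate frequencies (token tuples mapped to corpus counts).
freq = {
    ("local", "area", "network"): 29,
    ("area", "network"): 30,          # almost always inside its parent
    ("distributed", "file", "system"): 12,
    ("file", "system"): 40,           # frequently used on its own
}

kept = reduce_substrings(freq)
for candidate in kept:
    print(" ".join(candidate), round(independence(candidate, freq), 3))
```

In this toy data, "area network" appears almost only inside "local area network" and is dropped as a fragment, while "file system" is used independently often enough to be kept. Taking only the largest parent frequency is a simplification; it avoids double counting nested parents but underestimates the dependent occurrences when a candidate has several unrelated parents.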

Keywords/Search Tags:

Term Extraction, Multi-structure Term, Morphological Structure Pattern, Substring Reduction, Word Active Degree, Collocation Test, Termhood

Related items

1	Research On Domain-Specific Term Extraction Based On Semi-Supervised Learning
2	Research On Terminology Extraction Of Academic Paper Based On Multi-Strategy Method
3	The Research Of Term Relation Extraction Based On Syntax Structure
4	Research Of Chinese Word Segmentation Oriented To Special Domain
5	Chinese Term Extraction In Specific Domain
6	Research On Chinese Relation Extraction For Complex Text Structure
7	Research On Extraction Of Bilingual Multi-word Term Translation Pairs From Comparable Corpora
8	Word Segmentation And Term Extraction From Multi-language Texts
9	Design And Implementation Of Translation Assistance System For Scientific And Technical Literature Based On Automatic Term Extraction
10	An Experimental Analysis Of The Term Structure Of Interest Rate And The Pricing Of Fixed-income Securities