A Study On The Chinese Term Extraction

Posted on: 2011-08-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhou
GTID: 1118360302998779
Subject: Computer application technology
Abstract/Summary:
Terms are the carriers of domain knowledge, and their creation, popularization and extinction reflect the development and evolution of a subject. As a knowledge source, a domain term database offers a quick and convenient way to acquire professional knowledge. Automatic term extraction is not only one of the critical technologies for constructing domain term databases but also a basic topic in natural language processing, and it supports many other research areas, such as machine translation, information retrieval, automatic abstracting, text classification and dictionary compilation.

In this dissertation, the author moves beyond the restriction to noun phrases, accepts a wider range of term structures, and broadens the linguistic rules. Combining empirical analysis with machine learning strategies, the research focuses on term structural integrity, domain relevance and collocation, and achieves the following results.

Firstly, a computer term database containing more than 40,000 items is constructed, taking the word as the minimal linguistic unit. Based on the distribution features of terms of different lengths, morphological rules of term structure are derived by machine learning methods. Enriching the linguistic rules enlarges their coverage and improves recall.

Secondly, a single-word term recognition approach based on fuzzy clustering is proposed, exploiting the simple structure and unambiguous boundaries of single-word terms. Recognition is recast as a classification task. Without relying on specialized dictionaries or many additional corpora, single-word terms can be tagged automatically by the clustering algorithm.

Thirdly, a substring reduction algorithm based on an independence statistic is proposed to estimate the structural integrity of candidates. Unlike current methods, which map a single parent string to its many substrings, this algorithm tries to capture the links between a string and its parent strings. In the experiments, 29.44% of the candidates are filtered out promptly. Besides ordinary fragmentary substrings, common-substring noise can also be recognized.

Fourthly, the concept of word active degree (WAD) is proposed to evaluate the collocation ability of non-noun words. Integrated with the cohesion between words, this parameter measures how appropriately the words in a phrase collocate, and it removes ill-formed collocations and phrases containing an excessively active segment. In the experiments, the WAD strongly distinguishes the errors caused by verb-object phrases and prepositional phrases, and the precision reaches 99.97%.

Finally, based on the difference in distribution between terms and non-terms, a domain relevance measure is proposed that combines a local distribution variation feature with a global coverage feature. In the experiments, this method efficiently improves the ranking of low-frequency terms and basic terms at low computational complexity.
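The abstract does not define the independence statistic used for substring reduction, so the following is only a minimal sketch of the general idea, under stated assumptions: candidates are word tuples with corpus frequencies, and a candidate's independence is taken to be the share of its occurrences not accounted for by its most frequent parent string. The function names, the formula and the threshold are illustrative and are not taken from the dissertation.

```python
def contains(parent, child):
    """Return True if `child` occurs as a contiguous word sequence in `parent`."""
    n, m = len(parent), len(child)
    return any(parent[i:i + m] == child for i in range(n - m + 1))


def independence(candidate, freq):
    """Assumed statistic: fraction of the candidate's occurrences that are
    not explained by its single most frequent parent string."""
    parent_freqs = [
        f for p, f in freq.items()
        if len(p) > len(candidate) and contains(p, candidate)
    ]
    if not parent_freqs:
        # No longer candidate contains this one, so it stands on its own.
        return 1.0
    return 1.0 - min(1.0, max(parent_freqs) / freq[candidate])


def reduce_substrings(freq, threshold=0.2):
    """Keep candidates whose independence is at least `threshold`
    (the threshold value is a hypothetical choice)."""
    return {c for c in freq if independence(c, freq) >= threshold}
```

Looking from each string up to its parent strings, rather than from one parent down to its substrings, mirrors the direction the abstract describes. A fragment that almost never appears outside a longer candidate scores near zero and is dropped, while a word that also occurs freely on its own is kept.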
Keywords/Search Tags: Term Extraction, Multi-structure Term, Morphological Structure Pattern, Substring Reduction, Word Active Degree, Collocation Test, Termhood