
Research On Automatic Bilingual Term Extraction Technology For Patents

Posted on: 2010-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
GTID: 2178360272485241
Subject: Computer application technology
Abstract/Summary:
Terms concentrate the core knowledge of a specific domain, so extracting them automatically helps people acquire and accumulate that knowledge conveniently. Bilingual terms additionally carry a mapping between two languages, which makes automatic bilingual term extraction widely useful in natural language processing tasks such as machine translation, information retrieval, and bilingual dictionary generation.

With the arrival of the era of massive data, statistical term extraction has become a research focus, and machine learning methods in particular achieve good results. Working from manually built bilingual term-annotated corpora, this thesis uses Conditional Random Fields (CRFs) to extract Chinese and English terms separately, and then applies the bilingual term similarity algorithm based on semantic prediction that it proposes to score the similarity of the extracted terms and complete the extraction process.

The main task of the thesis is to summarize the characteristics of terms in Chinese and English patents and to establish an annotation standard that distinguishes terms from other words. Following this standard, terms in Chinese and English patents are manually annotated to build the bilingual term-annotated corpora. On these corpora, a CRF extraction model is trained for each language, and experiments on feature selection, label set selection, and feature template selection are carried out to choose the trained model with the better extraction performance.
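As an illustration of the kind of training data a CRF tagger consumes, the following is a minimal sketch, not the thesis's actual pipeline. It assumes a B/I/O label set and a small hand-picked feature set; the function names `spans_to_bio` and `token_features`, the example sentence, and the domain word list are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert annotated term spans [start, end) into one B/I/O label per token."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"              # first token of a term
        for i in range(start + 1, end):
            labels[i] = "I"              # remaining tokens of the same term
    return labels


def token_features(tokens: List[str], i: int, domain_words: Set[str]) -> Dict[str, object]:
    """Build a feature dict for token i (assumed features, for illustration only)."""
    word = tokens[i].lower()
    return {
        "word": word,
        "suffix3": word[-3:],
        # Assumed stand-in for a "domain characteristic": membership in a word list.
        "is_domain_word": word in domain_words,
        # Context window of one token on each side, a typical template choice.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Hypothetical annotated sentence: two terms marked by token spans.
tokens = ["The", "liquid", "crystal", "display", "comprises", "a", "backlight", "module"]
spans = [(1, 4), (6, 8)]

labels = spans_to_bio(len(tokens), spans)
features = [token_features(tokens, i, {"crystal", "backlight"}) for i in range(len(tokens))]
```

Each sentence then becomes a pair of equal-length sequences (feature dicts and labels), which is the input shape that common CRF toolkits expect. Changing the label set or the context window corresponds to the label and template experiments described above.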
The test results show that introducing domain features and using a three-label tagging scheme effectively improve term extraction: the F-value is 88.43% for Chinese terms and 87.51% for English terms.

To handle abbreviations of Chinese terms and the paradigmatic phenomena of English terms, the thesis proposes the bilingual term similarity algorithm based on semantic prediction, which calculates the similarity of the extracted bilingual terms; the F-value of bilingual term extraction is 91.57%. Based on this algorithm, a modular and portable automatic bilingual term extraction system for patents was implemented and tested.
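For readers unfamiliar with the F-value, the sketch below shows one common way to compute it for term extraction. It assumes exact-match scoring of term spans, B/I/O labels, and the balanced F-measure F = 2PR / (P + R); the abstract does not state which matching criterion or weighting the thesis used, and the helper names `bio_to_spans` and `f_value` are hypothetical.

```python
from typing import List, Set, Tuple


def bio_to_spans(labels: List[str]) -> Set[Tuple[int, int]]:
    """Decode B/I/O labels into a set of term spans [start, end)."""
    spans = set()
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:        # a new term begins right after another
                spans.add((start, i))
            start = i
        elif label == "I":
            if start is None:            # stray I without B: treat it as a term start
                start = i
        else:                            # "O" closes any open term
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:                # term running to the end of the sentence
        spans.add((start, len(labels)))
    return spans


def f_value(gold: Set[Tuple[int, int]], predicted: Set[Tuple[int, int]]) -> float:
    """Balanced F-measure over exactly matching spans (assumed criterion)."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the tagger truncates the first term and gets the second right.
gold = bio_to_spans(["O", "B", "I", "I", "O", "O", "B", "I"])
pred = bio_to_spans(["O", "B", "I", "O", "O", "O", "B", "I"])
score = f_value(gold, pred)  # one of two predicted spans matches one of two gold spans
```

Under exact matching a partially recovered term counts as both a false positive and a false negative, so span-level scores are typically lower than token-level accuracy for the same output.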
Keywords/Search Tags:Bilingual terminology similarity, Conditional random field, Machine learning, Machine translation