
Chinese Word Segmentation For Patent Documents

Posted on: 2014-02-24  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Yue  Full Text: PDF
GTID: 2248330395467851  Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the information age, patent documents have become the largest carrier of technical information, and patents play an important role in scientific development and technological innovation. Making effective use of this massive body of technical documents is the central challenge for a Chinese patent information processing system. Chinese word segmentation is a fundamental component of such a system: patent retrieval and patent translation both build on it, so the effectiveness of word segmentation is key to making use of patent documents.

Chinese word segmentation and part-of-speech (POS) tagging have achieved many satisfactory results in general, but little published research addresses word segmentation for patent documents, and no specialized open-source segmentation system for patent documents exists. Based on the characteristics of patent documents, this thesis presents a statistical approach to Chinese word segmentation based on domain dictionaries. On Chinese patent documents, its precision and recall are much higher than those of the ICTCLAS system, and unknown-word recognition improves markedly.

Patent documents contain many unknown domain terms, which lowers segmentation accuracy. To address this, the thesis uses the NC-value algorithm to extract patent domain terms and applies a Conditional Random Fields (CRF) model, with an extraction template, to low-frequency terms in order to improve domain term recognition. Comparative experiments against the classic term extraction algorithm on patent documents indicate that the approach handles unknown-word recognition well and improves word segmentation performance by more than 10%.
Keywords/Search Tags: patent documents, Chinese word segmentation, conditional random fields (CRF), domain term extraction
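
The abstract compares segmenters by precision and recall. As a minimal sketch of how these metrics are commonly computed for word segmentation (not necessarily the exact evaluation protocol used in the thesis), each segmentation can be converted into character-offset spans, and a predicted word counts as correct only when both of its boundaries match the gold standard. The function names and the example sentence below are hypothetical and chosen only for illustration.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F1) of a predicted segmentation.

    Both arguments are word lists covering the same character sequence.
    A predicted word is correct only if its exact span appears in the gold set.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: a domain term wrongly split by the segmenter.
gold = ["条件", "随机场", "模型"]
pred = ["条件", "随机", "场", "模型"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(precision, recall, f1)
```

In this example the unknown domain term is split in two, so two of the four predicted words and two of the three gold words match. This is the kind of error that domain term extraction is meant to reduce.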