Research On Chinese Word Segmentation Technique For Patent Documents

Posted on: 2011-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: D S Liu
GTID: 2178360302988510
Subject: Computer application technology
Abstract/Summary:
Patent documents record and transmit scientific and technological achievements, and they are also the largest source of technical information. Patent information processing systems have emerged to address the problem of how to exploit this information so that patents can play an important role in research and in the patent business. Word segmentation of patent documents is a fundamental component of a Chinese patent information processing system: applications such as patent retrieval and patent translation build on it.

Current research on Chinese word segmentation concentrates on news text. State-of-the-art systems achieve high segmentation accuracy on news text, but their results on patent corpora are not ideal. Addressing the characteristics and difficulties of automatic segmentation of patent documents, this thesis presents a word segmentation approach based on statistics and rules. The method makes full use of global information from a large-scale corpus and of contextual information from the text being segmented, and it effectively addresses the difficulty of identifying unknown words in patent word segmentation. Experimental results indicate that the method performs well in the open test and also performs well on unknown word recognition.

Supervised learning methods need a large-scale training corpus from the same source and obtain word context information by tuning the window size. To overcome these deficiencies, this thesis introduces high-frequency words from a large-scale corpus and the contextual information of the segmented text as auxiliary features in a segmentation system based on conditional random fields, and thus proposes a word segmentation approach that incorporates unsupervised segmentation information into conditional random fields. Comparative experiments against a state-of-the-art system on a patent document corpus indicate that the approach addresses the lack of training data, captures more statistical word boundary information, and improves word segmentation performance by about 7%.

In view of the latent hierarchical structure of multi-word terms in patent documents, this thesis also makes an exploratory study of the hierarchical segmentation of multi-word terms, based on an analysis of the word-formation characteristics of patent terms.
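
The idea of feeding unsupervised information into a CRF-based segmenter as auxiliary features can be illustrated with a minimal sketch. This is not the thesis's actual feature set: the B/M/E/S tagging scheme, the function names `to_bmes` and `char_features`, the flags `freq_begin` and `freq_end`, and the small set of frequent strings are all assumptions made for illustration. The resulting per-character feature dictionaries are the kind of input a CRF toolkit could be trained on.

```python
from typing import Dict, List, Set


def to_bmes(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_features(sent: str, i: int, frequent: Set[str],
                  max_len: int = 4) -> Dict[str, object]:
    """Features for character i: a local character window plus auxiliary
    flags derived from a set of high-frequency strings (hypothetical)."""
    feats: Dict[str, object] = {
        "c0": sent[i],
        "c-1": sent[i - 1] if i > 0 else "<BOS>",
        "c+1": sent[i + 1] if i + 1 < len(sent) else "<EOS>",
    }
    # Auxiliary feature: a frequent string of length 2..max_len starts at i.
    feats["freq_begin"] = any(
        sent[i:i + n] in frequent
        for n in range(2, max_len + 1) if i + n <= len(sent)
    )
    # Auxiliary feature: a frequent string of length 2..max_len ends at i.
    feats["freq_end"] = any(
        sent[i - n + 1:i + 1] in frequent
        for n in range(2, max_len + 1) if i - n + 1 >= 0
    )
    return feats


def sentence_features(sent: str, frequent: Set[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every character of an unsegmented sentence."""
    return [char_features(sent, i, frequent) for i in range(len(sent))]


# Toy example: the frequent strings stand in for statistics mined from
# a large unlabeled corpus.
gold_words = ["专利", "文献", "的", "分词"]
frequent_strings = {"专利", "分词"}
tags = to_bmes(gold_words)
features = sentence_features("".join(gold_words), frequent_strings)
```

In this sketch the auxiliary flags give the tagger boundary hints that do not depend on labeled patent text, which is the role the abstract assigns to the unsupervised information.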
Keywords/Search Tags: Chinese word segmentation, Patent documents, Machine learning, Contextual information, Conditional random fields