Research On Chinese Word Segmentation Technique For Patent Documents

Posted on:2011-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:D S Liu

Full Text:PDF

GTID:2178360302988510

Subject:Computer application technology

Abstract/Summary:

As carriers that record and transmit scientific and technological achievements, patent documents are also the largest source of technical information. Patent information processing systems have emerged to make this information useful across research and patent business. Word segmentation of patent documents is a fundamental component of a Chinese patent information processing system: applications such as patent retrieval and patent translation build on it.

Current research on Chinese word segmentation concentrates on news text. State-of-the-art systems achieve high segmentation performance on news text but fall short on patent corpora. Addressing the characteristics and difficulties of automatic segmentation of patent documents, this thesis presents a word segmentation approach that combines statistics and rules. The method makes full use of global information from a large-scale corpus and of context information from the text being segmented, and it effectively addresses the difficulty of identifying unknown words in patent word segmentation. Experimental results indicate that the method performs well in the open test and on unknown word recognition.

Supervised learning methods need a large training corpus from the same source as the target text, and they obtain word context information only by tuning the window size. To overcome these deficiencies, this thesis introduces the high-frequency words found in a large-scale corpus and the context information of the text being segmented as auxiliary features in a segmentation system based on conditional random fields, and thus proposes a word segmentation approach that incorporates unsupervised segmentation information into conditional random fields. Comparative experiments against a state-of-the-art system on a patent document corpus indicate that the approach addresses the lack of training data, captures more statistical word boundary information, and improves word segmentation performance by about 7%.

Finally, in view of the latent hierarchical structure of multi-word terms in patent documents, this thesis makes an exploratory study of hierarchical segmentation of multi-word terms, based on an analysis of the word-formation characteristics of patent terms.
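
The idea of feeding unsupervised corpus statistics into a conditional random field as auxiliary features can be illustrated with a minimal Python sketch. It is not the thesis's actual feature set: the B/M/E/S tagging scheme, the function names (`bmes_tags`, `frequent_ngrams`, `char_features`), the one-character window, and the n-gram frequency threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def bmes_tags(words):
    """Convert a segmented sentence (list of words) to per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def frequent_ngrams(unlabeled_texts, max_n=4, min_count=2):
    """Collect character n-grams (length 2..max_n) seen at least min_count times.

    Hypothetical stand-in for the 'high-frequency words in a large-scale corpus'.
    """
    counts = Counter()
    for text in unlabeled_texts:
        for n in range(2, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {gram for gram, c in counts.items() if c >= min_count}


def char_features(sent, i, lexicon, max_n=4):
    """Features for character i: a +/-1 character window plus auxiliary lexicon flags."""
    feats = {
        "c0": sent[i],
        "c-1": sent[i - 1] if i > 0 else "<BOS>",
        "c+1": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }
    # Auxiliary (assumed) features: does a frequent n-gram start or end at i?
    feats["starts_freq"] = any(
        sent[i:i + n] in lexicon
        for n in range(2, max_n + 1)
        if i + n <= len(sent)
    )
    feats["ends_freq"] = any(
        sent[i - n + 1:i + 1] in lexicon
        for n in range(2, max_n + 1)
        if i - n + 1 >= 0
    )
    return feats
```

Each character's feature dictionary, paired with its B/M/E/S tag from the labeled data, would then be passed to a CRF trainer. The lexicon flags give the model boundary evidence gathered from unlabeled patent text, which is the role the abstract assigns to the unsupervised segmentation information.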

Keywords/Search Tags:

Chinese word segmentation, Patent documents, Machine learning, Contextual information, Conditional random fields

Related items

1	Research Of Chinese Word Segmentation With Conditional Random Fields
2	Chinese Word Segmentation For Patent Documents
3	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
4	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
5	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
6	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
7	The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment
8	Research On Key Technologies Of Word Segmentation In Chinese Patent Documents
9	The Key Technology On Chinese Word Segmentation Based On Bi-LSTM-CRF Model
10	Research Of Named Entity Recognition Based On Conditional Random Fields