Research And Implementation Of Chinese Word Segmentation Based On Character Tagging Method

Posted on:2016-03-16

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y You

Full Text:PDF

GTID:2308330473957107

Subject:Computer software and theory

Abstract/Summary:

Many Chinese language processing tasks operate on words, so word segmentation is the first step of such systems. In a Chinese sentence, whether a group of characters forms a word depends on the context. This differs from English and other languages that use the space character as a word delimiter. Because context is a relatively vague concept, word segmentation is a difficult task. As statistical machine learning methods have matured, their fields of application have expanded. Character-tagging-based word segmentation performs well on segmentation tasks: researchers borrowed the idea of part-of-speech tagging to perform word segmentation, which greatly improved the accuracy of Chinese segmentation.

This thesis discusses two models, the maximum entropy model and the linear-chain conditional random field model, and studies their derivations in detail. We then develop two word segmentation methods based on these models. We propose improvements to the training and prediction processes, including a better formula presentation, a multi-threaded method, and a new prediction method. In addition, the thesis discusses how using more tags and more features affects segmentation accuracy.

The results show that the multi-threaded method reduces training time. They also show that the proposed prediction method is superior to traditional methods in performance and slightly better in accuracy, which suggests that including more information about the problem in the model improves segmentation accuracy. The results further show that the method based on the linear-chain conditional random field model achieves higher accuracy on the sequence labeling problem, but its longer training time may limit its use where the model requires regular updates. Finally, the results show that more expressive features help improve segmentation accuracy.
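To illustrate the character tagging idea described in the abstract, the following minimal Python sketch converts a segmented sentence into per-character tags and back. The abstract does not state which tag set the thesis uses; the four-tag B/M/E/S scheme (begin, middle, end, single) is an assumption chosen because it is a common convention, and the function names `words_to_tags` and `tags_to_words` are hypothetical. In a real system the tags would be predicted by a trained maximum entropy or conditional random field model rather than derived from a known segmentation.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags.

    Assumed tag set (not taken from the thesis):
    B = first character of a multi-character word, M = middle character,
    E = last character, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word ends on E (end of a multi-character word) or S (single).
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Keep trailing characters if the tag sequence ends mid-word.
        words.append(current)
    return words


# Example: three words of length 1, 2, and 4 characters.
sentence = ["我", "喜欢", "自然语言"]
tags = words_to_tags(sentence)
restored = tags_to_words("".join(sentence), tags)
```

Framing segmentation this way turns it into a sequence labeling problem: once every character has a tag, the word boundaries follow mechanically, so all of the modeling effort goes into predicting the tag sequence.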

Keywords/Search Tags:

character-tagging-based word segmentation, maximum entropy, conditional random field, POS tagging, L-BFGS (Limited-memory BFGS)

Related items

1	Research Of Chinese Word Segmentation Based On Mechanical Matching And Character Tagging
2	A Study On Cambodian Word Method Based On Conditional Random Field
3	Research On The Learning Of Integrating Chinese Word Segmentation With Part-of-Speech Tagging And Domain Adaption Approach
4	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
5	Chinese POS Tagging Employing Maxent And Word Clustering
6	Study Of Chinese POS Tagging Based On Maximum Entropy
7	The Method Of The Vietnamese Lexical Analysis Research
8	Complextext Sequence Labeling With BILSTM And CRF Algorithm Based On Peephole
9	Research Of Named Entity Recognition Based On Conditional Random Fields
10	Research On Chinese Lexical Analysis Model Algorithm Based On Deep Learning