Study On The Tibetan Word Segmentation And Named Entity Recognition With Conditional Random Fields

Posted on: 2014-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Li
GTID: 2268330425970660
Subject: Computer application technology
Abstract/Summary:
Tibetan word segmentation (TWS) and named entity recognition (NER) are important problems in Tibetan information processing. TWS segments a raw Tibetan sentence into a word sequence, while NER recognizes the entities in that word sequence and classifies them at the same time. The traditional approach to Tibetan word segmentation is rule-based and performs poorly on unknown words and ambiguity. Research on Tibetan NER has a weak foundation and has concentrated mainly on rule-based methods. Existing statistical methods for TWS and NER have usually served as secondary methods; only in the last three years have methods based on large-scale corpora and machine learning been taken seriously.

This thesis systematically studies TWS and NER based on conditional random fields (CRF), and designs and implements a CRF-based Tibetan word segmentation system. We also propose a method that combines maximum entropy and conditional random fields to identify Tibetan person names. The work includes the following.

We propose a statistical method for Tibetan abbreviated word recognition (AWR) and evaluate it in experiments with CRF; the results indicate that the AWR problem has no significant effect on TWS. Tibetan is encoded as an alphabetic script, and Tibetan words are composed of syllables, so TWS combines continuous syllable sequences into a word sequence. Abbreviated words affect syllable recognition and thereby reduce the quality of TWS. The statistical AWR method treats AWR as a classification problem and solves it with a machine learning classifier. Compared with the rule-based method, our approach does not require vocabulary support and can be combined easily with a statistical segmentation model, which significantly improves Tibetan word segmentation.

We determine a suitable syllable tagging scheme, with which our system outperforms the previous systems in the literature.
The syllable-tagging approach treats TWS as determining the position of each syllable within its word, so the choice of tagging scheme greatly affects segmentation. We propose a four-position tagging scheme, "BMES", which, combined with the AWR model, significantly improves Tibetan word segmentation. In a comparison experiment, this system outperforms the previous systems in the literature.

We systematically study feature selection and unknown word recognition in the CRF system. Selecting appropriate features is the most important step in statistical segmentation, yet there is little literature on feature selection for TWS with CRF. We systematically study the different features for TWS with CRF. Unknown words are a key problem in word segmentation, and unknown word recognition (UWR) is an important measure of a word segmentation system. We study UWR within a single dataset and across datasets, run experiments on an open corpus, and compare the results with the performance of Chinese UWR.

We propose a method that combines maximum entropy (ME) and conditional random fields to identify Tibetan named entities. It performs better than either model alone and balances the precision and recall shortcomings of the two models. Because no open Tibetan NER corpus exists, we annotated the Tibet Daily corpus and ran NER experiments with the ME and CRF models separately. To address the problems found in the two individual models, we propose a method that combines ME and CRF, which achieved good results.
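To make the four-position syllable tagging idea concrete, the following is a minimal sketch of how a segmented sentence could be converted to "BMES" tags and back. It assumes the conventional reading of the labels (B = first syllable of a word, M = middle syllable, E = last syllable, S = single-syllable word), which the abstract does not spell out. The function names and the romanized placeholder syllables are hypothetical and are not taken from the thesis; the CRF model itself is not shown.

```python
from typing import List, Tuple

Word = List[str]  # a word is a list of syllables


def words_to_bmes(words: List[Word]) -> List[Tuple[str, str]]:
    """Tag each syllable with its position in its word: B, M, E, or S.

    Assumption: conventional BMES semantics (Begin, Middle, End, Single).
    """
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty words; there is nothing to tag
        if len(word) == 1:
            tagged.append((word[0], "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((syl, "M") for syl in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged


def bmes_to_words(tagged: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild words from (syllable, tag) pairs by closing a word at E or S."""
    words: List[Word] = []
    current: Word = []
    for syllable, tag in tagged:
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:
        # Tolerate a tag sequence that ends mid-word (e.g. model output).
        words.append(current)
    return words


# Hypothetical placeholder syllables, used only for illustration.
segmented = [["bod", "skad"], ["la"], ["slob", "sbyong", "byed"]]
tagged = words_to_bmes(segmented)
tags = [tag for _, tag in tagged]
restored = bmes_to_words(tagged)
```

In a setup like this, a sequence labeler such as a CRF would be trained to predict the tag column from the syllable column, and `bmes_to_words` would turn predicted tags back into a word sequence.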
Keywords: Tibetan word segmentation, named entity recognition, abbreviated word recognition, conditional random fields, maximum entropy