Statistical Chinese Lexical Analysis And Its Reinforcement Learning Mechanism

Posted on: 2008-05-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Jiang
Full Text: PDF
GTID: 1118360245996575
Subject: Computer application technology
Abstract/Summary:
Lexical analysis (LA) is a foundational task of natural language processing (NLP), and it strongly influences syntactic analysis and the applications built on top of it. In this dissertation, LA comprises word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER). Because LA is a prerequisite step, early errors cascade through the processing chain and degrade the final performance of systems such as information retrieval, question answering, and machine translation. In addition, the approaches and techniques developed for LA help solve similar tasks, such as Pinyin-to-character conversion, shallow parsing, and biological information processing. This makes the work valuable and meaningful.

The main difficulties in improving LA are ambiguity, data sparseness, and the independent and identically distributed (i.i.d.) assumption. This dissertation focuses on the LA task and studies it with statistical approaches. In terms of models: 1) for supervised learning, we explore the N-gram model, the Maximum Entropy (ME) model, Conditional Random Fields (CRF), and Support Vector Machines (SVM), among others; 2) for unsupervised learning, we build a word vector space. In terms of features, we propose extracting complicated features with rough set theory and extracting named entity features with the trigger pair method. We study LA in depth with these theories and approaches. The dissertation covers the following aspects:

1) Build a Chinese POS tagging model based on CRF. The hidden Markov model (HMM) is a generative model, so rich features are not easily added to it. The Maximum Entropy Markov Model (MEMM) is a conditional probabilistic model that easily incorporates rich features but suffers from the label bias problem. A CRF uses a single exponential model for the joint probability of the entire label sequence given the observation sequence, so it can incorporate rich features and also overcome the label bias problem.
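The contrast between MEMM and CRF can be written out in standard linear-chain notation. The symbols here (feature functions f_k, weights \lambda_k, normalizers Z) are conventional textbook notation assumed for illustration, not notation taken from the dissertation. The MEMM normalizes each transition locally, which is the source of label bias, while the CRF normalizes once over whole label sequences.

```latex
% MEMM: one locally normalized distribution per position
P_{\mathrm{MEMM}}(y \mid x) = \prod_{t=1}^{T}
  \frac{\exp\Big(\sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big)}
       {Z(y_{t-1}, x, t)},
\qquad
Z(y_{t-1}, x, t) = \sum_{y'} \exp\Big(\sum_{k} \lambda_k f_k(y_{t-1}, y', x, t)\Big)

% Linear-chain CRF: a single exponential model, normalized over all label sequences
P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)}
  \exp\Big(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t)\Big)
```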
In addition, we apply trigger pair features to capture long-distance dependencies, and we explore the feedback influence of chunking features on POS tagging. We describe a method for performing sequential tagging with SVM and apply it to Pinyin-to-character conversion. Finally, we describe a multi-model combination method for POS tagging.

2) Research on Chinese NER based on ME. ME is a conditional probabilistic model that easily incorporates rich features. Recent evaluations seem to indicate that linear or log-linear models perform well on the NER task. We propose collecting stable features with the trigger pair method. Furthermore, we explore a feature extension approach that combines a thesaurus with word clusters. Considering the characteristics of Chinese NER, we propose a double-layer mixing model and introduce a domain-extended learning strategy, so that paragraph-level or chapter-level features can be used to improve performance.

3) Propose applying rough set theory to extract complicated contextual features. Such features are difficult to extract with existing models, especially from corpora that contain noisy and inconsistent samples, because the data sparseness and noise problems become more serious when complicated features are extracted. With rough sets, complex and long-distance features are collected effectively. These rough rules are then added to the maximum entropy model, which assigns weights to all features according to the overall performance of the model. Furthermore, we apply variable precision rough set theory to improve performance when the decision tags are unevenly distributed. The experiments verify the effectiveness of our approaches.

4) Research on reinforcement learning (RL) in the LA task.
Corpus-based supervised learning almost always encounters the data sparseness problem and relies on the i.i.d. assumption. Because of Zipf's law, data sparseness can hardly be solved by enlarging the corpus. Moreover, the application domain generally differs from that of the training corpus, so the i.i.d. assumption is not easily met. In many tasks, these two problems are obstacles to improving supervised learning algorithms. When improvement of a supervised system hits a bottleneck, RL is a meaningful research direction. Considering the "local perception property" of real feedback information, we focus on exploring online learning with "local perception". In this dissertation, we build a Chinese person name recognition model based on clonal selection theory, and we build word segmentation, POS tagging, and Pinyin-to-character conversion models based on reinforcement learning.
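The data sparseness argument above can be illustrated with a small calculation. The sketch below is not from the dissertation: it assumes an idealized closed vocabulary whose word probabilities follow Zipf's law, and the function name `unseen_mass` and all parameter values are hypothetical. It computes the expected probability that the next token is a word type never seen in a sample of a given size.

```python
def unseen_mass(num_tokens: int, vocab_size: int, exponent: float = 1.0) -> float:
    """Expected probability mass of word types absent from an i.i.d. sample.

    Assumption (illustrative only): the word at rank r has probability
    proportional to 1 / r**exponent over a closed vocabulary.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    mass = 0.0
    for weight in weights:
        p = weight / total
        # A type with probability p is missing from the sample with
        # probability (1 - p) ** num_tokens; weight that by p itself.
        mass += p * (1.0 - p) ** num_tokens
    return mass


if __name__ == "__main__":
    # Hypothetical corpus sizes and vocabulary size, chosen only for illustration.
    for size in (10_000, 100_000, 1_000_000):
        print(size, round(unseen_mass(size, vocab_size=200_000), 4))
```

Under these assumptions, a tenfold increase in corpus size reduces the unseen mass by much less than a factor of ten over this range, which is one way to read the claim that enlarging the corpus alone hardly solves data sparseness.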
Keywords/Search Tags: Lexical Analysis, Statistical Language Model, Feature Extraction, Artificial Immune System, Reinforcement Learning