The Study Of Rule-based Chinese Words Tagging Method

Posted on: 2015-03-09 | Degree: Master | Type: Thesis
Country: China | Candidate: H D Li | Full Text: PDF
GTID: 2268330428976088 | Subject: Computer software and theory
Abstract/Summary:
Chinese part-of-speech (POS) tagging is a basic research topic in natural language processing and has attracted considerable attention. POS tagging is one of the basic steps of shallow language processing. Its results provide a necessary foundation for information extraction, semantic analysis, and other higher-level processing tasks, and it plays an important role in practical applications of natural language processing. This thesis therefore takes POS tagging as its research target and studies its key problems.
At present, the accuracy of English POS tagging is high, and the task can be handled with traditional statistical models, largely because of the characteristics of English grammar. The accuracy of tagging ambiguous words is the decisive factor in overall tagging accuracy. In English, a change in a word's part of speech is generally accompanied by a change in its morphology, whereas Chinese words do not change form, which creates great difficulty for statistical models. As a result, the accuracy of Chinese POS tagging is much lower than that of English. Another important factor is the treatment of unknown words, that is, words not included in the dictionary of the statistical model. Once the dictionary reaches a certain scale, unknown words are mainly named entities, including person names, place names, and organization names. Feature template selection also affects accuracy: a statistical model gathers context information according to its feature templates, so the design of the template set matters. Solving these problems is important for Chinese POS tagging.
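To make the idea of a feature template concrete, the following is a minimal sketch of how context features might be extracted for one word in a segmented sentence. The template set (neighbouring words, a word bigram, first and last character, word length) and the names `extract_features`, `<BOS>`, and `<EOS>` are illustrative assumptions, not the templates used in the thesis.

```python
def extract_features(words, i):
    """Return a dict of context features for the word at position i.

    Assumes `words` is a list of non-empty, already segmented words.
    The template set below is hypothetical.
    """
    def word_at(k):
        # Pad positions outside the sentence with boundary symbols.
        if k < 0:
            return "<BOS>"
        if k >= len(words):
            return "<EOS>"
        return words[k]

    w = words[i]
    return {
        "w0": w,                             # current word
        "w-1": word_at(i - 1),               # previous word
        "w+1": word_at(i + 1),               # next word
        "w-1|w0": word_at(i - 1) + "|" + w,  # word bigram
        "first_char": w[0],                  # first character, useful for unknown words
        "last_char": w[-1],                  # last character, useful for unknown words
        "length": len(w),                    # word length in characters
    }


sentence = ["我", "爱", "北京"]
features = extract_features(sentence, 1)
```

Character-level features such as `first_char` and `last_char` are one common way to give a model some evidence about words missing from its dictionary, since Chinese words carry no inflectional morphology to rely on.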
There are currently three main approaches to Chinese POS tagging: rule-based methods, statistical methods, and methods that combine rules and statistics. The third approach, which draws on the strengths of both rule-based and statistical methods, can be a very good solution to the problem of Chinese POS tagging. This thesis focuses on the third approach.
This thesis builds three traditional statistical models, namely the hidden Markov model, the conditional random field model, and the maximum entropy model, uses them to tag the "People's Daily" corpus, and reports statistics on the tagging results. We also study feature selection for Chinese POS tagging, examine how different feature templates affect tagging accuracy, and propose our own feature selection method. Because unknown words are another important factor in tagging accuracy, we propose an unknown-word processing strategy that improves the accuracy of POS tagging. To address the low accuracy of traditional statistical models on Chinese POS tagging, we introduce a method for mining Chinese POS tagging rules based on mutual information. The mined rules are studied, and a rule priority algorithm is used to resolve conflicts between rules. Finally, we combine the rule-based method with the statistical models. The experimental results show that the rule mining method can improve the accuracy of POS tagging.
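As a rough illustration of mutual-information-based rule mining, the sketch below scores each (context, tag) pair by pointwise mutual information (PMI) and keeps high-scoring pairs as rules. The abstract does not specify the form of the rules or the priority algorithm, so the context representation, the thresholds `min_count` and `min_pmi`, the use of the PMI score as rule priority, and the toy counts are all assumptions made for this example.

```python
import math
from collections import Counter


def mine_rules(samples, min_count=2, min_pmi=0.5):
    """Mine (context -> tag) rules from (context, tag) pairs.

    PMI(context, tag) = log2( P(context, tag) / (P(context) * P(tag)) ).
    Returns a list of (pmi, context, tag), highest PMI first.
    """
    pair_counts = Counter(samples)
    ctx_counts = Counter(ctx for ctx, _ in samples)
    tag_counts = Counter(tag for _, tag in samples)
    total = len(samples)

    rules = []
    for (ctx, tag), n in pair_counts.items():
        if n < min_count:
            continue  # skip pairs seen too rarely to trust
        pmi = math.log2(n * total / (ctx_counts[ctx] * tag_counts[tag]))
        if pmi >= min_pmi:
            rules.append((pmi, ctx, tag))

    # Assumed priority scheme: a higher PMI wins when rules conflict.
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules


def apply_rules(rules, contexts, fallback_tag):
    """Return the tag of the highest-priority matching rule,
    or the statistical model's tag if no rule fires."""
    for _, ctx, tag in rules:
        if ctx in contexts:
            return tag
    return fallback_tag


# Toy data: the context is the previous word's tag (m = numeral,
# v = verb); the label is the current word's tag (q = measure word,
# n = noun). The counts are invented for illustration.
samples = (
    [("prev_tag=m", "q")] * 3 + [("prev_tag=m", "n")]
    + [("prev_tag=v", "n")] * 3 + [("prev_tag=v", "q")]
)
rules = mine_rules(samples)
corrected = apply_rules(rules, {"prev_tag=m"}, fallback_tag="n")
```

In this arrangement the rules act as a correction layer on top of the statistical tagger: when no mined rule matches the current context, the tagger's own prediction (`fallback_tag`) is kept unchanged.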
Keywords/Search Tags: Part of speech tagging, Mutual information, Feature template, Rules