Research on Automatic Recognition of the Syntactic Category of Common Words

Posted on: 2013-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Xia
Full Text: PDF
GTID: 2248330371977228
Subject: Software engineering
Abstract/Summary:
As is well known, the quality of a Chinese corpus plays a decisive role in natural language processing research, and high-quality corpora are increasingly important to scholars. It is therefore necessary to consider multi-category words in the study of modern Chinese. Although the number of multi-category words in modern Chinese is small, the multi-category phenomenon is very complicated. It is a common problem that causes great difficulty for part-of-speech tagging, and correctly identifying the category of a multi-category word is one of the most critical problems in Chinese part-of-speech tagging. Research on the automatic identification of function word usage further highlighted the importance of identifying multi-category words correctly. This thesis mainly studies the recognition of the syntactic category of words using statistical methods: the conditional random field model, the maximum entropy model, and the k-nearest neighbor algorithm. The experimental results show that statistical methods identify the part of speech of multi-category words well, perform well on commonly used multi-category words, and achieve relatively high accuracy on the corpus. However, not all multi-category words are recognized well, and some words are not suited to statistical methods. For these isolated cases, rule-based methods can be used instead. Building on the statistical results, the author applies rule-based methods to the multi-category words that are not suited to statistical methods. Based on the distinct characteristics of different parts of speech, operable features are extracted to decide the category, and BNF (Backus-Naur Form) is used to describe the parts of speech of multi-category words.
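As a rough illustration of the statistical side, the k-nearest-neighbor idea can be sketched as follows. This is a minimal sketch, not the thesis's implementation: the one-word context window, the feature-overlap similarity, the tag labels `n` and `v`, the example word 花 (noun "flower" or verb "to spend"), and every training sentence are assumptions made up for illustration.

```python
from collections import Counter

def features(tokens, i):
    """Context features for the token at position i (assumed window: one word each side)."""
    prev_tok = tokens[i - 1] if i > 0 else "<BOS>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return {f"prev={prev_tok}", f"next={next_tok}"}

def knn_tag(train, query, k=3):
    """Tag a query feature set by majority vote among its k most similar training examples.

    train: list of (feature_set, tag) pairs.
    Similarity is the number of shared features (an assumed, very simple measure).
    sorted() is stable, so ties keep their training order.
    """
    ranked = sorted(train, key=lambda ex: len(ex[0] & query), reverse=True)
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented toy examples for the multi-category word "花": (tokens, index of 花, tag).
TOY_SENTENCES = [
    (["他", "花", "钱"], 1, "v"),
    (["我", "花", "时间"], 1, "v"),
    (["她", "花", "钱"], 1, "v"),
    (["一", "朵", "花"], 2, "n"),
    (["这", "朵", "花", "很", "美"], 2, "n"),
]
TRAIN = [(features(tokens, i), tag) for tokens, i, tag in TOY_SENTENCES]

if __name__ == "__main__":
    print(knn_tag(TRAIN, features(["你", "花", "钱"], 1)))
    print(knn_tag(TRAIN, features(["两", "朵", "花"], 2)))
```

A real system would use richer features (wider windows, neighboring part-of-speech tags) and a weighted distance; the sketch only shows how context features drive the vote.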
First, a set of rules is built from the characteristics of the multi-category words and of their contexts. Second, the rules are tested on the corpus with annotation tools to find problems in the rules; the rules are then revised and retested repeatedly to improve their recognition accuracy. The experimental results show that applying rule-based methods to the words that are not suited to statistical methods yields better recognition accuracy. Finally, the thesis summarizes this study and outlines directions for future work.
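The build, test, and revise cycle described above can be sketched as a small rule set evaluated against an annotated sample. This is a hedged illustration only: the rules, the BNF-style notation in the comments, the word lists, and the three-sentence corpus are hypothetical and are not taken from the thesis.

```python
# Hypothetical BNF-style description of two uses of the word "花":
#   <noun-use> ::= <measure-word> "花"
#   <verb-use> ::= "花" <expense-noun>
MEASURE_WORDS = {"朵", "束"}   # assumed word list
EXPENSE_NOUNS = {"钱", "时间"}  # assumed word list

def rule_tag(tokens, i):
    """Apply the context rules in order; fall back to noun when none fires."""
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    if prev_tok in MEASURE_WORDS:
        return "n"
    if next_tok in EXPENSE_NOUNS:
        return "v"
    return "n"  # assumed default tag

def evaluate(corpus):
    """Return (accuracy, errors) for the rules on (tokens, index, gold_tag) items."""
    errors = [(tokens, i, gold) for tokens, i, gold in corpus
              if rule_tag(tokens, i) != gold]
    accuracy = 1 - len(errors) / len(corpus)
    return accuracy, errors

# Invented annotated sample standing in for the tagged corpus.
CORPUS = [
    (["一", "朵", "花"], 2, "n"),
    (["他", "花", "钱"], 1, "v"),
    (["我", "花", "了", "时间"], 1, "v"),  # the particle "了" blocks the verb rule
]

if __name__ == "__main__":
    accuracy, errors = evaluate(CORPUS)
    print(accuracy)
    for tokens, i, gold in errors:
        print("missed:", " ".join(tokens), "expected", gold)
```

The error list is what drives each revision round: here the missed sentence suggests extending the verb rule to look past the aspect particle 了, after which the rules would be retested.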
Keywords/Search Tags: Chinese information processing, multi-category word, conditional random fields, maximum entropy, k-nearest neighbor