
Research On The Identification For Chinese Named Entity Based On Combination Of Rules And Statistic Analysis

Posted on: 2013-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: P Yan
Full Text: PDF
GTID: 2248330377958330
Subject: Signal and Information Processing
Abstract/Summary:
Chinese named entity recognition is a foundational task in Chinese information processing. It is a key technique in many applications, such as text understanding, text proofreading, text clustering, text mining, text filtering, information extraction and machine translation. Research on Chinese named entity recognition is therefore important for lexical analysis, syntactic analysis, semantic analysis and Chinese information processing in general.

This paper is concerned primarily with the automatic recognition of Chinese personal names in contemporary Chinese text. We carry out a statistical analysis of a personal name sample set and a personal name corpus, with particular emphasis on the statistical regularities of the contexts of the first 300 surnames when they occur as single words, and on the part-of-speech regularities of each surname. The paper presents a Chinese named entity recognition system that combines statistics-based and rule-based methods. The main work is as follows.

The paper analyzes the difficulties of Chinese personal name recognition, introduces the existing approaches and compares them. We then build several linguistic resources: a personal name sample set, a surname set and a personal name corpus. From a statistical analysis of these resources we also build a personal name word list, a probability list of surnames, a list of personal name context information, and prefix and suffix lists for surnames, all of which are needed to recognize personal names in text. The recognition model is implemented in two stages. First, the text is preprocessed: the dictionary used by the reverse maximum matching algorithm is improved to increase segmentation speed. Second, personal names are identified by combining probability statistics with rules.

To handle the overlapping (intersection) ambiguities that arise, mutual information is introduced into the algorithm, which solves the automatic identification problem under certain conditions. The improved name recognition method thereby improves the performance of the word segmentation system. Tests show that the model achieves relatively high precision and recall in named entity recognition, and that it allows a Chinese syntactic analysis system to correctly analyze sentences that contain named entities. Overall, the model has both research significance and practical value.
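
The two segmentation ideas named in the abstract, reverse maximum matching and mutual information for overlapping ambiguities, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the `max_len` parameter, the toy dictionary and the rule in `resolve_overlap` (attach the shared character to the neighbor with the higher mutual information) are all assumptions made for illustration, and the abstract does not describe the improved dictionary structure, so a plain set is used here.

```python
import math


def reverse_max_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word each time."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first and shrink it until it is found
        # in the dictionary or only a single character is left.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def mutual_information(pair_count, left_count, right_count, total):
    """Pointwise mutual information log2(P(xy) / (P(x) * P(y))) from raw counts."""
    return math.log2((pair_count * total) / (left_count * right_count))


def resolve_overlap(a, b, c, mi):
    """Hypothetical rule for an overlapping string a+b+c where a+b and b+c
    are both words: keep b with the neighbor it binds to more strongly."""
    return [a + b, c] if mi(a, b) >= mi(b, c) else [a, b + c]


# Toy dictionary (assumed); a forward scan would pick "研究生" first.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(reverse_max_match("研究生命的起源", toy_dict))
# ['研究', '生命', '的', '起源']
```

In this example the string "研究生命" is an overlapping ambiguity, because both "研究生" and "生命" are dictionary words sharing the character "生". Scanning from the right happens to pick the intended split; where it does not, a score such as `mutual_information` computed from corpus counts could be passed to `resolve_overlap` to choose between the two readings.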
Keywords/Search Tags:named entity, rules, probability statis