
Research on Chinese Word Segmentation Based on Statistics and Rules

Posted on: 2009-08-16  Degree: Master  Type: Thesis
Country: China  Candidate: J F Zhang  Full Text: PDF
GTID: 2178360272463573  Subject: Computer application technology
Abstract/Summary:
With the arrival of the information age, computers have become increasingly important in people's lives. Corpora are now used heavily, and demanded widely, in machine translation, speech recognition, information retrieval, and many other areas. Word segmentation is the primary task in building Chinese corpus resources. As research on Chinese information processing deepens, Chinese word segmentation has attracted considerable attention and become a frontier subject of the field. After several decades of research, Chinese word segmentation has achieved remarkable success, and several practical segmentation systems with high accuracy and speed have appeared. However, whether measured by academic standards or by practical needs, a gap still remains.

In this paper, we propose a Chinese word segmentation method based on a combination of rules and statistics, taking the training corpus as the object of study. We classify the problems encountered in Chinese segmentation and, by combining rules with statistics, optimize the segmentation result step by step until a satisfactory final result is obtained. The main work of this paper includes the following parts:

1. Through analysis and statistics over a massive training corpus, we built an ambiguity database and analyzed the internal features and contexts of ambiguous strings, establishing the linguistic foundation for ambiguity resolution. At the same time, we analyzed the ambiguities and built separate databases of pseudo-ambiguity and true-ambiguity.

2. We counted and analyzed the different linguistic phenomena and rules of pseudo-ambiguity and true-ambiguity, and made a further classification that supports both the disambiguation strategy and the construction of the probabilistic model. In addition, we used the Tongyici Cilin as the semantic resource when building the probabilistic model.

3. Through analysis and statistics over a massive real-world corpus, we extracted unknown words, then extracted and counted their internal information, built an unknown-word database, and used this internal information to build the probabilistic model for unknown word recognition.

4. We extracted practical rules for unknown word recognition and built a rule database to further improve the recognition results.

We trained and evaluated the model on the Microsoft Research (MSR) corpus provided for SIGHAN 2005 and found that the strategy handles ambiguity well. To further examine the method's validity, we participated in the Fourth SIGHAN Bakeoff. The experiments show satisfying performance, with an RIV measure of 96.8% in the NCC open test of the SIGHAN Bakeoff 2007.
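As a rough illustration of the rule-plus-statistics idea summarized above, the sketch below detects overlapping-ambiguity strings by comparing forward and backward maximum matching and resolves any disagreement with a simple unigram probability model. This is a minimal sketch for illustration only, not the system described in the thesis; the lexicon, word frequencies, and example sentence are hypothetical.

```python
import math

# Toy lexicon with hypothetical word frequencies (illustration only).
LEXICON = {"研究生": 20, "研究": 500, "生命": 300, "命": 80, "的": 5000, "起源": 60}
TOTAL = sum(LEXICON.values())

def max_match(sentence, max_len=4, reverse=False):
    """Greedy longest-match segmentation, forward or backward."""
    text = sentence[::-1] if reverse else sentence
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            cand = cand[::-1] if reverse else cand
            if length == 1 or cand in LEXICON:  # single chars always accepted
                words.append(cand)
                i += length
                break
    return words[::-1] if reverse else words

def log_prob(words):
    """Unigram log-probability with add-one smoothing for unseen words."""
    return sum(math.log((LEXICON.get(w, 0) + 1) / (TOTAL + len(LEXICON)))
               for w in words)

def segment(sentence):
    """If the two matching directions disagree (a candidate ambiguity
    string), keep the segmentation the statistical model prefers."""
    fwd = max_match(sentence)
    bwd = max_match(sentence, reverse=True)
    if fwd == bwd:
        return fwd
    return max(fwd, bwd, key=log_prob)

if __name__ == "__main__":
    print(segment("研究生命的起源"))
    # -> ['研究', '生命', '的', '起源'] under the toy frequencies above
```

In a fuller system, the disagreement between the two passes would also be checked against rule databases (for pseudo-ambiguity) before falling back to the probabilistic score; the example keeps only the statistical step for brevity.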
Keywords/Search Tags:Chinese Segmentation, Pseudo-ambiguity, True-ambiguity, Probabilistic Model, Unknown Words Recognition