
Based On Rules And Statistics Of Chinese Automatic Word Segmentation

Posted on: 2011-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: D Li
Full Text: PDF
GTID: 2208330332973054
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, digital information is increasing rapidly, and people are paying more and more attention to Chinese information processing systems. At the same time, modern Chinese has become more and more significant. Automatic Chinese word segmentation and named entity recognition are basic research topics in natural language processing and computational linguistics, and their research and application have great theoretical and practical significance. They benefit many applied areas, such as machine translation, semantic analysis, parsing, speech recognition, information retrieval, and information filtering. Automatic natural language processing has therefore become indispensable.

Compared with other languages, automatic segmentation and named entity recognition in Chinese have their own difficulties. We consider that two factors affect the speed of automatic word segmentation: (1) syllables whose meaning differs from word to word; (2) proper nouns such as Chinese personal names, place names, and organization names. At present, the results of automatic Chinese segmentation and named entity recognition are still not satisfactory. In this thesis, Chinese word segmentation and Chinese name recognition are studied separately. The thesis presents a Chinese word segmentation algorithm combined with word frequency, and a method of Chinese name recognition based on Support Vector Machines (SVM) and transformation-based error-driven learning.

Automatic Chinese segmentation is an important step in Chinese information processing and the foundation of many of its application fields. At present, three main approaches are used for automatic Chinese segmentation: rule-based methods, statistical methods, and understanding-based methods.
By analyzing the existing automatic segmentation methods, this thesis focuses on rule-based and statistical methods and presents a Chinese word segmentation algorithm combined with word frequency. The method first segments each short sentence by length priority combined with word frequency. If any unmatched strings remain in the short sentence, they are segmented with the improved maximum matching method and the reverse maximum matching method, combined with entropy rate. Experimental results show that the algorithm improves the accuracy of word segmentation.

Recognition of Chinese personal names is a key and difficult part of unknown word recognition. Solving it effectively improves the precision of unknown word recognition. The thesis presents a method of Chinese name recognition based on Support Vector Machines (SVM) and transformation-based error-driven learning, in which the transformation-based learning approach is used to correct the identification results of the SVM. Transformation rules deal effectively with special cases of language phenomena and improve the performance of the SVM. Experiments show that the method is efficient in identifying personal names in Chinese texts. In the open test, precision, recall, and F-measure are all improved.
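
To make the matching step concrete, the following is a minimal Python sketch of plain forward and reverse maximum matching with a word-frequency tie-break. It is an illustration under stated assumptions, not the thesis's algorithm: the toy lexicon and its frequency counts are invented, the function names (`forward_mm`, `reverse_mm`, `segment`) are hypothetical, the rule "prefer fewer words, then higher total frequency" is one common heuristic chosen here for illustration, and the entropy-rate component and the thesis's improvements to maximum matching are not modeled.

```python
from typing import Dict, List

# Hypothetical toy lexicon with made-up frequency counts (not from the thesis).
LEXICON: Dict[str, int] = {
    "研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 100, "起源": 30,
}
MAX_LEN = max(len(word) for word in LEXICON)


def forward_mm(text: str, lexicon: Dict[str, int] = LEXICON,
               max_len: int = MAX_LEN) -> List[str]:
    """Forward maximum matching: scan left to right, longest match first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def reverse_mm(text: str, lexicon: Dict[str, int] = LEXICON,
               max_len: int = MAX_LEN) -> List[str]:
    """Reverse maximum matching: scan right to left, longest match first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def segment(text: str, lexicon: Dict[str, int] = LEXICON) -> List[str]:
    """Pick between the two results: fewer words first, then higher
    total word frequency (an assumed heuristic, not the thesis's rule)."""
    fwd, rev = forward_mm(text, lexicon), reverse_mm(text, lexicon)
    if fwd == rev:
        return fwd

    def score(words: List[str]):
        return (-len(words), sum(lexicon.get(w, 0) for w in words))

    return max((fwd, rev), key=score)


if __name__ == "__main__":
    print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

On this toy input the forward pass greedily takes "研究生" and leaves "命" stranded, while the reverse pass yields "研究 / 生命 / 的 / 起源". Both results contain four words, so the frequency total decides in favor of the reverse result, which shows why combining the two scanning directions with word frequency can resolve some ambiguous strings.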
Keywords/Search Tags:Chinese Segmentation, mechanical matching method, Chinese names recognition, support vector machines, transformation-based error-driven learning