Chinese Automatic Word Segmentation Based On Rules And Statistics

Posted on:2011-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:D Li

Full Text:PDF

GTID:2208330332973054

Subject:Computer software and theory

Abstract/Summary:

With the development of the Internet, digital information is increasing rapidly, and people are paying more and more attention to Chinese information processing systems. At the same time, modern Chinese has become more and more significant. Automatic Chinese word segmentation and named entity recognition are basic research topics in natural language processing and computational linguistics, and their research and application have great theoretical and practical significance. Research on them benefits many applied areas, such as machine translation, semantic analysis, parsing, speech recognition, information retrieval, and information filtering. Automatic natural language processing has therefore become indispensable.

Compared with other languages, automatic Chinese word segmentation and named entity recognition have their own difficulties. We consider that two factors affect the speed of automatic word segmentation: (1) ambiguity, where the same string of characters can be read with different meanings; and (2) proper nouns, such as Chinese personal names, place names, and department names. At present, the results of automatic Chinese word segmentation and named entity recognition are still not quite satisfactory. In this paper, Chinese word segmentation and Chinese name recognition are studied separately. We present a Chinese word segmentation algorithm combined with word frequency, and a method of Chinese name recognition based on Support Vector Machines (SVM) and transformation-based error-driven learning.

Automatic Chinese word segmentation is an important step in Chinese information processing and the foundation of many of its application fields. At present, three main kinds of methods are used for it: rule-based methods, statistical methods, and understanding-based methods. After analyzing the existing automatic segmentation methods, this paper focuses on the rule-based and statistical methods and presents a Chinese word segmentation algorithm combined with word frequency. The method first segments each short sentence by length priority combined with word frequency. If any unmatched strings remain in the short sentence, the improved maximum matching method and the reverse maximum matching method, combined with entropy rate, are applied to segment them. Experimental results show that the algorithm improves the accuracy of word segmentation.

Recognition of Chinese personal names is both a focus and a difficulty of unknown word recognition; solving it effectively improves the precision of unknown word recognition. The paper presents a method of Chinese name recognition based on SVM and transformation-based error-driven learning, in which the transformation-based learning approach is used to correct the identification results of the SVM. The transformation rules effectively handle special cases of language phenomena and improve the performance of the SVM. Experiments show that the method is efficient in identifying personal names in Chinese texts. In the open test, precision, recall, and F-measure are all improved.
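
The combination of maximum matching, reverse maximum matching, and word frequency described in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract resolves disagreements with entropy rate, while the sketch below swaps in a simpler rule (fewer words first, then a higher sum of word frequencies). The function names and the toy dictionary `TOY_FREQ` with its counts are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical toy dictionary: word -> made-up frequency count.
TOY_FREQ: Dict[str, int] = {
    "研究": 50,
    "研究生": 10,
    "生命": 40,
    "命": 5,
    "起源": 20,
}


def forward_max_match(text: str, freq: Dict[str, int], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in freq:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, freq: Dict[str, int], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in freq:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def segment(text: str, freq: Dict[str, int]) -> List[str]:
    """Run both scans; if they disagree, pick one by a simple score."""
    max_len = max(map(len, freq), default=1)
    fwd = forward_max_match(text, freq, max_len)
    bwd = backward_max_match(text, freq, max_len)
    if fwd == bwd:
        return fwd

    # Assumed tie-break (not the thesis's entropy-rate criterion):
    # prefer fewer words, then the larger total word frequency.
    def score(words: List[str]):
        return (-len(words), sum(freq.get(w, 0) for w in words))

    return max(fwd, bwd, key=score)
```

With the toy dictionary, `segment("研究生命起源", TOY_FREQ)` compares the forward result `["研究生", "命", "起源"]` with the backward result `["研究", "生命", "起源"]`. Both have three words, so the frequency sum decides in favor of the backward result.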

Keywords/Search Tags:

Chinese Segmentation, mechanical matching method, Chinese names recognition, support vector machines, transformation-based error-driven learning

Related items

1	Chinese Organization Names Recognition Based On Support Vector Machine
2	A Support Vector Machine And Transformation-based Error Driven Learning Method For Biological Entity Recognition
3	Study On The Automatic Chinese Word Segmentation With Chinese Names Recognition Function
4	Chinese Forum Punctuation Extraction And Recognition
5	The Extraction Of Synonyms And Hyponyms Based On Multi-resources And Their Application In Chinese Names Disambiguation
6	Study On Recognition Of Off-line Similar Handwritten Chinese Characters Based On Support Vector Machines
7	The Design And Implementation Of A Chinese Organization Names Retrieval System
8	The Research On The Technology Of Chunk Recognition And Its Implementation
9	Research And Application System Design For Handwriting Chinese Character Recognition
10	Research On The Recognition Of Focus Word In Chinese Question