Research on Chinese and English Mixed Word Segmentation Technology Based on String Matching

Posted on: 2012-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
GTID: 2218330374953975
Subject: Computer software and theory
Abstract/Summary:
Word segmentation is the process of cutting a string of natural language into words. It is a fundamental step in natural language processing. Current Chinese word segmentation technology is designed for purely Chinese text. When traditional Chinese word segmentation techniques are applied to text that mixes Chinese and English, the segmentation results are not satisfactory. It is therefore necessary to study Chinese and English mixed word segmentation.

First, this paper introduces four basic word segmentation algorithms and discusses the key problems of word segmentation together with their solutions. It then introduces the evaluation system for word segmentation systems and proposes two evaluation indexes for Chinese and English mixed word segmentation systems.

Second, this paper determines the technology and strategy of the Chinese and English mixed word segmentation model through several experiments:

First, comparative experiments were carried out on segmentation dictionary mechanisms and on word segmentation algorithms based on string matching. On this basis, the paper proposes solutions for Chinese and English mixed word segmentation based on string matching, which lay the foundation for the model.

Second, comparative experiments were carried out on four segmentation dictionary mechanisms: whole-word binary search, TRIE index tree, character-by-character binary search, and double-character hash index. The double-character hash index was selected as the dictionary mechanism for the model.

Third, comparative experiments were carried out on purely Chinese text and on Chinese and English mixed text using the Forward Maximum Match Method and the Reverse Maximum Match Method. The Reverse Maximum Match Method was selected for the model.

Fourth, this paper improves the Reverse Maximum Match Method. The length of the string to be processed is compared with the maximum word length of the dictionary entries under the corresponding double-character hash key, and the smaller value is used as the maximum word length for Reverse Maximum Match. This approach effectively reduces the number of match attempts during segmentation and improves segmentation efficiency.

This paper also analyzes and discusses several key issues in segmentation, including disambiguation and the identification of unknown words. A new disambiguation algorithm is proposed. In segmentation experiments on the People's Daily corpus, this disambiguation algorithm achieves an accuracy of about 96.50%. A maximum probability method is then used to identify Chinese names.

Based on the above ideas, this paper establishes a Chinese and English mixed word segmentation model based on string matching. The model provides the following functions: adding dictionaries, Chinese and English mixed word segmentation, and a reserved interface for extending the word segmentation algorithms. Finally, the model is evaluated with the evaluation indexes of the word segmentation system. The evaluation data indicate that the model has some reference value.
Keywords/Search Tags:String Match, Word Segmentation, Algorithm, Reverse Maximum Match Method, Disambiguation, Model