Chinese Segmentation Algorithm Research Based On Special Identifier

Posted on: 2011-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: L L Li
Full Text: PDF
GTID: 2178360305488627
Subject: Computer application technology
Abstract/Summary:
Chinese information processing is a large and laborious engineering effort, and Chinese word segmentation is both its foundation and one of its most important components. For a computer to understand Chinese, it must first identify sentences effectively and correctly; to understand a sentence correctly, it must segment that sentence into the correct and appropriate words. Only on the basis of word-level understanding can applications such as Chinese machine translation, machine annotation, and intelligent dialogue be realized. Comparing Chinese and English text makes the difficulty clear: an English sentence is composed of words, and adjacent words are separated by spaces, whereas a Chinese sentence is composed of individual characters written one after another to express a complete meaning. A computer can easily identify an English word and act on it, but a Chinese sentence made up of single characters must first be processed by word segmentation, which turns it into a sequence of words, before it can be understood. This thesis studies Chinese word segmentation in the following steps.
First, the thesis briefly introduces the research background, the main research content, and the significance of the topic, setting a clear direction for the subsequent work. It then introduces Chinese word segmentation and related technologies, reviews the current state of development in China and abroad, cites a number of typical segmentation algorithms and segmentation systems, and defines the basic concepts of Chinese word segmentation, providing a foundation for the in-depth study that follows.
Second, building on the results of earlier researchers, the thesis analyzes and compares existing segmentation techniques and systems, identifies their respective advantages and disadvantages, and points out the difficulties that Chinese word segmentation faces. Based on this analysis, the author proposes a segmentation method based on special identifiers. Drawing on an analysis of Chinese parts of speech and of how characters behave with respect to part of speech, and on an extensive review of the literature, the author summarizes and constructs a set of special identifiers for Chinese, which lays the foundation for researching and implementing the segmentation method.
Furthermore, the thesis analyzes and compares existing segmentation dictionary structures to understand the advantages and disadvantages of the various dictionary mechanisms. Taking into account several characteristics of how Chinese words occur, the author proposes an improved dictionary structure that uses two-character words as its root, describes the structure in detail, and explains its advantages for segmentation over the compared structures.
Finally, the author combines the special identifier set with the improved dictionary structure and tests the resulting segmentation method in a laboratory environment. Using the SOUGOU training corpus as the experimental text, segmentation experiments are run on this system and on other segmentation systems, and the results are checked manually to compare the accuracy and speed of segmentation.
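The abstract does not give the algorithm or the dictionary layout in detail, so the following is only a minimal sketch of the general idea under stated assumptions: special identifiers act as cut points, and the text between them is matched against a dictionary indexed by the first two characters of each word. The identifier set, the toy word list, and all function names here are hypothetical and are not taken from the thesis; forward longest matching is used as a stand-in for whatever matching strategy the author actually chose.

```python
from collections import defaultdict

# Hypothetical identifier set: punctuation plus a few function characters.
# The thesis builds its own set from part-of-speech analysis.
SPECIAL_IDENTIFIERS = set("，。！？、的了和")


def build_index(words):
    """Group dictionary words by their first two characters (assumed layout)."""
    index = defaultdict(set)
    for word in words:
        if len(word) >= 2:
            index[word[:2]].add(word)
    return index


def split_on_identifiers(text):
    """Cut text at special identifiers; each identifier is its own chunk."""
    chunks, buffer = [], []
    for ch in text:
        if ch in SPECIAL_IDENTIFIERS:
            if buffer:
                chunks.append("".join(buffer))
                buffer = []
            chunks.append(ch)
        else:
            buffer.append(ch)
    if buffer:
        chunks.append("".join(buffer))
    return chunks


def segment_chunk(chunk, index):
    """Forward longest match; unknown characters become single-character tokens."""
    tokens, i = [], 0
    while i < len(chunk):
        candidates = index.get(chunk[i:i + 2], ())
        best = max(
            (w for w in candidates if chunk.startswith(w, i)),
            key=len,
            default=chunk[i],
        )
        tokens.append(best)
        i += len(best)
    return tokens


def segment(text, index):
    """Split at identifiers, then dictionary-match each remaining chunk."""
    tokens = []
    for chunk in split_on_identifiers(text):
        if chunk in SPECIAL_IDENTIFIERS:
            tokens.append(chunk)
        else:
            tokens.extend(segment_chunk(chunk, index))
    return tokens


index = build_index(["计算机", "计算", "理解", "中文", "句子"])
print(segment("计算机理解中文的句子。", index))
```

Cutting at identifiers first shortens the spans that need dictionary lookup, and the two-character key narrows each lookup to a small candidate set. A real identifier set would have to handle characters that also occur inside words (for example 的 in 目的), which this sketch does not attempt.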
Keywords/Search Tags: Chinese word segmentation, word segmentation algorithm, special identifier, vocabulary