Research On Chinese Word Segmentation Technology With Word Length And Rule Algorithm

Posted on: 2014-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
Full Text: PDF
GTID: 2268330425990644
Subject: Computer technology
Abstract/Summary:
With the rapid development of computer software, hardware, and networks, people have entered the information age. In the information society, the importance of information grows by the day: individuals, companies, and governments all need to obtain and master large amounts of useful information. In this environment, Chinese information processing has become an important research area, and Chinese word segmentation is one of its most important technologies. Unlike English, Chinese text is written as a continuous string of characters. Apart from punctuation, there are no explicit delimiters between words, so the writing itself marks no word boundaries. Segmentation is therefore the primary problem of Chinese information processing.
Building on a study of existing Chinese word segmentation algorithms and technologies, this paper proposes a hash dictionary mechanism with word length to improve the matching efficiency of long words. It also proposes an ambiguity resolution mechanism consisting of a verb decision algorithm and a priority decision algorithm, which improves the accuracy of ambiguity resolution. The main contents of this paper are as follows.
(1) The research background and significance of Chinese word segmentation are studied, together with segmentation technology and the major existing algorithms. The main existing Chinese word segmentation algorithms are dictionary-based segmentation, statistics-based segmentation, rule-based segmentation, and understanding-based segmentation.
(2) Several current dictionary mechanisms are examined: the whole-word binary search mechanism, the TRIE tree mechanism, the character-by-character binary search mechanism, and the hash mechanism. After comparing these mechanisms in terms of both storage space and search time, this paper adopts the hash dictionary mechanism and proposes a hash dictionary mechanism with word length to improve the matching efficiency of long words.
(3) The main problem of Chinese word segmentation, ambiguity, is studied. Ambiguity includes overlapping ambiguity, combinational ambiguity, and true ambiguity. This paper introduces these three kinds of ambiguity and three ambiguity detection algorithms: the bilateral (bidirectional) maximum scanning method, the maximum matching word-by-word scanning method, and the longest-word long-term discovery algorithm. This paper uses the bilateral maximum scanning method. It improves the back word combination algorithm and adds a verb decision algorithm and a priority decision algorithm for ambiguity resolution, which improve the accuracy of ambiguity resolution to a certain extent.
(4) Using the VC++ 6.0 integrated development environment, a Chinese word segmentation system is designed and implemented based on the proposed algorithms, and the working principle of each module in the system architecture is described.
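The abstract does not give the layout of the hash dictionary or the details of the scanning step, so the following is only a minimal sketch under stated assumptions. It assumes the dictionary hashes on a word's first character and then groups entries by word length, so that only lengths that actually occur are tried, longest first. It also assumes that bilateral maximum scanning flags an ambiguity whenever forward and backward maximum matching disagree. The names `LengthBucketedDict`, `forward_mm`, and `backward_mm` and the sample word list are hypothetical, and Python is used for brevity even though the thesis system was built with VC++ 6.0.

```python
from collections import defaultdict


class LengthBucketedDict:
    """Hypothetical hash dictionary: first character -> word length -> set of words."""

    def __init__(self, words):
        self.buckets = defaultdict(lambda: defaultdict(set))
        for w in words:
            if w:  # skip empty entries
                self.buckets[w[0]][len(w)].add(w)

    def longest_prefix(self, text, start):
        """Longest dictionary word beginning at text[start]; falls back to one character."""
        by_len = self.buckets.get(text[start], {})
        # Try only the lengths that exist for this first character, longest first.
        for n in sorted(by_len, reverse=True):
            cand = text[start:start + n]
            if cand in by_len[n]:
                return cand
        return text[start]


def forward_mm(text, d):
    """Forward maximum matching: greedily take the longest word from the left."""
    out, i = [], 0
    while i < len(text):
        w = d.longest_prefix(text, i)
        out.append(w)
        i += len(w)
    return out


def backward_mm(text, rev_d):
    """Backward maximum matching, done as forward matching over reversed text.

    rev_d must be a LengthBucketedDict built from the reversed words.
    """
    tokens = forward_mm(text[::-1], rev_d)
    return [t[::-1] for t in reversed(tokens)]


# Assumed sample data for illustration only.
words = ["研究", "研究生", "生命", "的", "起源"]
d = LengthBucketedDict(words)
rev_d = LengthBucketedDict(w[::-1] for w in words)

text = "研究生命的起源"
fwd = forward_mm(text, d)       # ['研究生', '命', '的', '起源']
bwd = backward_mm(text, rev_d)  # ['研究', '生命', '的', '起源']
ambiguous = fwd != bwd          # True: the two scans disagree
```

In this sketch a disagreement between the two scans only marks the span as ambiguous. Choosing between the candidate segmentations is the job of the verb decision and priority decision rules, which the abstract names but does not specify.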
Keywords/Search Tags: Chinese word segmentation, Chinese information processing, Dictionary with word length, Rule algorithm