Research On Chinese Word Segmentation Of Search Engine

Posted on: 2012-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Ren
Full Text: PDF
GTID: 2218330344450976
Subject: Computer application technology
Abstract/Summary:
Chinese word segmentation is one of the most widely used techniques in natural language processing and a fundamental component of information retrieval and search engines. Segmentation algorithms proposed in recent years include methods based on characters, probability, understanding, paths, and semantics, of which the character-based approach is the most widely used. However, because of the particular nature and complexity of the Chinese language, segmentation faces two main problems: ambiguity and unlisted (unknown) words. A good Chinese word segmentation method therefore needs an efficient dictionary mechanism and must accurately identify both ambiguous words and unknown words.

This thesis studies existing Chinese word segmentation algorithms, dictionary mechanisms, and strategies for handling ambiguous and unknown words, and then proposes a corpus-based segmentation method using a corpus drawn from People's Daily. First, a segmentation algorithm that combines reverse maximum matching with probability is shown to segment Chinese text well. To address the shortcomings of existing dictionary mechanisms, the thesis proposes a dictionary mechanism based on a finite-state automaton, which improves both space complexity and time complexity. Second, the thesis investigates the main problem, the identification of ambiguity and unlisted words, using word patches, rules, and corpora.

Building on this work on the segmentation algorithm, the dictionary mechanism, and the handling of ambiguity and unlisted words, the thesis designs a prototype system comprising text extraction, corpus training, word processing, and testing. The system's performance was validated experimentally on the People's Daily corpus: it reaches a precision of 96% at a speed above 1,200 words per second.

Finally, the thesis summarizes the work and lays a basis for further research. In short, it analyses Chinese word segmentation in terms of the segmentation algorithm, the dictionary mechanism, and the identification of ambiguity and unlisted words. The proposed method should be helpful for future study.
Keywords/Search Tags: Chinese Word Segmentation, Dictionary Mechanism, Maximum Match, Search Engine, Unknown Words Recognition
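
As an illustration of two ideas named in the abstract, the following is a minimal Python sketch of reverse maximum matching over a dictionary stored as a character trie, which behaves as a simple deterministic finite-state automaton. It is not the thesis's implementation: the class and function names (`TrieNode`, `ReversedTrie`, `reverse_maximum_match`) and the toy dictionary are assumptions made for this example, and the probability model and the handling of ambiguous and unknown words described in the abstract are omitted. Characters not covered by the dictionary simply fall back to single-character tokens.

```python
from typing import Dict, List


class TrieNode:
    """One state of the dictionary automaton (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # accepting state: the path from the root spells a word


class ReversedTrie:
    """Dictionary stored as a trie over reversed words, so it can be walked right to left."""

    def __init__(self, words: List[str]) -> None:
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in reversed(word):
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def longest_suffix_match(self, text: str, end: int) -> int:
        """Return the length of the longest dictionary word ending at text[:end], or 0."""
        node = self.root
        best = 0
        i = end - 1
        while i >= 0 and text[i] in node.children:
            node = node.children[text[i]]
            if node.is_word:
                best = end - i  # remember the longest accepting state seen so far
            i -= 1
        return best


def reverse_maximum_match(text: str, trie: ReversedTrie) -> List[str]:
    """Segment text by scanning from the end and taking the longest match each time."""
    tokens: List[str] = []
    end = len(text)
    while end > 0:
        # Fall back to a single character when no dictionary word ends here.
        length = trie.longest_suffix_match(text, end) or 1
        tokens.append(text[end - length:end])
        end -= length
    tokens.reverse()  # tokens were collected right to left
    return tokens


# Toy dictionary, assumed for illustration only.
toy_trie = ReversedTrie(["研究", "研究生", "生命", "命", "的", "起源"])
print(reverse_maximum_match("研究生命的起源", toy_trie))
# → ['研究', '生命', '的', '起源']
```

On this input a forward (left-to-right) maximum match would first take "研究生" and leave "命" stranded, whereas scanning from the right yields the intended segmentation. This is one commonly cited reason for preferring the reverse direction, although it does not resolve ambiguity in general.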