
The Research Of Chinese Word Segmentation Algorithm Based On Dictionary And Probability Statistics

Posted on: 2012-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: A Y He
GTID: 2178330338454381
Subject: Computer software and theory
Abstract/Summary:
In Chinese natural language processing, word segmentation is the first step of text analysis. Current segmentation methods fall into three kinds: dictionary-based methods, methods based on probability statistics, and understanding-based methods. The understanding-based methods are not yet mature, and today the combination of statistical and dictionary methods is the more popular approach. The two difficult problems in Chinese word segmentation are unknown word recognition and ambiguity processing.

For unknown word recognition, many segmentation systems in recent years have added a separate identification module with rules for recognizing unknown words. Research on named entities such as person names, place names and organization names has achieved good results. However, new words from the web follow no special rules and cannot be recognized this way, and these words lower the accuracy of a segmentation system. For ambiguity segmentation, accuracy has improved in recent years, but the problem still urgently needs to be solved. This thesis therefore combines statistical and dictionary methods to address both unknown word recognition and ambiguity segmentation.

The thesis makes two improvements. First, it takes a different approach to recognizing new words. We collect a large number of pages from different areas of the Internet and apply our recognition strategy to them, then add the recognized new words to the dictionary to expand its vocabulary. This is very effective for the unknown word problem in Chinese segmentation and ultimately improves the precision and recall of the segmentation system.

Second, we derive a reverse n-gram language model from the original n-gram language model and, on that basis, propose a language model built on a two-way 3-gram model. Finally, we add word information to the model, which improves system performance. The resulting model handles ambiguity in Chinese segmentation better. Experimental comparison shows that our system performs well in both speed and accuracy.
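
To make the idea of a two-way 3-gram model concrete, the following is a minimal sketch and not the thesis's implementation. It assumes that candidate segmentations are enumerated from a dictionary, that the reverse model is a 3-gram model trained on reversed sentences, and that the two directions are combined by adding their log-probabilities with add-one smoothing. The dictionary, the corpus and all names (`candidates`, `TrigramModel`, `score`, `segment`) are hypothetical, and the "word information" mentioned in the abstract is not modelled here.

```python
import math
from collections import Counter

# Toy dictionary and toy segmented corpus (hypothetical, for illustration only).
DICTIONARY = {"研究", "研究生", "生命", "命", "起源", "的"}
CORPUS = [
    ["研究", "生命", "起源"],
    ["生命", "的", "起源"],
    ["研究生", "研究", "生命"],
]

def candidates(text, dictionary, max_len=3):
    """Enumerate every segmentation of text that uses only dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(text)) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in candidates(text[end:], dictionary, max_len):
                results.append([word] + rest)
    return results

class TrigramModel:
    """Word-level 3-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, sentences):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.ctx = Counter()   # counts of the context (w1, w2)
        self.vocab = set()
        for sentence in sentences:
            padded = ["<s>", "<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.ctx[tuple(padded[i - 2:i])] += 1

    def log_prob(self, sentence):
        padded = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        v = len(self.vocab)
        total = 0.0
        for i in range(2, len(padded)):
            num = self.tri[tuple(padded[i - 2:i + 1])] + 1
            den = self.ctx[tuple(padded[i - 2:i])] + v
            total += math.log(num / den)
        return total

# Forward model reads left to right; the reverse model is the same
# estimator trained on the reversed sentences.
forward = TrigramModel(CORPUS)
backward = TrigramModel([s[::-1] for s in CORPUS])

def score(segmentation):
    """Sum of forward and reverse log-probabilities (an assumed combination)."""
    return forward.log_prob(segmentation) + backward.log_prob(segmentation[::-1])

def segment(text):
    """Pick the best-scoring dictionary segmentation.

    Raises ValueError if the dictionary cannot cover the text.
    """
    return max(candidates(text, DICTIONARY), key=score)

print(" / ".join(segment("研究生命起源")))
```

On this toy data the dictionary yields two candidates for 研究生命起源, namely 研究 / 生命 / 起源 and 研究生 / 命 / 起源, and the combined score selects the first. Exhaustive enumeration is used only to keep the sketch short; a practical system would search a word lattice with dynamic programming instead.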
Keywords/Search Tags: Chinese word segmentation, unknown word, ambiguity processing, language model, dictionary, probability statistics