The Research Of Chinese Word Segmentation Algorithm Based On Dictionary And Probability Statistics

Posted on:2012-12-25

Degree:Master

Type:Thesis

Country:China

Candidate:A Y He

GTID:2178330338454381

Subject:Computer software and theory

Abstract/Summary:

Chinese word segmentation is the first step of text analysis in Chinese natural language processing. Current segmentation methods fall into three groups: dictionary-based methods, methods based on probability statistics, and understanding-based methods. Understanding-based methods are not yet mature, and approaches that combine statistics with a dictionary are the most widely used today. The two main difficulties in Chinese word segmentation are unknown word recognition and ambiguity processing.

For unknown words, many recent segmentation systems add a separate recognition module and define rules for it. Research on named entities such as person names, place names and organization names has achieved good results. New words from the web, however, follow no particular rules and therefore go unrecognized, which lowers the accuracy of a segmentation system. For ambiguous segmentation, accuracy has improved in recent years, but the problem remains open and urgently needs to be addressed. This thesis therefore combines statistical and dictionary methods to address both unknown word recognition and ambiguous segmentation. It makes two improvements.

First, the thesis takes a different approach to recognizing new words. We collect a large number of web pages from different domains of the Internet and apply our recognition strategy to them. The recognized new words are then added to the dictionary to expand its vocabulary. This is an effective way to handle unknown words, and it improves the precision and recall of the segmentation system.

Second, we derive a reverse n-gram language model from the original n-gram language model and, on that basis, propose a language model built on a two-way 3-gram model. Word information is then added to the model, which improves system performance and lets the model handle segmentation ambiguity better. Experimental comparison shows that our system performs well in both speed and accuracy.
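
The abstract does not give the details of the two-way 3-gram model, so the following Python sketch is only an illustration of the general idea, not the thesis's implementation. It trains one trigram model on word sequences read left to right and a second one on the same sequences reversed, then ranks candidate segmentations by the sum of the two log-probabilities. The function names, the add-one smoothing, the equal weighting of the two directions and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over padded word sequences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab)

def log_prob(words, model):
    """Log-probability of a word sequence under an add-one smoothed trigram model.

    Assumption: add-one smoothing over the training vocabulary size; words
    never seen in training are not given special treatment.
    """
    tri, ctx, vocab_size = model
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        numerator = tri[tuple(padded[i - 2:i + 1])] + 1
        denominator = ctx[tuple(padded[i - 2:i])] + vocab_size
        total += math.log(numerator / denominator)
    return total

def train_two_way(sentences):
    """Train a forward model and a reverse model (same data, word order reversed)."""
    forward = train_trigram(sentences)
    backward = train_trigram([list(reversed(s)) for s in sentences])
    return forward, backward

def two_way_score(words, models):
    """Sum of forward and reverse log-probabilities (equal weights are an assumption)."""
    forward, backward = models
    return log_prob(words, forward) + log_prob(list(reversed(words)), backward)

def best_segmentation(candidates, models):
    """Pick the candidate segmentation with the highest two-way score."""
    return max(candidates, key=lambda c: two_way_score(c, models))

# Toy segmented corpus (hypothetical data, far too small for real use).
corpus = [
    ["研究", "生命", "起源"],
    ["研究", "生命", "科学"],
    ["研究生", "学习"],
]
models = train_two_way(corpus)

# Two candidate segmentations of the ambiguous string "研究生命起源".
candidates = [
    ["研究", "生命", "起源"],
    ["研究生", "命", "起源"],
]
best = best_segmentation(candidates, models)
print("/".join(best))
```

In a full system the candidate segmentations would come from a dictionary lookup over the input string, and the counts would be estimated from a large segmented corpus with a more careful smoothing scheme.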

Keywords/Search Tags:

Chinese word segmentation, unknown word, ambiguity processing, language model, dictionary, probability statistics

Related items

1	Research And Implementation Of Chinese Word Segmentation Algorithm
2	Design And Implementation Of Chinese Word Segmentation Model Based On Combination Of Statistics And Rules
3	Reverse Backtracking Research Of Chinese Segmentation Based On Last Word Dictionary
4	Study On Disambiguation Algorithm For Chinese Word Segmentation
5	Research Of Combined Chinese Word Segmentation Method
6	Research And Implementation Of Chinese Word Segmentation Based On The Combination Of Statistics And Dictionary
7	Based On Dictionary And Word Frequency Analysis Of The Unknown Words From The Bbs Of Corpus Recognition Research
8	Research Of Chinese Word Segmentation Technology Applied In Police Information System
9	Study On Chinese Named Entity Recognition
10	A Statistics-Based Language Model Approach To Chinese Word Segmentation