Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words

Posted on:2011-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Yue

Full Text:PDF

GTID:2178360305483135

Subject:Computer application technology

Abstract/Summary:

With the spread of information technology, it has become easy to obtain large amounts of information. Processing such volumes manually is not feasible, however, so we must rely on computers. Unlike Western languages, Chinese has no explicit delimiters between words, so before a computer can process Chinese text, the text must first be segmented into words. Because of the complexity of Chinese syntax and the continual emergence of new words, Chinese word segmentation systems have not yet achieved satisfactory results. This thesis analyzes the Chinese word segmentation algorithms in practical use and several kinds of dictionary structure, and studies the current problems of Chinese word segmentation. We adopt a method that combines statistics with a dictionary and improve on it in several respects. First, we divide the whole text into shorter sentences according to its punctuation. In the statistical step, we collect statistics over the fragments left by segmentation, identify the unknown words that appear more than once in the text, and add them to a temporary dictionary. We also improve the dictionary structure, organizing it as a single basic dictionary plus extended dictionaries.
This thesis describes a Chinese word segmentation method based on statistics and a dictionary. We increase the number of dictionaries: besides the basic dictionary, we add several special dictionaries that are used to eliminate ambiguities and recognize new words during segmentation. We also rebuild the in-memory data structure of the basic dictionary using hash tables. The first two characters of every word in the basic dictionary serve as the keys of the main and sub hash tables, and the remaining parts of the words are stored in arrays according to length. With these data structures, whenever the program encounters a word, it can locate that word directly and quickly in the dictionary. Word frequency information is added to the dictionary for ambiguity resolution. The extended dictionaries include a quantifier dictionary, a name dictionary, a temporary dictionary, a stop-word (disabled-word) dictionary, and so on. Segmenting quantifiers correctly reduces the number of ambiguities. An improved mechanical segmentation method is then used for a second segmentation pass. Finally, rules are used to identify new words that appear only once. The method shows good ability in identifying new words and eliminating ambiguity, and basically meets the requirements of practical Chinese information processing.
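The punctuation split and the repeated-fragment statistics described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the punctuation set, the function names `split_clauses` and `repeated_fragments`, and the rule that a candidate unknown word is a run of two or more consecutive single-character tokens missing from the dictionary are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Assumed punctuation set; the abstract does not list the marks it uses.
_PUNCT = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_clauses(text):
    """Split a text into shorter clauses at punctuation marks."""
    return [clause for clause in _PUNCT.split(text) if clause]


def repeated_fragments(segmented_clauses, dictionary, min_count=2):
    """Collect candidate unknown words for a temporary dictionary.

    Assumption: a candidate is a run of two or more consecutive
    single-character tokens that are not dictionary words. Candidates
    seen at least `min_count` times in the text are returned.
    """
    counts = Counter()
    for tokens in segmented_clauses:
        run = []
        # The trailing None is a sentinel that flushes the last run.
        for tok in list(tokens) + [None]:
            if tok is not None and len(tok) == 1 and tok not in dictionary:
                run.append(tok)
            else:
                if len(run) >= 2:
                    counts["".join(run)] += 1
                run = []
    return {frag for frag, n in counts.items() if n >= min_count}
```

In this sketch, a fragment that occurs only once stays out of the temporary dictionary, which matches the abstract's statement that words appearing only once are handled separately by rules.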
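A minimal sketch of the two-level hash dictionary described above, assuming Python dicts stand in for the hash tables. The class name `TwoLevelDictionary`, the separate table for single-character words, the `max_word_len` limit, and the use of plain forward maximum matching are assumptions for illustration; the abstract does not specify its improved mechanical method, so the matcher below is a stand-in and not the thesis's algorithm.

```python
class TwoLevelDictionary:
    """Basic dictionary keyed by the first two characters of each word."""

    def __init__(self, max_word_len=8):
        # first char -> second char -> remainder length -> {remainder: freq}
        self.main = {}
        # Single-character words have no second key, so they are kept apart
        # (an assumption; the abstract does not say how they are stored).
        self.single = {}
        self.max_word_len = max_word_len

    def add(self, word, freq=1):
        """Store a word together with its frequency (freq should be >= 1)."""
        if len(word) == 1:
            self.single[word] = freq
            return
        sub = self.main.setdefault(word[0], {})
        by_len = sub.setdefault(word[1], {})
        rest = word[2:]
        by_len.setdefault(len(rest), {})[rest] = freq

    def frequency(self, word):
        """Return the stored frequency, or 0 if the word is absent."""
        if not word:
            return 0
        if len(word) == 1:
            return self.single.get(word, 0)
        rest = word[2:]
        by_len = self.main.get(word[0], {}).get(word[1], {})
        return by_len.get(len(rest), {}).get(rest, 0)

    def __contains__(self, word):
        return self.frequency(word) > 0


def forward_max_match(text, dictionary):
    """Plain forward maximum matching; unmatched characters become tokens."""
    tokens = []
    i = 0
    while i < len(text):
        longest = min(dictionary.max_word_len, len(text) - i)
        for size in range(longest, 1, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```

With the words 中国, 中国人 and 人民 loaded, this matcher segments 中国人民 as 中国人 followed by 民, a typical ambiguity of purely mechanical matching. The stored frequencies are the kind of information the abstract proposes using to resolve such cases.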

Keywords/Search Tags:

Chinese word segmentation, Unknown words, word frequency statistics, Named Entity

Related items

1	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
2	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-Segmentation With Statistic
3	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-segmentation With Statistic
4	The Research On Chinese Word Segmentation System Based On SVM
5	Research On Chinese Named Entity Recognition Based On Feature Enhancement
6	Design And Implementation Of Chinese Word Segmentation System Based On Grammar
7	Research On Chinese Word Segmentation Methods Using Context Information
8	Research On Chinese Named Entity Recognition And New Word Detection
9	Comparative Research On Open-Source Chinese Word Segmentation Machines
10	Research On Chinese Word Segmentation Method Based On Word Embedding