Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words

Posted on: 2011-06-21  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Yue  Full Text: PDF
GTID: 2178360305483135  Subject: Computer application technology
Abstract/Summary:
With the growth of information technology, large amounts of information are easy to obtain. Processing this volume of information by hand is not feasible, so it must be done with the help of computers. Unlike Western languages, Chinese has no explicit delimiters between words, so before a computer can process Chinese text, the text must first be segmented into words. Because of the complexity of Chinese syntax and the continual emergence of new words, Chinese word segmentation systems have not yet achieved satisfactory results. This thesis analyzes the Chinese word segmentation algorithms in practical use and several kinds of dictionary structure, and studies the current problems of Chinese word segmentation. It adopts a method that combines statistics with a dictionary and makes improvements in several respects. First, the whole text is divided into shorter sentences according to its punctuation. In the statistical step, the fragments left by word segmentation are counted, unknown words that appear more than once in the text are identified, and these words are added to a temporary dictionary. The structure of the dictionary is also improved: it is divided into a basic dictionary and extended dictionaries.
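The punctuation split and the fragment statistics described above can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the punctuation set, the rule that a "fragment" is a run of adjacent single-character tokens absent from the dictionary, and the names `split_clauses`, `unknown_fragments`, and `build_temporary_dictionary` are all hypothetical.

```python
import re
from collections import Counter

# Assumed punctuation set; the thesis does not list the marks it uses.
PUNCTUATION = "，。！？；：、,.!?;:"
_SPLIT_RE = re.compile("[" + re.escape(PUNCTUATION) + r"\s]+")


def split_clauses(text):
    """Split text into shorter clauses at punctuation and whitespace."""
    return [clause for clause in _SPLIT_RE.split(text) if clause]


def unknown_fragments(tokens, dictionary):
    """Join runs of adjacent single-character tokens that are not in the
    dictionary. Assumption: such runs are the leftover 'fragments'."""
    fragments, run = [], []
    for token in tokens:
        if len(token) == 1 and token not in dictionary:
            run.append(token)
            continue
        if len(run) > 1:
            fragments.append("".join(run))
        run = []
    if len(run) > 1:
        fragments.append("".join(run))
    return fragments


def build_temporary_dictionary(segmented_clauses, dictionary, min_count=2):
    """Return fragments seen at least min_count times across all clauses."""
    counts = Counter()
    for tokens in segmented_clauses:
        counts.update(unknown_fragments(tokens, dictionary))
    return {fragment for fragment, n in counts.items() if n >= min_count}


# Usage: two clauses that both contain the unknown name "张小明".
known = {"去", "北京", "回家"}
first_pass = [["张", "小", "明", "去", "北京"], ["张", "小", "明", "回家"]]
temporary = build_temporary_dictionary(first_pass, known)
```

Here the repeated run "张小明" is promoted to the temporary dictionary because it appears in two clauses, while a run seen only once would be left for the later rule-based step.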
This thesis describes a Chinese word segmentation method based on statistics and a dictionary. The number of dictionaries is increased: besides the basic dictionary, special dictionaries are added that help eliminate ambiguity and recognize new words during segmentation. The in-memory data structure of the basic dictionary is also rebuilt using hash tables. The first two characters of each word serve as the keys of the main and sub hash tables, and the remaining characters of each word are stored in arrays organized by length. With these data structures, whenever the program meets a word, it can locate that word in the dictionary directly and quickly. Word frequency information is added to the dictionary for ambiguity resolution. The extended dictionaries include a quantifier dictionary, a name dictionary, a temporary dictionary, a stop-word dictionary, and so on. Segmenting quantifiers correctly reduces the number of ambiguities. An improved mechanical (dictionary-matching) method is used for a second segmentation pass. Finally, rules are used to identify new words that appear only once. The method identifies new words and eliminates ambiguity well, and it basically meets the requirements of practical Chinese information processing.
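The dictionary layout described above can be sketched roughly as below. It is a minimal illustration under assumptions, not the thesis's code: the class name `TwoLevelDictionary`, the separate table for one-character words, and the use of a per-length dict in place of the thesis's length-ordered arrays are all hypothetical. The matcher shown is plain forward maximum matching, a standard mechanical method used here as a stand-in for the thesis's improved variant, whose details the abstract does not give.

```python
class TwoLevelDictionary:
    """Main hash table keyed by a word's first character, sub hash table
    keyed by its second character, remainders grouped by length."""

    def __init__(self):
        self._main = {}    # first char -> {second char -> {length -> {rest: freq}}}
        self._single = {}  # one-character words (assumed separate table)
        self.max_len = 1

    def add(self, word, freq=1):
        """Store a word with its frequency (freq should be positive)."""
        if len(word) == 1:
            self._single[word] = freq
        else:
            rest = word[2:]
            sub = self._main.setdefault(word[0], {})
            by_len = sub.setdefault(word[1], {})
            by_len.setdefault(len(rest), {})[rest] = freq
        self.max_len = max(self.max_len, len(word))

    def frequency(self, word):
        """Return the stored frequency, or 0 if the word is absent."""
        if not word:
            return 0
        if len(word) == 1:
            return self._single.get(word, 0)
        by_len = self._main.get(word[0], {}).get(word[1], {})
        rest = word[2:]
        return by_len.get(len(rest), {}).get(rest, 0)


def forward_max_match(clause, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word
    at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(clause):
        longest = min(dictionary.max_len, len(clause) - i)
        for n in range(longest, 0, -1):
            candidate = clause[i:i + n]
            if n == 1 or dictionary.frequency(candidate) > 0:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Usage with a tiny hypothetical vocabulary and made-up frequencies.
basic = TwoLevelDictionary()
for w, f in [("北京", 50), ("北京大学", 20), ("大学", 30), ("学生", 40), ("去", 10)]:
    basic.add(w, f)
tokens = forward_max_match("去北京大学", basic)
```

Keying on the first two characters narrows each lookup to a small group of remainders of one length, which is what lets the program locate a word without scanning the whole vocabulary. The stored frequencies are what a disambiguation step would consult when two segmentations compete.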
Keywords/Search Tags: Chinese word segmentation, unknown words, word frequency statistics, named entity