Participle Dictionary Build

Posted on: 2011-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Wu
GTID: 2178360305492498
Subject: Computer application technology
Abstract/Summary:
As online digital information resources keep growing, how to process this information automatically has become an important research topic. Chinese word segmentation plays a very important role in large-scale Chinese information processing. Chinese text has no obvious delimiter between words, and treating a single character as the basic unit of information processing both loses necessary semantic content and introduces a great deal of redundant information. Word segmentation algorithms are therefore widely used in many areas of Chinese information processing. Most existing Chinese word segmentation systems first match words against a dictionary, then use syntactic-semantic relationships and statistical methods to resolve ambiguity and to handle unregistered words. The quality of the segmentation dictionary mechanism directly affects the speed and efficiency of the system, so an efficient and fast mechanism is essential.

The common segmentation dictionary mechanisms are binary-seek-by-character, binary-seek-by-word, and the TRIE indexing tree. As earlier analyses of these mechanisms show, all three build their index table on the first character. Statistics show that in Chinese, two-character and one-character words occur far more often than words of other lengths. Based on this observation, we propose building the index table with the first two characters as the key. The index table is a two-dimensional array: the algorithm locates a data item directly by mapping the internal codes of the first two Chinese characters to array indices. In this way, two-character words can be found directly in the two-dimensional array, and matching then proceeds to the remaining characters.

This approach can significantly reduce the number of lookups and thus further speed up segmentation. After selecting and processing a word corpus, the thesis establishes a segmentation dictionary test system based on the two-word array, which provides automatic segmentation, word lookup, and dictionary maintenance.
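The two-character index described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say which internal code is used, so GB2312 is assumed here, and the names `char_index` and `TwoCharIndexDictionary` are hypothetical. Rows of the two-dimensional array are allocated lazily to keep the example small, and the remainder of each longer word is kept in a plain set, which is also an assumption.

```python
ZONES = 0xF7 - 0xB0 + 1   # 72 GB2312 rows that hold Chinese characters (assumed encoding)
CELLS = 0xFE - 0xA1 + 1   # 94 positions per row
N = ZONES * CELLS         # 6768 possible index values per dimension


def char_index(ch):
    """Map one Chinese character to an array index via its GB2312 code."""
    raw = ch.encode("gb2312")  # raises UnicodeEncodeError (a ValueError) if unmappable
    if len(raw) != 2 or not (0xB0 <= raw[0] <= 0xF7 and 0xA1 <= raw[1] <= 0xFE):
        raise ValueError(f"not a GB2312 Chinese character: {ch!r}")
    return (raw[0] - 0xB0) * CELLS + (raw[1] - 0xA1)


class TwoCharIndexDictionary:
    """Hypothetical dictionary keyed on the first two characters of each word."""

    def __init__(self):
        self.single = set()       # one-character words
        self.rows = [None] * N    # rows[i][j] holds the cell for prefix (i, j)

    def add(self, word):
        if len(word) == 1:
            char_index(word)      # validate only
            self.single.add(word)
            return
        i, j = char_index(word[0]), char_index(word[1])
        if self.rows[i] is None:
            self.rows[i] = [None] * N   # allocate the row on first use
        if self.rows[i][j] is None:
            self.rows[i][j] = {"is_word": False, "tails": set()}
        cell = self.rows[i][j]
        if len(word) == 2:
            cell["is_word"] = True
        else:
            cell["tails"].add(word[2:])  # remainder after the two-character key

    def _cell(self, prefix):
        """Return the cell for a two-character prefix, or None if absent."""
        if len(prefix) < 2:
            return None
        try:
            i, j = char_index(prefix[0]), char_index(prefix[1])
        except ValueError:
            return None
        row = self.rows[i]
        return None if row is None else row[j]

    def contains(self, word):
        if len(word) == 1:
            return word in self.single
        cell = self._cell(word[:2])
        if cell is None:
            return False
        return cell["is_word"] if len(word) == 2 else word[2:] in cell["tails"]

    def longest_match(self, text, pos=0):
        """Return the longest dictionary word starting at text[pos], or ''."""
        best = text[pos] if text[pos:pos + 1] in self.single else ""
        cell = self._cell(text[pos:pos + 2])
        if cell is None:
            return best
        if cell["is_word"]:
            best = text[pos:pos + 2]
        for tail in cell["tails"]:
            if len(tail) + 2 > len(best) and text.startswith(tail, pos + 2):
                best = text[pos:pos + 2 + len(tail)]
        return best
```

In this sketch a two-character word is confirmed with two index computations and one array access, with no search over entries that merely share a first character. A dense array of 6768 x 6768 cells would give the same direct addressing at a higher fixed memory cost.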
Keywords/Search Tags: Dictionary Mechanism, Segmentation Dictionary, Two-Word-Array