Font Size: a A A

Segmentation Dictionary Hash-based Mechanism Design And Realization

Posted on:2009-07-21Degree:MasterType:Thesis
Country:ChinaCandidate:Q L WangFull Text:PDF
GTID:2208360245961841Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Chinese language information processing has become a popular research subject in the information processing field as the Chinese becomes an important tool for information transmitting and communication. Chinese Automatic Segmentation is an important part of Chinese information processing, and word processing efficiency is a key factor to the performance of Chinese automatic segmentation system. Therefore, establishing a high efficiency dictionary has significant meaning. In this paper, we made a deep study on the dictionary structure of Chinese word segmentation system. Some of the reach achievements we made have been used in the national 863 junk message management project and we also applied patent for the algorithms. The main research results are as follows:1 Several kinds of dictionary storage and processing algorithms were compared, analyzed, and improved. The array-array method uses array to storage word lies and binary search algorithms to find them, but it is low in operating efficiency and has a disadvantage in the dictionary update. The double-word-hash method hashes the first two words, uses a depth two TRIE tree, but it's structure is complex. The four-word-hash method only works for four-word idioms so there are limitations in application. In this paper, these methods were improved to solve the efficiency problem.2 According to the characteristics of Chinese GB codes, a relatively efficient st-orage algorithm is proposed. We store terms which have the same capital as a single text line, remove the term capital, term hash value and relevant attributes so every term can be formatted. The memory using efficiency is also enhanced.3 New Hash-based dictionary searching, updating, deleting and adding algorithms were proposed according to that Hash Table has an advantage in search efficiency. We designed a useful Hash function. Experiments show that its collision rate is extremely low, very suitable for small and medium-sized dictionary. Large dictionaries can also use it after improvements. 4 The arrays, linked list, AVL tree, Hash Table (with both minimal collision and collision-free methods) algorithms were implemented respectively. We made a detailed analysis and evaluation of these algorithms on time complexity and space complexity aspects. We compared the new methods and the traditional ones on loading, writing, file size and operating hours, and experiment showed that our algorithms is two times higher in memory efficiency and five to six times higher in time efficiency that the old ones.5 The Hash-based dictionary structure program module was finished in Java language and an expansion solution on word term attributes of this method wa-s given.
Keywords/Search Tags:Segmentation Dictionary, Hash, AVL tree, TRIE indexing tree
PDF Full Text Request
Related items