Segmentation Dictionary Hash-based Mechanism Design And Realization

Posted on:2009-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:Q L Wang

Full Text:PDF

GTID:2208360245961841

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Chinese language information processing has become a popular research subject in the information processing field as the Chinese becomes an important tool for information transmitting and communication. Chinese Automatic Segmentation is an important part of Chinese information processing, and word processing efficiency is a key factor to the performance of Chinese automatic segmentation system. Therefore, establishing a high efficiency dictionary has significant meaning. In this paper, we made a deep study on the dictionary structure of Chinese word segmentation system. Some of the reach achievements we made have been used in the national 863 junk message management project and we also applied patent for the algorithms. The main research results are as follows:1 Several kinds of dictionary storage and processing algorithms were compared, analyzed, and improved. The array-array method uses array to storage word lies and binary search algorithms to find them, but it is low in operating efficiency and has a disadvantage in the dictionary update. The double-word-hash method hashes the first two words, uses a depth two TRIE tree, but it's structure is complex. The four-word-hash method only works for four-word idioms so there are limitations in application. In this paper, these methods were improved to solve the efficiency problem.2 According to the characteristics of Chinese GB codes, a relatively efficient st-orage algorithm is proposed. We store terms which have the same capital as a single text line, remove the term capital, term hash value and relevant attributes so every term can be formatted. The memory using efficiency is also enhanced.3 New Hash-based dictionary searching, updating, deleting and adding algorithms were proposed according to that Hash Table has an advantage in search efficiency. We designed a useful Hash function. Experiments show that its collision rate is extremely low, very suitable for small and medium-sized dictionary. Large dictionaries can also use it after improvements. 4 The arrays, linked list, AVL tree, Hash Table (with both minimal collision and collision-free methods) algorithms were implemented respectively. We made a detailed analysis and evaluation of these algorithms on time complexity and space complexity aspects. We compared the new methods and the traditional ones on loading, writing, file size and operating hours, and experiment showed that our algorithms is two times higher in memory efficiency and five to six times higher in time efficiency that the old ones.5 The Hash-based dictionary structure program module was finished in Java language and an expansion solution on word term attributes of this method wa-s given.

Keywords/Search Tags:

Segmentation Dictionary, Hash, AVL tree, TRIE indexing tree

PDF Full Text Request

Related items

1	Research Dictionary System Based On Double Character Hash Index Of PAT Tree
2	Research On IP Lookup Algorithm Based On Dynamic Programming And B+ Tree
3	Research Of Personalized Search Based On Trie Tree
4	The Research Of IPv6 Routing Lookup Algorithm Based On Hash Table And Multibit Trie
5	Design And Implementation Of Personalized Recommendation Engine Based On Improved Trie Tree And AC Algorithm In Android Platform
6	The Adaptive Trie Tree And The Improved Algorithms Of GPERF
7	Chemical Dictionary Of Structural Design And Development Of Chinese Word Segmentation System
8	Research On Efficient Dictionary Generation Method Based On Password Features
9	Multidimensional range searching using G-tree and B*-tree indexing
10	Research On Main Memory Database Indexing Based On GPU