
Research on a Dictionary System Based on a Double-Character Hash Index PAT Tree

Posted on: 2012-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhao
Full Text: PDF
GTID: 2218330368482195
Subject: Computer technology
Abstract/Summary:
Chinese word segmentation is the foundation of Chinese information processing, which is itself an important technique in computer applications and is used in many domains, such as computer networks, databases and software engineering. Chinese text consists of a continuous series of characters with no blanks between words, so segmentation is a necessary step in processing Chinese text and must be handled appropriately. Information processing depends on segmentation that is both efficient and accurate.

This thesis discusses a Patricia tree dictionary system based on double-character hash indexing, together with the foundations of word segmentation. First, three common word segmentation dictionary systems are studied, along with dictionary systems based on double-character hash indexing, the Patricia tree and four-character hash indexing. Several problems are identified in these systems, and their unreasonable structures are improved. Taking advantage of the large proportion of two-character words in Chinese text and the high efficiency of hash search, the double-character hash indexing dictionary handles the first two characters of a word with a hash.

Experiments show that this dictionary mechanism is more efficient at processing two-character words, but needs improvement for words longer than two characters. The dictionary system based on the Patricia tree is superior in time efficiency, but requires more storage space. To address this, the thesis proposes a dictionary mechanism that combines the Patricia tree with double-character hash indexing, and describes its query and update procedures. The combined mechanism retains the high efficiency of double-character hash indexing while also improving segmentation efficiency for words longer than two characters.
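The lookup structure described above can be sketched roughly as follows. This is an illustration only, not the thesis's implementation: a Python `dict` stands in for the hash index on the first two characters, and a plain per-character trie stands in for the Patricia tree (a real Patricia tree compresses single-child paths). The names `DoubleCharHashDict`, `add`, `contains` and `longest_match` are hypothetical.

```python
class _Node:
    """One node of the suffix trie hanging off a two-character hash entry."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # next character -> _Node
        self.is_word = False  # True if the path from the index ends a word


class DoubleCharHashDict:
    """Hypothetical sketch: hash the first two characters, walk a trie for the rest."""

    def __init__(self):
        self._single = set()  # one-character words, kept apart from the index
        self._index = {}      # two-character prefix -> root _Node of its suffixes

    def add(self, word):
        if not word:
            raise ValueError("empty word")
        if len(word) == 1:
            self._single.add(word)
            return
        node = self._index.setdefault(word[:2], _Node())
        for ch in word[2:]:
            node = node.children.setdefault(ch, _Node())
        node.is_word = True

    def contains(self, word):
        if len(word) == 1:
            return word in self._single
        node = self._index.get(word[:2])  # a single hash probe settles 2-char words
        for ch in word[2:]:
            if node is None:
                return False
            node = node.children.get(ch)
        return node is not None and node.is_word

    def longest_match(self, text, start=0):
        """Length of the longest dictionary word starting at text[start], or 0."""
        best = 1 if text[start:start + 1] in self._single else 0
        node = self._index.get(text[start:start + 2])
        pos = start + 2
        while node is not None:
            if node.is_word:
                best = pos - start
            if pos >= len(text):
                break
            node = node.children.get(text[pos])
            pos += 1
        return best
```

In this sketch a two-character word is resolved by one hash probe, while longer words cost one additional trie step per extra character, which is the trade-off the abstract describes.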
The depth of the Patricia tree is kept under control because a hash table is used for the first word.

The thesis describes the generation of the PAT-tree dictionary based on a double-character hash index, which is integrated into the 3GWS word segmentation system, and tests the time and space efficiency of the dictionary. The experiments show that this dictionary system is considerably better in time efficiency than the word-by-word dictionary system and the double-character hash index system. The results also show that it is better than the PAT-tree system in space efficiency, and that it makes it easier to update the other dictionaries.

On the basis of a comprehensive comparison and analysis of the mechanical and traditional Chinese word segmentation methods in common use, this thesis proposes and implements a combined mechanical-statistical system. To combine the two methods closely, offset their respective disadvantages and make the best use of both, the thesis carries out in-depth research in the following aspects. In mechanical Chinese word segmentation: changing the matching length of the maximum matching method dynamically rather than statically, in order to reduce unnecessary matching operations; using word frequency information as an additional segmentation criterion, to compensate for the shortcomings of the "long word first" criterion; and using a segmentation dictionary based on a hash structure to increase segmentation efficiency. In statistical Chinese word segmentation: to increase the efficiency of the statistics operation, the thesis generalizes the concept of the segmentation unit, merges the statistics operation with mechanical Chinese word segmentation, and uses a hash structure to store the results of the statistics operation, which raises the speed of mechanical word segmentation.

Finally, the problems encountered in obtaining the dictionary and directions for future work are discussed.
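The idea of a dynamically chosen matching length can be illustrated with a small sketch. It is one possible reading, not the thesis's algorithm: it assumes the window for forward maximum matching is capped by the longest dictionary word that begins with the current character, and it uses a plain Python `set` as the dictionary. The function names `build_max_len` and `forward_max_match` are hypothetical.

```python
def build_max_len(words):
    """Map each first character to the length of the longest word starting with it."""
    max_len = {}
    for w in words:
        if w:
            max_len[w[0]] = max(max_len.get(w[0], 1), len(w))
    return max_len


def forward_max_match(text, words, max_len):
    """Forward maximum matching with a per-character (dynamic) window size.

    A static variant would try a fixed maximum length at every position;
    here the window shrinks to what the dictionary can actually match.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Assumed rule: never try a span longer than any word starting here.
        window = min(max_len.get(text[i], 1), len(text) - i)
        for size in range(window, 0, -1):
            # A single character is always accepted as a fallback token.
            if size == 1 or text[i:i + size] in words:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


words = {"研究", "研究生", "生命", "起源", "的"}
max_len = build_max_len(words)
print(forward_max_match("研究生命的起源", words, max_len))
# prints ['研究生', '命', '的', '起源']
```

The output also shows the weakness of a pure "long word first" rule: the greedy match takes 研究生 and strands 命, where 研究 / 生命 would be the better split. This is the kind of case where an additional criterion such as word frequency can help.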
Keywords/Search Tags:word segmentation, dictionary, corpus, hash table, PAT tree