
Research On Chinese Word Segmentation For Large Scale Information Retrieval

Posted on: 2007-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: S L Wang
Full Text: PDF
GTID: 2178360185454135
Subject: Computer software and theory
Abstract/Summary:
Chinese Word Segmentation (CWS) is one of the fundamental components of Chinese information processing, and it is frequently used in the text operations of Chinese Information Retrieval (CIR). Much research has focused on CWS technology. However, most of it addresses general-purpose algorithms, and little is specialized for information retrieval.

This thesis begins with an introduction to the difficulties of CWS and to several segmentation algorithms in frequent use today. Then, through an investigation into the influence of CWS on CIR, it summarizes the characteristics that a CWS method needs in order to suit large-scale CIR. Finally, we propose and develop a CWS algorithm with these characteristics.

Because information retrieval places high demands on segmentation speed, our lexicon mechanism adopts an improved double-array trie algorithm, which requires only n-1 addition operations to look up a word and has time complexity O(n), where n is the length of the query word. Our experiments show that the improved double-array trie is faster than both a conventional trie and double-character hashing, in word lookup as well as in Maximum Matching segmentation.

Ambiguity resolution and unknown word identification are two difficulties in CWS. Given the characteristics of CIR, we resolve only overlapping ambiguity during the disambiguation phase, using double-character coupling and the difference of t-test to decide the ambiguous segmentation position, and we handle covering (overlaying) ambiguity in the query expansion phase. In the unknown word identification phase, the position probabilities of a single Chinese character within a word are considered together with the local bi-character frequency to identify both named entities and new words.

Our experiments were run on a PC with a 3.2 GHz Pentium 4 CPU and 512 MB of RAM. The results show that this segmentation algorithm reaches a speed of about 2 MB/s, which is much faster than many more accurate methods such as ICTCLAS. Meanwhile, in the same retrieval system, we compared it with Overlapping Bi-gram, which is frequently used in CIR; with Maximum Matching segmentation, one of the most popular CWS algorithms; and with ICTCLAS. Against these three baselines, P@10 (the precision of the first 10 retrieved documents) improves by 9%, 11.4% and 8.8% respectively, and P@20 (the precision of the first 20 retrieved documents) improves by 13.2%, 12.7% and 7.5% respectively.
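
As an illustration of the baselines and the metric named in the abstract, the following is a minimal Python sketch of forward Maximum Matching over a lexicon trie, Overlapping Bi-gram tokenization, and P@k. It is not the thesis's implementation: a plain dictionary-based trie stands in for the improved double-array trie, and the lexicon, the function names, and the document ids are hypothetical choices made for this example only.

```python
from typing import Dict, Iterable, List, Set


class TrieNode:
    """Plain dict-based trie node (NOT a double-array trie; illustration only)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every lexicon word into a trie, one character per level."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def forward_max_match(text: str, root: TrieNode) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        node = root
        longest = 1  # fall back to a single character when no word matches
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.is_word:
                longest = j - i + 1
        tokens.append(text[i:i + longest])
        i += longest
    return tokens


def overlapping_bigrams(text: str) -> List[str]:
    """Overlapping Bi-gram baseline: every pair of adjacent characters is a token."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """P@k: fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


# Hypothetical toy lexicon and input, chosen to show an overlapping ambiguity.
lexicon = ["研究", "研究生", "生命", "起源"]
trie = build_trie(lexicon)
segmented = forward_max_match("研究生命起源", trie)
bigrams = overlapping_bigrams("研究生命")
p_at_4 = precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, 4)
```

With this toy lexicon, greedy matching picks 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. This is the kind of overlapping ambiguity that the disambiguation phase described above is meant to resolve.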
Keywords/Search Tags: information retrieval, Chinese word segmentation, double-array trie, double-character coupling, difference of t-test