Research On Chinese Word Segmentation For Large Scale Information Retrieval

Posted on:2007-10-12

Degree:Master

Type:Thesis

Country:China

Candidate:S L Wang

Full Text:PDF

GTID:2178360185454135

Subject:Computer software and theory

Abstract/Summary:

Chinese Word Segmentation (CWS) is a fundamental component of Chinese information processing and is frequently used in the text-processing stage of Chinese Information Retrieval (CIR). CWS has been studied extensively, but most of the work targets general-purpose algorithms; little of it is specialized for information retrieval.

This thesis first introduces the difficulties of CWS and several segmentation algorithms in common use. It then investigates the influence of CWS on CIR and summarizes the characteristics a CWS technique needs in order to suit large-scale CIR. Finally, it proposes and implements a CWS algorithm designed for CIR.

Because segmentation for information retrieval places high demands on speed, our lexicon mechanism adopts an improved double-array trie algorithm, which requires only n-1 addition operations to look up a word and has time complexity O(n), where n is the length of the query word. Our experiments show that the improved double-array trie is faster than both a trie and a double-character hash, in word lookup as well as in Maximum Matching segmentation.

Ambiguity resolution and unknown word identification are the two main difficulties in CWS. In keeping with the characteristics of CIR, we resolve only overlapping ambiguity during the disambiguation phase, using double-character coupling and the difference of t-test to decide the ambiguous segmentation position, and we handle overlaying ambiguity in the query expansion phase.
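To make the lexicon lookup and Maximum Matching step more concrete, the following is a minimal Python sketch of a textbook double-array trie with forward Maximum Matching on top of it. It is not the improved variant developed in the thesis, whose details the abstract does not give. The class and function names, the character-to-code mapping, and the four-word toy lexicon are all assumptions made for illustration.

```python
from collections import deque


class DoubleArrayTrie:
    """Textbook double-array trie (illustrative sketch, not the thesis variant).

    A transition from state s on character code c leads to t = base[s] + c,
    and it is valid only if check[t] == s.
    """

    def __init__(self, words):
        # Assumption of this sketch: map each character to a small positive code.
        chars = sorted({ch for w in words for ch in w})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}
        # Build an ordinary nested-dict trie first; the key "" marks end of word.
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[""] = {}
        # Slot 0 is the root; check == -1 marks a free slot.
        self.base, self.check, self.terminal = [0], [0], [False]
        queue = deque([(0, root)])
        while queue:
            state, node = queue.popleft()
            self.terminal[state] = "" in node
            children = [(self.code[ch], sub) for ch, sub in node.items() if ch != ""]
            if not children:
                continue
            b = self._find_base([c for c, _ in children])
            self.base[state] = b
            for c, sub in children:
                self.check[b + c] = state  # claim the slot for this parent
                queue.append((b + c, sub))

    def _find_base(self, codes):
        # Smallest offset b such that every slot b + c is still free.
        b = 1
        while True:
            need = b + max(codes) + 1
            if need > len(self.check):
                grow = need - len(self.check)
                self.base.extend([0] * grow)
                self.check.extend([-1] * grow)
                self.terminal.extend([False] * grow)
            if all(self.check[b + c] == -1 for c in codes):
                return b
            b += 1

    def _step(self, state, ch):
        # One transition: an addition plus one array comparison.
        c = self.code.get(ch)
        if c is None:
            return None
        t = self.base[state] + c
        if t < len(self.check) and self.check[t] == state:
            return t
        return None

    def contains(self, word):
        state = 0
        for ch in word:
            state = self._step(state, ch)
            if state is None:
                return False
        return self.terminal[state]

    def longest_prefix(self, text, start):
        # Length of the longest lexicon word starting at text[start], or 0.
        state, best = 0, 0
        for i in range(start, len(text)):
            state = self._step(state, text[i])
            if state is None:
                break
            if self.terminal[state]:
                best = i + 1 - start
        return best


def forward_maximum_match(text, trie):
    # Greedy left-to-right segmentation; unmatched characters become single tokens.
    tokens, i = [], 0
    while i < len(text):
        n = trie.longest_prefix(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


# Toy lexicon (hypothetical) showing an overlapping ambiguity.
trie = DoubleArrayTrie(["研究", "研究生", "生命", "起源"])
print(forward_maximum_match("研究生命起源", trie))  # ['研究生', '命', '起源']
```

With this toy lexicon, greedy matching picks 研究生 and strands 命, although 研究 / 生命 / 起源 is the intended reading. This is the kind of overlapping ambiguity that a disambiguation phase has to resolve.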
In the unknown word identification phase, the position probabilities of a single Chinese character within a word are considered together with local bi-character frequencies to identify both named entities and new words.

Our experiments were run on a PC with a Pentium 4 CPU at 3.2 GHz and 512 MB of RAM. The results show that the segmentation algorithm reaches a speed of about 2 MB/s, which is much faster than many more accurate methods such as ICTCLAS. Meanwhile, in the same retrieval system, compared with Overlapping Bi-gram (frequently used in CIR), Maximum Matching segmentation (one of the most popular CWS algorithms), and ICTCLAS, P@10 (the precision over the first 10 retrieved documents) improves by 9%, 11.4%, and 8.8%, respectively, and P@20 (the precision over the first 20 retrieved documents) improves by 13.2%, 12.7%, and 7.5%, respectively.
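For readers unfamiliar with the P@10 and P@20 measures reported above, here is a small Python sketch of precision at k. The function name, the document identifiers, and the relevance judgments are hypothetical and serve only to illustrate the definition; they are not data from the thesis.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the first k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k, not by len(top_k): missing results count as non-relevant.
    return hits / k


# Hypothetical ranking and relevance judgments for a single query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
print(precision_at_k(ranked, relevant, 5))  # 0.4
```

In an evaluation, this value would typically be averaged over all test queries for each segmentation method being compared.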

Keywords/Search Tags:

information retrieval, Chinese word segmentation, double-array trie, double-character coupling, difference of t-test

Related items

1	Study On Efficient Indexing For Large Scale Chinese Text Retrieval Systems
2	Research And Improvement Of ICTCLAS Chinese Lexical Analysis System
3	Research On Efficient Index Structure And Parallelization Based On Double Array Trie
4	Research And Application Of The Key Techniques In Chinese Query Answering System Of Networking Education
5	Dictionary Based Chinese Word Segmentation Algorithm And Its Application In Nutch System
6	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
7	Research Of Chinese Word Segmentation With Conditional Random Fields And Implementation
8	Study On Coupling Technology Between Double-Cladding Fiber And Laser Diode Array
9	Research On Cloud Data Security Deduplication Technology Based On Double Array Trie Tree
10	The Research On Chinese Word Segmentation System Based On SVM