
Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences

Posted on: 2007-05-11
Degree: Ph.D.
Type: Thesis
University: Hong Kong Polytechnic University (People's Republic of China)
Candidate: Wang, Rongbo
Full Text: PDF
GTID: 2448390005465735
Subject: Computer Science
Abstract/Summary:
This thesis addresses two important problems in Chinese information processing: Chinese chunk segmentation and the similarity measure of Chinese sentences. The three main contributions reported in this thesis are: (1) a novel Chinese chunk segmentation technique using a statistical model combined with correction rules generated by an error-correction mechanism; (2) a novel similarity measure of Chinese sentences using both the word/chunk sequences and the POS (part-of-speech) tag sequences of the sentences; and (3) the optimization of the parameters used in the combined similarity measure by applying a relevance feedback technique and a neural network model.

In the first investigation, a statistical model combined with correction rules generated by an error-correction mechanism is proposed for Chinese chunk segmentation. The Chinese sentences in the training corpus were segmented into chunks manually to provide the ground truth for training the statistical model, from which preliminary chunk segmentation results are obtained. The output of the statistical model (both correctly and incorrectly segmented chunks) is then used to generate a set of correction rules for refining the segmentation. The error-correction mechanism generates these rules by comparing the preliminary segmentation with the manual segmentation. The statistical model and the learned correction rules can then be used together to segment unseen sentences into chunks.

In the second investigation, novel similarity measures of Chinese sentences are proposed that use the word/chunk sequences and POS tag sequences of the sentences. The sentence similarity measure is a very important component of example-based machine translation (EBMT). Unlike English sentences, Chinese sentences have no delimiter between words. Hence, Chinese word/chunk delimitation must be performed before a sentence similarity measure can be computed. Both the word/chunk sequence feature and the POS tag sequence feature used in our proposed similarity measures are based on word/chunk segmentation. Sentence structure information is partially reflected in the POS tag sequence. For the proposed word-sequence-matching-based (WSMB) method, we take into consideration three factors between two sentences: the number of identical word sequences, the length of each identical word sequence, and the average weighting (AW) of each identical word sequence. In computing AW we weight every POS tag according to its importance. The POS-tag-sequence-matching-based (PTSMB) method measures the similarity of Chinese sentences in terms of their structures. If the constituents of two Chinese sentences are similar, then we can judge that the two sentences are similar in structure. The main idea of this similarity measure is to match the POS tags of two Chinese sentences using directed graphs. The POS weighting is also used in this process.

In the third investigation, we propose a human-computer interaction approach, based on a relevance feedback scheme and a neural network model, to optimize the parameters used in the combined similarity measure of Chinese sentences. In the relevance feedback process, users' intentions and preferences in ranking the candidate sentences are captured and used to modify the parameters of the similarity measure. For the parameter optimization research, a web-based questionnaire was designed to collect users' feedback data. In this pioneering study, we constructed 50 groups of sentences. Every group contains one source sentence and ten sentences to be retrieved. The ten test sentences are shown in descending order of similarity to the source sentence. The user is asked to provide a new ranking according to his or her own judgment if he or she does not agree with the ranking produced by the computer. The new rank is converted...
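
As an illustration of the three WSMB factors named in the second investigation (number of identical word sequences, length of each sequence, and average POS weighting of each sequence), the following is a minimal Python sketch. It is not the thesis's formula, which the abstract does not give. The POS weights, the greedy longest-first matching, the fragmentation penalty, and every function and constant name below are assumptions made purely for illustration. The input is assumed to be already segmented and POS-tagged.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag) from an already segmented, tagged sentence

# Hypothetical POS weights: the abstract says tags are weighted by importance
# but does not list the values, so these numbers are placeholders.
POS_WEIGHT = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "p": 0.3, "u": 0.2}
DEFAULT_WEIGHT = 0.5
# Hypothetical penalty applied for each additional identical sequence, so that
# one long match scores higher than the same coverage split into fragments.
FRAGMENT_PENALTY = 0.2


def weight(tok: Token) -> float:
    """Importance weight of a token, looked up by its POS tag."""
    return POS_WEIGHT.get(tok[1], DEFAULT_WEIGHT)


def common_sequences(a: List[Token], b: List[Token]) -> List[List[Token]]:
    """Greedily collect non-overlapping identical word sequences, longest first."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    matches: List[List[Token]] = []
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while words agree and neither token is taken.
                while (i + k < len(a) and j + k < len(b)
                       and not used_a[i + k] and not used_b[j + k]
                       and a[i + k][0] == b[j + k][0]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break
        matches.append(a[best_i:best_i + best_len])
        for t in range(best_len):
            used_a[best_i + t] = True
            used_b[best_j + t] = True
    return matches


def wsmb_similarity(a: List[Token], b: List[Token]) -> float:
    """Combine sequence count, sequence length, and average weighting (AW)."""
    total = max(sum(weight(t) for t in a), sum(weight(t) for t in b))
    seqs = common_sequences(a, b)
    if total == 0 or not seqs:
        return 0.0
    matched = 0.0
    for seq in seqs:
        aw = sum(weight(t) for t in seq) / len(seq)  # average weighting of the sequence
        matched += len(seq) * aw                     # length times AW
    coverage = matched / total
    return coverage / (1.0 + FRAGMENT_PENALTY * (len(seqs) - 1))


s1 = [("我", "r"), ("喜欢", "v"), ("吃", "v"), ("苹果", "n")]
s2 = [("我", "r"), ("喜欢", "v"), ("吃", "v"), ("香蕉", "n")]
print(wsmb_similarity(s1, s2))
```

In this sketch, two identical sentences score 1 and two sentences with no shared words score 0. The weights and the penalty are the kind of parameters that the relevance feedback scheme in the third investigation would tune.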
Keywords/Search Tags: Chinese, Similarity measure, POS tag, Statistical model, Correction rules