
Research on Chinese-English Text-Level Sentence Alignment Technology

Posted on: 2015-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: K J Sun
Full Text: PDF
GTID: 2308330473953639
Subject: Computer software and theory
Abstract/Summary:
A bilingual corpus is a repository of two semantically aligned sets of language resources, and it is an important resource for many language processing and machine translation tasks. Bilingual corpora are widely used in machine translation, machine-aided human translation, translation knowledge extraction, word sense disambiguation (WSD), and cross-language information retrieval. Alignment is the key technology for processing bilingual text, and the alignment result has a direct impact on subsequent work.

Text-level alignment of bilingual text comprises paragraph alignment and sentence alignment, and the two follow similar principles. Based on the actual characteristics of Chinese-English bilingual text, this thesis aims mainly to improve alignment speed while keeping precision and recall in balance.

First, an anchor alignment algorithm is proposed that divides the text into blocks. The method uses named entities in the text, such as person names, place names, organizations, numbers, times, and dates, as anchors, and applies dynamic programming to divide the text into several fragments. Experiments show that the precision of the anchor alignment algorithm can reach 98% in both paragraph alignment and sentence alignment.

Then, in the paragraph alignment experiments, combining the length-based alignment method with the dictionary-based alignment method gave good results. The dictionary-based method used equal weights, and the accuracy reached 93.4%. When the anchor alignment method was added, recall increased and alignment was 2.5 times as fast as before. The sentence alignment experiments compared direct sentence alignment with sentence alignment preceded by paragraph alignment. The experiments show that aligning paragraphs first was better than direct sentence alignment.
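The anchor step described above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption: digit strings stand in for full named-entity recognition, a longest-common-subsequence style dynamic program picks a monotone chain of units that share an anchor, and the function names (`number_anchors`, `anchor_chain`, `split_at_anchors`) are hypothetical.

```python
import re


def number_anchors(text):
    """Digit strings in a text unit (an assumed stand-in for named-entity anchors)."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def anchor_chain(src_units, tgt_units):
    """Longest monotone chain of (i, j) pairs whose units share an anchor (LCS-style DP)."""
    n, m = len(src_units), len(tgt_units)
    src_a = [number_anchors(u) for u in src_units]
    tgt_a = [number_anchors(u) for u in tgt_units]
    # best[i][j] = length of the longest chain within src_units[i:] and tgt_units[j:]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if src_a[i] & tgt_a[j]:
                best[i][j] = max(best[i][j], 1 + best[i + 1][j + 1])
    # Trace the table forward to recover one optimal chain.
    chain = []
    i = j = 0
    while i < n and j < m:
        if src_a[i] & tgt_a[j] and best[i][j] == 1 + best[i + 1][j + 1]:
            chain.append((i, j))
            i, j = i + 1, j + 1
        elif best[i + 1][j] >= best[i][j + 1]:
            i += 1
        else:
            j += 1
    return chain


def split_at_anchors(src_units, tgt_units):
    """Cut both texts after each matched anchor pair, giving fragment pairs to align separately."""
    fragments = []
    si = ti = 0
    for i, j in anchor_chain(src_units, tgt_units):
        fragments.append((src_units[si:i + 1], tgt_units[ti:j + 1]))
        si, ti = i + 1, j + 1
    if si < len(src_units) or ti < len(tgt_units):
        fragments.append((src_units[si:], tgt_units[ti:]))
    return fragments
```

Because each fragment pair can be aligned on its own, the search space of the later length-based and dictionary-based alignment shrinks, which is one plausible reading of the speed-up reported in the abstract.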
Without paragraph information, direct sentence alignment combined the length-based and dictionary-based alignment methods, with dictionary weights computed by TF-IDF; precision reached 93.6%. When the anchor alignment method was added, precision was unchanged, recall increased by 0.5 percent, and speed improved by a factor of 3.4. With paragraph information, paragraph alignment combined the length-based and dictionary-based methods with equal dictionary weights. After paragraph alignment, sentence alignment used the same combination but with TF-IDF dictionary weights, and precision reached 92.8%. When the anchor alignment method was added to this process, precision was unchanged and recall increased by 0.5 percentage points.

The main work of this thesis falls into two parts. The first is the anchor alignment method: the text is divided into small pieces, which are then aligned at the paragraph or sentence level, and experiments show that precision can reach 98%. The second combines the length feature with the bilingual dictionary alignment method, merges the two into text-level paragraph and sentence alignment, and summarizes which methods are suitable for paragraph alignment and for sentence alignment.
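The difference between equal and TF-IDF dictionary weights can be sketched as follows. The abstract does not state the thesis's scoring formulas, so this is only an assumed illustration: inputs are pre-tokenized sentences, `lexicon` is a hypothetical bilingual dictionary mapping a source token to a set of target tokens, the length score is a simple ratio of token counts, and `alpha` is an assumed mixing weight.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Smoothed inverse document frequency per token, treating each sentence as a document."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def length_score(src, tgt, ratio=1.0):
    """Similarity in [0, 1] of token counts, after scaling the source by an assumed length ratio."""
    a, b = len(src) * ratio, len(tgt)
    return min(a, b) / max(a, b) if max(a, b) > 0 else 1.0


def dictionary_score(src, tgt, lexicon, idf=None):
    """Weighted share of source tokens that have a dictionary translation in the target.

    With idf=None every token occurrence weighs 1 (equal weighting);
    otherwise each token weighs term frequency times idf.
    """
    tgt_set = set(tgt)
    total = matched = 0.0
    for tok, count in Counter(src).items():
        weight = count * (idf.get(tok, 1.0) if idf else 1.0)
        total += weight
        if tgt_set & lexicon.get(tok, set()):
            matched += weight
    return matched / total if total else 0.0


def combined_score(src, tgt, lexicon, idf=None, alpha=0.5):
    """Linear mix of the length score and the dictionary score (alpha is an assumption)."""
    return alpha * length_score(src, tgt) + (1 - alpha) * dictionary_score(src, tgt, lexicon, idf)
```

Under TF-IDF weighting, a matched content word that is rare in the corpus contributes more to the score than a frequent function word, whereas equal weighting treats every token occurrence alike.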
Keywords/Search Tags: Paragraph Alignment, Sentence Alignment, Anchor Match, Entity Recognition, TF-IDF Weight