Research And Implement On Mining Parallel Bilingual Translation Pair From The Web

Posted on: 2008-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: C Fan
GTID: 2178360242971378
Subject: Computer software and theory
Abstract/Summary:
In recent years, bilingual translation pairs, including sentences, phrases, and words, have played an increasingly important role in natural language processing. Translation pairs are basic resources for Cross-Language Information Retrieval (CLIR) and machine translation systems, and the performance of such instance-based applications benefits from them. Many methods have been proposed for extracting translation pairs. Early methods focus on extraction from parallel text, but they cannot solve problems such as the lack of large-scale corpora, domain limitations, and the inability to handle Out-Of-Vocabulary (OOV) terms. With the rapid development of the World Wide Web, a number of translation pairs have become available on the web, and these pairs are usually diverse and contain many OOV terms. Hence, extracting translation pairs from the web has become an important research area in the field of information extraction.

This paper proposes a novel method to extract bilingual translation pairs from the web. Based on the observation that translation pairs tend to appear collectively on the web, a recursive process is used to extract high-quality translation pairs. First, a search engine is queried with seed data and the returned pages are crawled. Next, the Collective Translation Pair Block (CTPB), which contains the collective translation pairs, is identified using a heuristic evaluation method. After the CTPB has been identified, a PAT tree is employed to generate extraction patterns automatically. A ranking SVM model is then used to re-rank these patterns based on the F-measure, and the top 10 patterns are adopted to extract translation pairs with the help of surface patterns. Finally, to obtain high-quality translations, the extracted translation pairs are verified by an SVM classifier based on the translation relevance between the source and target languages.

The contributions of this study can be summarized as follows:
① A snowball-like method is proposed to extract translation pairs from the web.
② An integrated mining scheme is presented to discover, extract, and verify translation pairs from the web. Experimental results show that our scheme achieves higher extraction performance than previous approaches.
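The snowball-like loop described in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation: the in-memory `PAGES` list and `toy_search` stand in for a real search engine and crawler, the single hand-written `SURFACE_PATTERN` stands in for the patterns generated by the PAT tree and re-ranked by the ranking SVM, and `looks_valid` is a trivial length filter standing in for the SVM verification step. All of these names are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (Chinese term, English translation)

# Hypothetical surface pattern: a Chinese term followed by its English
# translation in parentheses, e.g. "机器翻译 (machine translation)".
SURFACE_PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*\(([A-Za-z][A-Za-z ]*[A-Za-z])\)"
)


def extract_pairs(page: str) -> Set[Pair]:
    """Apply the surface pattern to one page and collect candidate pairs."""
    return {(zh, en.lower()) for zh, en in SURFACE_PATTERN.findall(page)}


def looks_valid(pair: Pair) -> bool:
    """Toy verification filter (assumption: replaces the SVM classifier)."""
    zh, en = pair
    return 1 <= len(zh) <= 10 and len(en.split()) <= 5


def snowball(
    seeds: Iterable[Pair],
    search: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Set[Pair]:
    """Grow the set of pairs: newly verified pairs become the next seeds."""
    known: Set[Pair] = set(seeds)
    frontier: Set[Pair] = set(known)
    for _ in range(max_rounds):
        found: Set[Pair] = set()
        for zh, en in sorted(frontier):
            for page in search(f"{zh} {en}"):
                found |= {p for p in extract_pairs(page) if looks_valid(p)}
        frontier = found - known  # only pairs not seen before
        if not frontier:          # nothing new: the snowball stops growing
            break
        known |= frontier
    return known


# Toy corpus standing in for crawled web pages (assumption).
PAGES = [
    "术语表: 机器翻译 (machine translation), 信息检索 (information retrieval)",
    "信息检索 (information retrieval) 与 数据挖掘 (data mining)",
    "unrelated page without any pairs",
]


def toy_search(query: str) -> List[str]:
    """Return pages containing the Chinese term at the start of the query."""
    zh = query.split()[0]
    return [page for page in PAGES if zh in page]


if __name__ == "__main__":
    result = snowball({("机器翻译", "machine translation")}, toy_search)
    for zh, en in sorted(result):
        print(zh, "->", en)
```

Starting from one seed pair, each round queries with the pairs found in the previous round, so pairs that co-occur on a page with a known pair are pulled in step by step. The loop ends when a round yields nothing new or `max_rounds` is reached.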
Keywords/Search Tags:bilingual translation pair, web mining, pattern discovery, Machine Learning, Information Extraction