Research On Parallel Resources Mining From The Internet

Posted on: 2011-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Yan
GTID: 2178360305476162
Subject: Computer application technology
Abstract/Summary:
Large-scale bilingual corpora benefit many natural language processing (NLP) applications, such as machine translation and cross-language information retrieval. Although previous studies have invested considerable manpower, material, and financial resources in building bilingual corpora, the existing corpora still fall far short of what is needed to process real text because of their small scale, poor timeliness, and imbalance across domains. In this paper, we focus on building a platform that obtains large-scale bilingual parallel corpora automatically.

As far as we know, the Internet contains a considerable amount of duplicated text, so this problem must be addressed before bilingual resources are extracted. By searching for similar articles with keywords and then extracting the repeated paragraphs from those articles, we can balance precision and recall effectively. The keywords are selected by an iterative method that alternates between sentences and words. In short, eliminating duplicated web pages lays a solid foundation for improving the efficiency of bilingual extraction.

Bilingual mixed web pages contain many embedded parallel sentences. We aim to extract parallel sentence pairs by fully analyzing the structure and content of a given page. First, we acquire bilingual mixed web pages through ordinary text search engines; then, for each bilingual page, we collect candidate parallel sentences by segmenting the whole page into different data regions. To align these candidates, we use word overlap, a length-based measure, and M-N HTML tag node alignment. Finally, the sentence pairs in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, text location, and alignment features.

Website authors usually follow certain rules when naming parallel bilingual web pages. Based on this observation, we present an algorithm to locate parallel pages within bilingual web sites. In addition, identifying parallel texts is treated as a classification problem whose features are the length of the page content, word overlap in anchor text, and word alignment. We locate bilingual web sites using pre-defined string patterns and edit distance.
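
The idea of pairing pages through naming patterns and edit distance can be illustrated with a minimal sketch. It is not the thesis's implementation: the language markers in LANG_MARKERS, the function names, and the distance threshold are all assumptions chosen for illustration, since the abstract does not list the actual string patterns. The sketch replaces language markers in each URL with a placeholder and then keeps URL pairs whose normalized forms are within a small Levenshtein distance.

```python
import re

# Hypothetical language markers; the abstract does not specify the real patterns.
LANG_MARKERS = ["en", "zh", "english", "chinese", "eng", "chs"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(url: str) -> str:
    """Replace standalone language markers (not part of a longer token) with '*'."""
    out = url.lower()
    for marker in LANG_MARKERS:
        out = re.sub(rf"(?<![a-z0-9]){re.escape(marker)}(?![a-z0-9])", "*", out)
    return out


def candidate_pairs(urls_a, urls_b, max_distance=2):
    """Return (url_a, url_b, distance) for pairs whose normalized URLs are close."""
    pairs = []
    for ua in urls_a:
        na = normalize(ua)
        for ub in urls_b:
            d = edit_distance(na, normalize(ub))
            if d <= max_distance:
                pairs.append((ua, ub, d))
    return pairs


if __name__ == "__main__":
    english = ["http://example.com/en/news/001.html",
               "http://example.com/en/about.html"]
    chinese = ["http://example.com/zh/news/001.html",
               "http://example.com/zh/contact.html"]
    for pair in candidate_pairs(english, chinese):
        print(pair)
```

Pairs found this way are only candidates. In the approach described above, they would still be passed to a classifier that uses content length, anchor-text word overlap, and word alignment before being accepted as parallel.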
Keywords/Search Tags: Parallel Resource, Web Mining, Parallel Sentence, Web Page De-duplication