Extracting Parallel Sentences From Large-Scale Web Data

Posted on: 2012-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: C Wang
Full Text: PDF
GTID: 2218330362950434
Subject: Computer Science and Technology
Abstract/Summary:
This thesis presents both a method and an implementation for extracting parallel sentence pairs from web data. For the mining method, we divide web pages into two types, contrast pages and parallel pages, and extract parallel text blocks from each type separately. For contrast pages, a dictionary-based bilingual page filter combined with self-adaptive pattern matching extracts the parallel sections and achieves 81% recall. For parallel pages, we pair pages by URL similarity matching and then extract parallel sections by node matching, which achieves 75% recall. Sentence splitting and alignment are then applied to the parallel text blocks to generate candidate parallel sentences. After refining the sentences, we score the sentence pairs with our scoring method and filter them accordingly. In total, 6.6 million unique sentence pairs were extracted from the web data. In the final quality estimation, the sentence pairs reach above 96% coverage and above 93% usability.

For the mining pipeline, we provide a pipeline designed for large-scale datasets. Contrast pages are partitioned by data volume, and parallel pages are partitioned into closure subsets. We also propose steps that address large-scale processing and incremental updates. Running on 7.5 billion web pages, processing took 48 hours for the contrast page type and 24 hours for the parallel page type.

The extraction method presented in this thesis mines sentence pairs from both contrast pages and parallel pages and produces the final sentence pairs within a limited period of time. By applying incremental updates to the mining pipeline, we handle both the combination of multi-source datasets and the addition of new datasets. Large-scale data processing demonstrates the high usability of the pipeline, and the quality estimation shows that the resulting sentence pairs are feasible to use as a parallel corpus.
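
The URL similarity matching step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the matching rule, so everything here is an assumption made for illustration: the English/Chinese language pair, the `LANG_MARKERS` table, and the helper names `normalize_url` and `pair_pages` are all hypothetical. The sketch replaces a language marker in each URL with a placeholder and pairs pages whose remaining URL is identical.

```python
import re
from collections import defaultdict

# Hypothetical language markers; the thesis abstract does not list the ones it used.
LANG_MARKERS = {
    "en": ["en", "eng", "english"],
    "zh": ["zh", "cn", "chs", "chinese"],
}


def normalize_url(url):
    """Return (language, key) where key is the URL with its first language
    marker replaced by '*', or (None, url) if no marker is found."""
    for lang, markers in LANG_MARKERS.items():
        for marker in markers:
            # Match the marker only as a standalone token, not inside a word.
            pattern = re.compile(
                r"(?<![A-Za-z0-9])" + re.escape(marker) + r"(?![A-Za-z0-9])",
                re.IGNORECASE,
            )
            if pattern.search(url):
                return lang, pattern.sub("*", url, count=1)
    return None, url


def pair_pages(urls):
    """Group URLs by their normalized key and return sorted (en, zh) pairs."""
    groups = defaultdict(dict)
    for url in urls:
        lang, key = normalize_url(url)
        if lang is not None:
            # Keep the first URL seen for each language under a given key.
            groups[key].setdefault(lang, url)
    return sorted(
        (g["en"], g["zh"]) for g in groups.values() if "en" in g and "zh" in g
    )
```

Calling `pair_pages` on a list of crawled URLs returns candidate page pairs only. A real system would still need to check the page content, for example with the node matching step the abstract describes.
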
Keywords/Search Tags: parallel sentence mining, sentence scoring, contrast page selection, parallel page predicate, MapReduce