Web-Based English-Chinese Bilingual Parallel Sentences

Posted on: 2016-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: L T Pan
GTID: 2208330470470570
Subject: Computer technology
Abstract/Summary:
Parallel corpora have become a critical resource for multilingual natural language processing and are gaining increasing recognition in the field. An English-Khmer parallel bilingual corpus is an essential and fundamental resource for Khmer language information processing, and building one has significant research value because it will advance Khmer natural language processing technology. This thesis studies the mining of bilingual web documents, the generation of candidate parallel web documents, the identification of bilingual parallel web pages, and the extraction of bilingual parallel sentences. The main work is as follows:
(1) Because existing spiders return many noisy web documents, we wrote a spider with the HtmlUnit API to improve crawl quality. Using website templates that we built, it pinpoints web resources and collects the information of interest. The crawled web resources are saved in a database and form the foundation for the subsequent extraction of parallel bilingual web documents.
(2) The acquisition of candidate web documents has depended too heavily on the similarity of web page names. We present two approaches to this problem. The first uses title similarity to select candidate web page pairs, which suits bilingual web pages without obvious limiting conditions. The second uses Structured Query Language (SQL) queries to select the records that meet the limiting conditions as candidate parallel web pages.
(3) The recognition of bilingual parallel web pages. First, each candidate web page pair is represented as two vectors based on the vector space model (VSM). The cosine similarity of the two vectors is then computed to recognize parallel web page pairs. This method is highly accurate, but it is not suitable for a large number of candidate pairs. We therefore present a second method that treats the identification of parallel web page pairs as a classification of the candidate web pages: an effective maximum entropy classifier is trained to filter the non-parallel web pages out of the candidate pairs.
(4) The extraction of parallel sentences. A binary maximum entropy classifier is used to verify whether a candidate sentence pair is truly parallel. Four features are used to train the maximum entropy model: text length, the lexicalization ratio of the text, sentence position, and symbol characteristics.
(5) We design and implement a prototype system for extracting parallel sentence pairs, which provides basic resources for further study of Khmer natural language processing.
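The cosine-similarity step in item (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that both pages are already tokenized and mapped into a shared term space (for instance, Khmer terms translated through a bilingual lexicon), that raw term frequencies serve as vector weights, and that a threshold of 0.8 marks a pair as parallel. The function names and the threshold are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a bag-of-words term-frequency vector (assumed weighting scheme)."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_parallel(tokens_a, tokens_b, threshold=0.8):
    """Hypothetical decision rule: a pair is parallel if similarity >= threshold."""
    similarity = cosine_similarity(term_vector(tokens_a), term_vector(tokens_b))
    return similarity >= threshold
```

Two token lists of three distinct terms each that share two terms score 2/3, which falls below the illustrative threshold. Because every candidate pair needs its own vector comparison, the cost grows with the number of pairs, which is consistent with the scaling concern raised in item (3).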
Keywords/Search Tags:Khmer, parallel corpora, bilingual parallel web page, bilingual parallel sentence pair, Maximum Entropy Model