Web-Based English-Chinese Bilingual Parallel Sentences

Posted on:2016-07-23

Degree:Master

Type:Thesis

Country:China

Candidate:L T Pan

GTID:2208330470470570

Subject:Computer technology

Abstract/Summary:

Parallel corpora have become a critical resource for multilingual natural language processing and are gaining increasing recognition in the field. English-Khmer parallel corpora are an essential and fundamental resource for Khmer language information processing, so building one has significant research value and will advance Khmer natural language processing technology. This thesis studies the mining of bilingual web documents, the generation of candidate parallel web documents, the identification of bilingual parallel web pages, and the extraction of bilingual parallel sentences. The main work is as follows:

(1) Because existing spiders collect many noisy web documents, we wrote a spider with the HtmlUnit API to improve crawl quality. Using website templates that we prepared, it pinpoints the web resources and collects the information we are interested in. The crawled resources, which are the foundation for the subsequent extraction of parallel bilingual web documents, are saved in a database.

(2) The acquisition of candidate web documents depends too heavily on the similarity of web page names. We present two approaches to this problem. The first uses title similarity to select candidate web page pairs and suits bilingual web pages without obvious constraints. The second uses Structured Query Language (SQL) queries to select the records that meet a limiting condition as candidate parallel web pages.

(3) The recognition of bilingual parallel web pages. First, each candidate web page pair is converted into two vectors based on the vector space model (VSM). The cosine similarity of the two vectors is then computed to decide whether the pair is parallel. This method has high accuracy, but it is not suitable for a large number of candidate pairs, so we present a second method: the identification of parallel web page pairs is treated as a classification of the candidate web pages, and non-parallel pages are filtered out of the candidate pairs by training an effective maximum entropy classifier.

(4) The extraction of parallel sentences. A binary maximum entropy classifier is used to verify whether a candidate sentence pair is truly parallel. Four features are used to train the maximum entropy model: text length, the lexicalization ratio of the text, sentence position, and symbol features.

(5) The design and implementation of a prototype system for extracting parallel sentence pairs, which provides basic resources for further study of Khmer natural language processing.
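The cosine-similarity test in step (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes both pages have already been tokenized and mapped into a shared vocabulary (for example through a bilingual dictionary), which the abstract does not specify. The function names and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a sparse term-frequency vector (a simple VSM representation)."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two sparse vectors.

    Returns 0.0 when either vector is empty, to avoid dividing by zero.
    """
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_parallel(tokens_a, tokens_b, threshold=0.8):
    """Accept a candidate pair when its similarity reaches the threshold.

    The threshold value is an illustrative assumption, not taken from the thesis.
    """
    return cosine_similarity(term_vector(tokens_a), term_vector(tokens_b)) >= threshold
```

Each comparison is cheap, but scoring every candidate pair against a hand-tuned threshold scales poorly, which is consistent with the abstract's move to a trained classifier for large candidate sets.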

Keywords/Search Tags:

Khmer, parallel corpora, bilingual parallel web page, bilingual parallel sentence pair, Maximum Entropy Model

Related items

1	Research On Key Technology In Mining Web Bilingual Corpora
2	Research On Large-Scale Bilingual Parallel Corpus Extraction From The Web
3	Comparable Corpus Acquisition Of Cambodian-Chinese Parallel Sentence Pairs Based On Bidirectional Recurrent Neural Network
4	Research On Chinese-Thai Bilingual Corpus Mining Method For Internet News
5	Bilingual Word Representation Learning From Non-parallel Corpora
6	Design And Implementation Of Automatic Construction System Of English-chinese Parallel Corpus
7	The Study Of The Alignment Method In The Chinese-English Parallel Corpora
8	Research Of Bilingual Sentence Alignment Served The Chinese-Uyghur Machine Translation System
9	Mining Bilingual Parallel Corpora From Web Automatically And Its Application In Statistical Machine Translation
10	A Study On The Key Technologies Of Web-Based Indonesian-Chinese Parallel Corpus Construction