
Design and Implementation of an Automatic Construction System for an English-Chinese Parallel Corpus

Posted on: 2019-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: S H Huang
Full Text: PDF
GTID: 2428330590475365
Subject: Computer technology
Abstract/Summary:
With the development of research in natural language processing, corpora, and parallel corpora in particular, play an increasingly important role as a basic resource supporting natural language processing technology. Traditional methods of building parallel corpora rely entirely on manual selection, which is time-consuming and laborious. At the same time, the large volume of bilingual parallel resources on the Internet has attracted the attention of scholars, and work on automatically constructing parallel corpora from bilingual Web resources has gradually begun. However, many challenges remain in accurately locating and extracting bilingual parallel text from massive Internet resources and in using the acquired corpora efficiently. To address these challenges, the main work of this thesis consists of the following parts:
(1) Design and implementation of a method for discovering bilingual Websites. The method uses two external features of a bilingual Website (the anchor text feature and the URL feature) to construct a query keyword dictionary. It then submits the keywords in the dictionary to a search engine one by one. Finally, the URLs of bilingual Websites are obtained by parsing the Webpage URLs in the search results.
(2) Design and implementation of a method for extracting bilingual Webpage pairs and verifying that they are mutual translations. The method uses depth-first search to crawl all Webpage pairs in a bilingual Website that match the URL pattern of bilingual Webpage pairs. It then extracts the features (structure features and content features) of each Webpage pair to form a feature vector. Given this feature vector, a trained classifier verifies whether the two pages are translations of each other.
(3) Optimization of the sentence extraction and alignment methods for the text in bilingual Webpage pairs. The method first aligns the two pages of a Webpage pair according to their DOM tree structure and extracts their text line by line. HTML elements are then used to generate text-alignment anchors. Finally, the text between anchors is aligned using a lexicon-based sentence-alignment method.
(4) Development of a corpus search platform. A two-way Chinese and English index was built for the corpus, and the corpus search service was built on top of this index.
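
The URL-pattern step in part (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the language markers (`en`, `zh`, `cn`, and so on), the function names, and the example URLs are all assumptions made for illustration. The idea shown is only that two URLs that become identical once a language marker in the path is replaced by a placeholder are candidate translation pairs, which a classifier would still need to verify.

```python
from urllib.parse import urlsplit

# Hypothetical language markers; a real site may use other conventions.
EN_MARKERS = {"en", "eng", "english"}
ZH_MARKERS = {"zh", "cn", "chs", "chinese"}


def language_neutral_key(url):
    """Return (key, lang) for a URL whose path contains a language marker.

    The key is the host plus the path with the first language marker
    replaced by a placeholder. Returns None if no marker is found.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        low = segment.lower()
        if low in EN_MARKERS:
            lang = "en"
        elif low in ZH_MARKERS:
            lang = "zh"
        else:
            continue
        segments[i] = "{lang}"
        return (parts.netloc.lower(), "/".join(segments)), lang
    return None


def pair_urls(urls):
    """Group URLs by language-neutral key and return (english, chinese) pairs."""
    by_key = {}
    for url in urls:
        result = language_neutral_key(url)
        if result is None:
            continue  # no language marker, so not a candidate
        key, lang = result
        by_key.setdefault(key, {})[lang] = url
    return [
        (group["en"], group["zh"])
        for group in by_key.values()
        if "en" in group and "zh" in group
    ]


if __name__ == "__main__":
    crawled = [
        "http://example.com/en/news/1.html",
        "http://example.com/zh/news/1.html",
        "http://example.com/en/about.html",
        "http://example.com/contact.html",
    ]
    for en_url, zh_url in pair_urls(crawled):
        print(en_url, "<->", zh_url)
```

Matching on URL shape alone only yields candidates; in the method described above, the structure and content features of each candidate pair are what the trained classifier uses to accept or reject it.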
Keywords/Search Tags:Parallel Corpus, Corpus Automatic Construction, Bilingual Website, Bilingual Webpage Pair, Sentence Alignment