
Research On Chinese-Thai Bilingual Corpus Mining Method For Internet News

Posted on: 2019-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Sun
Full Text: PDF
GTID: 2438330563957644
Subject: Control engineering
Abstract/Summary:
Machine translation and cross-language information processing require large-scale, high-quality bilingual corpora. However, few bilingual corpora are publicly available, especially for rarely used (low-density) languages. How to obtain large amounts of bilingual text automatically is therefore an active research problem. In recent years, with the integration of the world economy and the rapid development of the Internet, tens of thousands of news pages have become available as a source for mining bilingual resources, and mining bilingual corpora from Internet news has become a key research topic in natural language processing. After analyzing the current state of research on bilingual corpus mining, this thesis applies comparable-corpus techniques to bilingual Internet news pages in three tasks: building a Chinese-Thai comparable corpus, extracting parallel sentences from the comparable corpus, and extracting named entity translations from the comparable corpus. The specific contents are as follows:

(1) Creating a Chinese-Thai comparable corpus. Keyword extraction and bilingual document similarity calculation are the main research points in constructing the Chinese-Thai comparable corpus. Because current keyword extraction techniques do not cover the topics of a text comprehensively, this thesis proposes a keyword extraction method based on TFITF. First, a bilingual topic model is trained to compute the weight of each word with respect to the topic; next, the weight of each word with respect to the document is computed; finally, the candidate keywords are combined. The keywords are then translated into Thai with a Chinese-Thai bilingual dictionary and submitted to a search engine to retrieve the corresponding Thai documents, which form candidate comparable document pairs. The similarity of each candidate pair is computed, and the pairs with high similarity are retained to form the comparable document collection. Experimental results show that the keywords extracted by this method are relatively accurate, so the corresponding comparable documents can be found accurately.

(2) Extracting parallel sentences from the Chinese-Thai comparable corpus. Parallel sentence extraction is treated as a binary classification problem. First, all possible sentence pairs are generated from comparable documents on the same topic by taking the Cartesian product. Candidate parallel sentence pairs are then selected according to the sentence length ratio and the number of translated words. Finally, the candidate pairs are identified by a classifier trained on the selected Chinese-Thai sentence features. Experiments show that the selected features help train the classifier and improve the accuracy of parallel sentence recognition.

(3) Extracting named entity translations from the Chinese-Thai comparable corpus. A classification model that fuses multiple features is proposed for extracting Chinese-Thai named entity translations. First, named entities are extracted from the Chinese document set and the Thai document set; then the similarity of the candidate named entities is computed under different weights; finally, a classifier is used to classify the candidate named entities.
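
The candidate-selection step described in part (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `candidate_pairs`, the use of token counts as sentence length, the ratio bounds, the minimum number of dictionary matches, and the toy lexicon are all assumptions chosen for illustration, and sentences are assumed to be word-segmented already. The classifier that makes the final decision is not shown.

```python
from itertools import product


def candidate_pairs(zh_sents, th_sents, lexicon,
                    min_ratio=0.5, max_ratio=2.0, min_hits=2):
    """Filter the Cartesian product of Chinese and Thai sentences.

    zh_sents, th_sents: lists of token lists (already word-segmented).
    lexicon: dict mapping a Chinese word to a set of Thai translations.
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    kept = []
    for zh, th in product(zh_sents, th_sents):
        if not zh or not th:
            continue  # skip empty sentences to avoid division by zero
        # Filter 1: sentence length ratio, measured here in tokens.
        ratio = len(zh) / len(th)
        if not (min_ratio <= ratio <= max_ratio):
            continue
        # Filter 2: number of Chinese words with a dictionary
        # translation that appears in the Thai sentence.
        th_vocab = set(th)
        hits = sum(1 for w in zh if lexicon.get(w, set()) & th_vocab)
        if hits >= min_hits:
            kept.append((zh, th, ratio, hits))
    return kept


# Toy data: one genuinely parallel pair plus unrelated sentences.
zh_doc = [["猫", "吃", "鱼"], ["今天", "天气", "很", "好"]]
th_doc = [["แมว", "กิน", "ปลา"], ["ฝน"]]
toy_lexicon = {"猫": {"แมว"}, "吃": {"กิน"}, "鱼": {"ปลา"}}

kept = candidate_pairs(zh_doc, th_doc, toy_lexicon)
print(len(kept), "candidate pair(s) kept out of",
      len(zh_doc) * len(th_doc))
```

Only the pairs that survive both filters would be passed on to the trained classifier, which keeps the number of pairs it has to score far below the full Cartesian product.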
Keywords/Search Tags: Comparable Corpora, Bilingual Topic Model, Keywords, Parallel Sentence Pairs, Named Entity Translation Pairs