
The Acquisition Of Parallel Sentence For Statistical Machine Translation Based On Internet

Posted on: 2016-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: B W Zhang
Full Text: PDF
GTID: 2308330479490068
Subject: Computer Science and Technology
Abstract/Summary:
Statistical machine translation (SMT), which emerged in the 1990s, is a statistics-based approach to machine translation. It extracts translation rules automatically from a bilingual corpus without manual intervention. As statistical translation models have developed, it has become easier and easier to build a relatively mature machine translation platform. A mature translation model, however, needs a relatively high-quality corpus to support it, and building a large corpus by hand takes a great deal of manpower and material resources. This is especially true for China's minority languages: China has 56 ethnic groups, more than 80 languages, and about 30 writing systems. The level of automatic processing differs from language to language, so speakers of these languages cannot obtain information as readily as speakers of widely used languages such as English and Chinese. We believe that this long-term asymmetry of information exchange is one of the important factors behind the differences between regions and cultures. Whether automatic computer processing can reduce or even eliminate this asymmetry is one of the problems facing researchers.
Therefore, this thesis automatically acquires parallel data from the Internet and processes the resulting parallel corpus so that it can be used to build a good translation model. The specific research contents and results are as follows:
(1) The distribution of minority-language content currently on the Internet is analyzed and its characteristics are summarized. Based on these characteristics, a web crawler for minority-language websites is designed and implemented.
(2) Earlier dictionary extraction methods are analyzed and their advantages and disadvantages are summarized. Drawing on recent studies that use the label propagation algorithm, the one-dimensional dictionary extraction method is extended to a dictionary extraction method based on two-dimensional graph label propagation, and a corresponding dictionary extraction tool is implemented.
(3) Based on the observed features of parallel sentence pairs and on past research, a sentence-level, feature-based method for analyzing the quality of parallel sentence pairs is designed, and a corresponding quality analysis tool is implemented as a first step toward machine translation.
(4) A demonstration platform is designed and implemented that uses Chinese as the hub (pivot) language, building on the pivot-language translation method and the work of the previous chapters.
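The abstract does not describe the two-dimensional label propagation method in detail, so the following is only a generic sketch of plain label propagation over a weighted word-similarity graph, not the method of the thesis. The function names, the graph, and the labels "A" and "B" are all hypothetical. Seed nodes stand for words with a known dictionary entry; every other node repeatedly takes the weighted average of its neighbors' label scores.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=20):
    """Generic label propagation on an undirected weighted graph.

    edges: dict mapping (node_a, node_b) -> positive similarity weight
    seeds: dict mapping node -> {label: score} for nodes with known labels
    Returns a dict mapping every node in the graph -> {label: score}.
    All names and data shapes here are illustrative assumptions.
    """
    # Build a symmetric adjacency list from the edge dictionary.
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    # Seeds start with their known scores; other nodes start empty.
    scores = {node: dict(seeds.get(node, {})) for node in neighbors}

    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:
                # Clamp seed nodes to their known labels on every pass.
                updated[node] = dict(seeds[node])
                continue
            total = sum(neighbors[node].values())
            merged = defaultdict(float)
            # Weighted average of the neighbors' current label scores.
            for other, weight in neighbors[node].items():
                for label, score in scores[other].items():
                    merged[label] += weight * score / total
            updated[node] = dict(merged)
        scores = updated
    return scores


def best_label(scores, node):
    """Return the highest-scoring label for a node, or None if it has none."""
    dist = scores.get(node, {})
    return max(dist, key=dist.get) if dist else None


# Toy chain graph: w1 and w4 are seeds, w2 and w3 are unlabeled.
toy_edges = {("w1", "w2"): 0.9, ("w2", "w3"): 0.1, ("w3", "w4"): 0.9}
toy_seeds = {"w1": {"A": 1.0}, "w4": {"B": 1.0}}
toy_scores = propagate_labels(toy_edges, toy_seeds)
print(best_label(toy_scores, "w2"), best_label(toy_scores, "w3"))
```

In this toy graph each unlabeled node ends up with the label of the seed it is most strongly connected to. A dictionary extraction tool would presumably use word or phrase nodes and candidate translations as labels, but that mapping is an assumption rather than something the abstract states.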
Keywords/Search Tags: Corpus acquisition, Parallel sentence quality analysis, Label propagation, Dictionary extraction