
Tibetan-Chinese Bilingual Parallel Corpus Construction Method And Key Technology Research

Posted on: 2019-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: S Z M Ba
GTID: 2348330566466246
Subject: Computer application technology
Abstract/Summary:
With the rapid spread of Internet technology in Tibetan areas and the development of Tibetan information technology, many research institutes and universities in China and abroad have begun research on Tibetan language information processing. Historically, this research falls into two areas: character-level processing, that is, the input, storage, and output of Tibetan characters; and language-level processing. Current research is shifting toward language processing at the level of words, sentences, paragraphs, and whole texts. The construction of a Tibetan-Chinese bilingual parallel corpus belongs to this level, so the topic of this thesis follows the development trend of Tibetan information processing technology.

Constructing Tibetan-Chinese bilingual parallel corpora is important foundational work for Tibetan-Chinese machine translation and bilingual comparison. The scale and quality of the bilingual corpus directly affect Tibetan-Chinese machine translation results. In recent years, with the wide application of big data technology, such corpora have become increasingly important in Tibetan information processing. However, related research focuses mainly on the application of bilingual parallel corpora, and the techniques for constructing large-scale Tibetan-Chinese parallel corpora have received less attention. In natural language processing, English-Chinese parallel corpus construction technology has developed rapidly, whereas the study of Tibetan-Chinese parallel corpus construction is still in its infancy, and both corpus size and the related technologies leave considerable room for research. The topic therefore has both research and application value.

Building on a study of Chinese-English parallel corpus construction methods in China and abroad, and taking into account the characteristics of the Tibetan language, this thesis identifies the key issues in constructing a Tibetan-Chinese bilingual parallel corpus aligned at different levels, such as whole texts and paragraphs, proposes corpus construction methods, and uses them to build a Tibetan-Chinese bilingual parallel corpus aligned at each level. Tests show that the expected results were achieved.

The main work of this thesis is as follows:

1. Analyzed the common methods for constructing bilingual parallel corpora between English, Chinese, and ethnic minority languages. Considering the characteristics of Tibetan text, proposed a hierarchical structure for Tibetan-Chinese bilingual parallel corpus construction and designed the overall construction scheme of the corpus.

2. Studied several methods for collecting and preprocessing Tibetan-Chinese bilingual text, selected the most effective collection method to complete the collection, and preprocessed the collected text with steps such as normalization, deletion, and network tagging.

3. Extracted text features from the Tibetan-Chinese bilingual texts, such as the topic of the article, the number of paragraphs, numerals (times, quantities, etc.), and abbreviations. Based on these features, studied two methods for calculating the similarity of bilingual texts: a topic-based method and a combined method based on topic and features. Both methods were implemented and compared on Tibetan-Chinese text alignment. The combined method based on topic and features gave the best overall results and was chosen to build the text-aligned bilingual corpus.

4. Starting from the text-aligned Tibetan-Chinese bilingual parallel corpus, segmented the texts into paragraphs at carriage returns and achieved paragraph alignment by calculating the similarity between bilingual paragraphs.

5. Starting from the paragraph-aligned corpus, first split the text into sentences, using marks such as the period, question mark, exclamation mark, and Tibetan single-character symbols as sentence boundaries. Then studied a length-based bilingual sentence alignment method and a word-based bilingual sentence alignment method. Both were implemented and their results compared. The word-based method performed better and was chosen to achieve sentence alignment in the corpus.

6. Starting from the sentence-aligned corpus, studied a statistical bilingual word alignment method and a dictionary-based bilingual word alignment method. The dictionary-based method was chosen to achieve word alignment in the corpus, and the experimental results were analyzed.

Based on the existing research foundation, this thesis achieved the following results:

1. Collected and preprocessed real Tibetan-Chinese texts from the Internet to establish a Tibetan-Chinese bilingual corpus.

2. Studied methods for aligning bilingual texts at the levels of whole texts, paragraphs, sentences, and words, compared the different methods, and, taking into account the characteristics of the Tibetan language, proposed alignment methods for each level of the Tibetan-Chinese bilingual corpus and applied them in practice.

3. Established Tibetan-Chinese bilingual alignment data at the levels of whole texts, paragraphs, sentences, and words, laying a foundation for future research.
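
To make the sentence-level step (item 5 of the main work) more concrete, the following is a minimal sketch of punctuation-based sentence splitting followed by a simplified length-based alignment. It is not the thesis's implementation. The names `split_sentences` and `align_by_length`, the cost function, and the parameters `ratio`, `skip_penalty`, and `merge_penalty` are all hypothetical. Treating the Tibetan shad (U+0F0D) as the sentence-final mark is likewise an assumption, and the relative-length cost is a plain stand-in for a probabilistic length model.

```python
import re

# Assumed sentence-final marks (not taken from the thesis):
TIBETAN_END = "\u0f0d"      # TIBETAN MARK SHAD
CHINESE_END = "。？！?!"     # full-width and ASCII period/question/exclamation


def split_sentences(text, end_marks):
    """Split text into sentences, keeping each sentence's trailing end marks."""
    marks = re.escape(end_marks)
    pattern = "[^" + marks + "]+[" + marks + "]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


def align_by_length(src, tgt, ratio=1.0, skip_penalty=3.0, merge_penalty=0.3):
    """Align two sentence lists by dynamic programming over alignment beads.

    Allowed beads are 1-1, 1-0, 0-1, 2-1 and 1-2. The cost of a bead is the
    relative mismatch between the source length (scaled by `ratio`) and the
    target length, measured in characters. Skips and merges pay a fixed
    extra penalty. Returns a list of (source_sentences, target_sentences).
    """
    def cost(s_len, t_len):
        if s_len == 0 or t_len == 0:
            return skip_penalty
        expected = s_len * ratio
        return abs(expected - t_len) / max(expected, t_len)

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                total = best[i][j] + cost(s_len, t_len)
                if di + dj == 3:  # 2-1 or 1-2 merge
                    total += merge_penalty
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

In use, each side of an aligned paragraph pair would be passed through `split_sentences` with its own end marks, and the two resulting lists handed to `align_by_length`. The `ratio` parameter would need to be estimated from data, since Tibetan and Chinese renderings of the same sentence generally differ in character count. A word-based method, which the abstract reports as performing better, would replace the length cost with a score derived from bilingual lexical matches.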
Keywords/Search Tags: Tibetan-Chinese, parallel corpus, paragraph passage, sentence and word alignment