Research on Chinese-Lao Bilingual Sentence Alignment Methods

Posted on: 2018-09-19  Degree: Master  Type: Thesis
Country: China  Candidate: Z Q Rang  Full Text: PDF
GTID: 2358330518961964  Subject: Software engineering
Abstract/Summary:
A bilingual corpus stores semantically equivalent text in two languages and is an important resource for bilingual language processing. It is widely used in machine translation, cross-language information retrieval, word sense disambiguation, translation knowledge extraction and related fields. Alignment is the core of bilingual corpus processing, and alignment quality directly affects downstream natural language processing work. Sentence alignment is text alignment at the sentence level: a technique for finding semantically matching sentence pairs in a bilingual corpus. Based on the linguistic characteristics of Chinese and Lao, this thesis studies how to build a Chinese-Lao bilingual parallel corpus, how to select high-quality text features, and how to extract Chinese-Lao parallel sentences by integrating multiple features. The following research work is carried out.

(1) The thesis explores how to construct a bilingual parallel corpus. It analyzes the distribution of parallel text on Wikipedia-based multilingual websites and establishes a construction strategy for the bilingual parallel corpus that consists of crawling bilingual text, extracting the text, and aligning sentences.

(2) The thesis analyzes the characteristics of the Lao language and summarizes the similarities and differences between Chinese and Lao syntactic structures. It then selects a set of Chinese-Lao text features, namely a sentence length ratio feature, a dictionary matching feature, a word co-occurrence feature and a numeral (digit) feature, in preparation for the subsequent extraction of Chinese-Lao parallel sentences.

(3) Building on the study of how to extract Chinese-Lao parallel sentence pairs, the thesis proposes a multi-feature method for extracting bilingual parallel sentence pairs. First, the Chinese-Lao bilingual text corpus is preprocessed, and candidate parallel sentence pairs are obtained through manual filtering and screening. Next, guided by the syntactic characteristics of the two languages, a set of Chinese-Lao text features is designed, and a support vector machine model and a maximum entropy model are trained on the combined features. Finally, experiments compare the two classifiers and the contribution of each text feature to alignment quality.

The experimental results show that the support vector machine is better suited to this method, and that the precision obtained with all text features combined is 70.46%.
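The abstract names the features but does not define them, so the following Python sketch is only one plausible reading of three of them (sentence length ratio, dictionary matching, and the numeral feature); the word co-occurrence feature is left out because it needs corpus statistics. All function names, the character-based length measure, the Jaccard overlap for numbers, and the three-entry toy lexicon are assumptions made for illustration, not the thesis's definitions. Both sentences are assumed to be already word-segmented, since neither Chinese nor Lao separates words with spaces.

```python
import re

def length_ratio(zh_sentence, lo_sentence):
    """Shorter length divided by longer length, in characters (assumed measure)."""
    a, b = len(zh_sentence), len(lo_sentence)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)

def dictionary_match(zh_tokens, lo_tokens, lexicon):
    """Fraction of Chinese tokens with a lexicon translation found in the Lao tokens."""
    if not zh_tokens:
        return 0.0
    lo_set = set(lo_tokens)
    hits = sum(1 for tok in zh_tokens if lexicon.get(tok, set()) & lo_set)
    return hits / len(zh_tokens)

def digit_match(zh_sentence, lo_sentence):
    """Jaccard overlap of the numbers appearing in the two sentences."""
    zh_nums = set(re.findall(r"\d+", zh_sentence))
    lo_nums = set(re.findall(r"\d+", lo_sentence))
    if not zh_nums and not lo_nums:
        return 1.0  # assumption: no numbers on either side counts as agreement
    return len(zh_nums & lo_nums) / len(zh_nums | lo_nums)

def feature_vector(zh_tokens, lo_tokens, lexicon):
    """Combine the three features into one vector for a downstream classifier."""
    zh_sentence = "".join(zh_tokens)
    lo_sentence = "".join(lo_tokens)
    return [
        length_ratio(zh_sentence, lo_sentence),
        dictionary_match(zh_tokens, lo_tokens, lexicon),
        digit_match(zh_sentence, lo_sentence),
    ]

# Hypothetical toy lexicon: Chinese word -> set of Lao translations.
TOY_LEXICON = {
    "我": {"ຂ້ອຍ"},
    "爱": {"ຮັກ"},
    "老挝": {"ລາວ"},
}

if __name__ == "__main__":
    zh = ["我", "爱", "老挝"]
    lo = ["ຂ້ອຍ", "ຮັກ", "ລາວ"]
    print(feature_vector(zh, lo, TOY_LEXICON))
```

Vectors of this kind, one per candidate sentence pair and labeled parallel or non-parallel, are what a classifier such as a support vector machine or a maximum entropy model would be trained on. Each feature is kept in the range 0 to 1 so that no single feature dominates by scale, which is a design choice of this sketch rather than something the abstract specifies.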
Keywords/Search Tags: Chinese-Lao, sentence alignment, feature selection, support vector machine, maximum entropy