| The construction of cross-language parallel corpora not only advances research in natural language processing but also supports applications such as machine translation and bilingual dictionary compilation. In recent years, with growing research on humanities computing and the strategic push to promote Chinese culture abroad, the construction of cross-language parallel corpora of pre-Qin classics has become increasingly important. Such corpora can serve both as a main carrier for transmitting China's cultural heritage and as a basis for cross-language information processing. Cross-language parallel corpus construction involves not only coarse-grained chapter and paragraph alignment but also fine-grained sentence and word alignment. Building a multi-level parallel corpus from cross-language classics can further promote knowledge discovery in the pre-Qin classics, support the extraction of entity words from them, and enable the construction of a knowledge graph of the pre-Qin classics, all of which helps promote traditional Chinese culture. However, most existing ancient Chinese-English bilingual corpora achieve only text-level alignment, and fine-grained bilingual alignment studies are few. In terms of research objects, previous work on multi-level corpus alignment covers Chinese-English, Chinese-Uyghur, and Tibetan-Chinese bilingual parallel corpora, among others. However, there has been no complete and systematic study of alignment at the paragraph, sentence, and lexical levels for ancient Chinese-English bilingual corpora. Building on a review of previous parallel corpus research and taking a corpus of ancient Chinese classics as its data source, this paper combines the characteristics of ancient Chinese-English bilingual corpora to explore the automatic alignment of bilingual parallel corpora at the paragraph
level, sentence level, and lexical level. A multi-level parallel corpus of ancient Chinese and English is constructed, and a retrieval platform for parallel corpora of historical classics is built to put the corpus to practical use. The research work of this paper is as follows: (1) Paragraph-level alignment of the ancient Chinese-English bilingual corpus. A text-aligned ancient Chinese-English parallel corpus is obtained by web crawling. Given the characteristics of the ancient Chinese-English bilingual corpus, two methods are compared: classification-based paragraph alignment and carriage-return-based paragraph alignment. The classification-based method uses a length feature, an alignment pattern feature, and an entity information translation feature. Comparison of the experimental results shows that the classification-based method with multi-feature fusion outperforms the carriage-return-based method, with an F-value of up to 97.6%. (2) Sentence-level alignment of the ancient Chinese-English bilingual corpus. After paragraph alignment, sentence alignment is studied further. Given the characteristics of the corpus, a sentence length feature, an alignment pattern feature, an entity information translation feature, and a punctuation feature are selected as the features for the sentence alignment experiments. Comparative sentence alignment experiments were conducted on 122,590 candidate sentence pairs using global classification and sequence labeling. The experimental results show that: first, entity information features improve ancient Chinese-English sentence alignment; second, sentence alignment performs best when the four features are combined; third, with all four features, the LSTM-CRF model based on sequence labeling aligns sentences better than the SVM model
based on global classification, with the LSTM-CRF model reaching an F-value of 94.34%. (3) Word-level alignment of the ancient Chinese-English bilingual corpus. For word alignment, the SIKU-Bert model was used to segment and proofread 25,418 sentences from the experimental corpus. Two generative models, IBM Model 2 and IBM Model 3, were selected for the experiments based on the characteristics of the ancient Chinese-English bilingual corpus. The results show that IBM Model 3 aligns words better than IBM Model 2, with IBM Model 3 reaching an F-value of 68.98%. (4) Application platform for the ancient Chinese-English bilingual corpus. Following the research on paragraph, sentence, and word alignment between ancient Chinese and English, the best-performing models are used to automatically align 15 classical books at multiple levels, yielding a multi-level ancient Chinese-English parallel corpus containing 4,746 paragraph pairs, 25,418 sentence pairs, and 11,891 vocabulary pairs. On this basis, a parallel corpus retrieval system for historical classics is built, which provides four functions: retrieval of similar ancient texts, and ancient Chinese-English translation at the paragraph, sentence, and vocabulary levels... |
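
The abstract reports an F-value for each alignment level but does not say how it is computed. As a minimal sketch, assuming the F-value is the usual balanced harmonic mean of precision and recall over predicted versus gold-standard alignment links, it could be calculated as below. The function name `alignment_prf` and the toy index pairs are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Iterable, Tuple

# An alignment link pairs a source-side unit with a target-side unit,
# e.g. (ancient Chinese sentence index, English sentence index).
Link = Tuple[Hashable, Hashable]


def alignment_prf(
    predicted: Iterable[Link], gold: Iterable[Link]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. gold alignment links.

    Assumption: the F-value is the balanced F1 score; the paper may use
    a different weighting or a different unit of evaluation.
    """
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)  # links found in both sets
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: four gold links, three predicted, two of them correct.
gold_links = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted_links = [(0, 0), (1, 1), (2, 3)]

p, r, f = alignment_prf(predicted_links, gold_links)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

The same function applies unchanged at the paragraph, sentence, or word level, since it only compares sets of index pairs.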