
Research and Application of Key Technologies of Chinese-English Parallel Corpus

Posted on: 2022-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Wen
Full Text: PDF
GTID: 2518306488971839
Subject: Computer application technology
Abstract/Summary:
In recent years, with the wide application of big data technology, corpus technology has developed rapidly across many languages and made great progress in many fields. In natural language processing, Chinese-English machine translation has advanced quickly, but building Chinese-English bilingual corpora remains difficult. The main open problems are corpus collection, word segmentation, and corpus alignment, and each of these steps can affect corpus quality. Only by studying every stage of corpus construction in depth can a usable corpus be built. This thesis studies and designs methods for corpus collection, segmentation, alignment, and application, a topic of both research and practical significance. The main work is as follows:

(1) Collection and preprocessing of the Chinese-English bilingual corpus. To improve the efficiency of corpus collection, a web crawler algorithm is designed around the characteristics of HTML and used to collect the bilingual corpus. The collected corpus is then preprocessed: font and paragraph formatting are unified, web markup is removed, and the English character encoding is normalized. In practical tests, the crawler algorithm greatly improved the speed of corpus collection.

(2) Word segmentation of the Chinese and English corpus. Traditional Chinese word segmentation struggles with ambiguous words and unknown words, so a new-word recognition algorithm is designed to improve their recognition rate. Candidate new words are first selected through named entity recognition, and the corpus is filtered. N-gram word frequency statistics are then applied to the filtered corpus to delete low-frequency words, and the remaining candidates are further screened by computing information entropy and left-right adjacency entropy. The identified new words are compiled into a new-word dictionary, which is combined with the jieba word segmentation algorithm to segment the corpus. Tests on the MSR and PKU data sets show that the improved dictionary-based jieba segmentation algorithm improves both segmentation accuracy and the new-word recognition rate.

(3) Alignment of the Chinese and English corpora to construct the parallel corpus. Based on the characteristics of corpus paragraphs, a paragraph alignment algorithm based on paragraph markers is designed; cosine similarity is used to check the aligned paragraphs, and pairs with low similarity are deleted. After paragraph alignment, a sentence alignment algorithm is designed that adds feature information to the dictionary. Sentences are first split at sentence-ending symbols; the number of matching words in the anchor dictionary and the common dictionary is then counted, a weighted score is computed, and a threshold is set as the alignment criterion. Experimental results show that the proposed algorithm improves the accuracy of dictionary-based sentence alignment.

(4) Construction of a machine translation system. Using the Chinese-English bilingual corpus built in the preceding steps, an LSTM neural network based machine translation system is designed. A translation model is first trained on the constructed parallel corpus; the trained model is then packaged as an API using Flask, and a translation interface is built that performs machine translation by calling the API. In practical tests, the system achieved Chinese-English translation.
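
The new-word screening in step (2) combines an N-gram frequency cut with left-right adjacency entropy. The abstract does not give the exact formulas or thresholds, so the following is only a minimal sketch of the general technique, not the thesis's implementation. The function names (`branch_entropy`, `candidate_new_words`) and the parameters `max_len`, `min_freq`, and `min_entropy` are hypothetical, and the named entity step and the jieba integration are omitted.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbors):
    """Shannon entropy (in nats) of a Counter of neighbouring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def candidate_new_words(text, max_len=4, min_freq=2, min_entropy=0.5):
    """Return {candidate: (freq, left_entropy, right_entropy)}.

    A character n-gram (2 <= n <= max_len) is kept only if it occurs at
    least min_freq times AND both its left and right adjacency entropies
    reach min_entropy. All thresholds are illustrative assumptions.
    """
    freq = Counter()
    left = defaultdict(Counter)   # characters seen just before each n-gram
    right = defaultdict(Counter)  # characters seen just after each n-gram
    n_chars = len(text)

    for n in range(2, max_len + 1):
        for i in range(n_chars - n + 1):
            gram = text[i:i + n]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < n_chars:
                right[gram][text[i + n]] += 1

    result = {}
    for gram, count in freq.items():
        if count < min_freq:      # N-gram frequency cut
            continue
        left_h = branch_entropy(left[gram])
        right_h = branch_entropy(right[gram])
        # A string that appears in varied contexts on both sides is more
        # likely to be a free-standing word than a fragment of a longer one.
        if min(left_h, right_h) >= min_entropy:
            result[gram] = (count, left_h, right_h)
    return result
```

Under these assumptions, a candidate that always appears with the same neighbour on one side has an adjacency entropy of zero on that side and is rejected as a likely fragment. Accepted candidates could then be registered with jieba, for example through `jieba.add_word`, before segmenting the corpus; how the thesis builds and weights its new-word dictionary is not stated in the abstract.
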
Keywords/Search Tags: parallel corpus, web crawler, sentence word alignment, Chinese word segmentation, machine translation