Chinese And Vietnamese Bilingual Corpus Construction Based On Python

Posted on: 2019-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2428330548973468
Subject: Computer technology
Abstract/Summary:
With the implementation of China's "Belt and Road" initiative, economic and trade cooperation between China and neighboring Southeast Asian countries has become increasingly frequent. On the technical side, solving the cross-lingual communication problems posed by the low-resource languages of Southeast and South Asia has become a research hotspot. Bilingual corpus construction is the foundation of machine translation, cross-language information retrieval, and text analysis, and has long been a focus of natural language processing (NLP) research. In recent years, with the development of the Internet and of NLP technology, bilingual corpus construction techniques have also advanced, and Internet-based acquisition and processing has become the main construction method. For low-resource languages, however, resources are scarce, so acquisition and construction techniques must take full account of the language's characteristics and must effectively mine and use the limited resources in order to build a higher-quality corpus.

This paper takes Vietnamese as the target language and data mining and analysis techniques as the point of entry. Using an approach based on cross-language information retrieval (CLIR), it designs and constructs a Sino-Vietnamese bilingual corpus, building on research into four key techniques: source corpus collection and processing, keyword extraction, target corpus acquisition, and cross-language text similarity analysis. The main contents are as follows:

1. Source-language corpus acquisition and processing: The characteristics and structure of the web pages are analyzed, corresponding Python crawlers are written, and the daily news from the website of China's Ministry of Commerce is downloaded as the source-language corpus. String replacement and slicing operations are then used to remove redundant information, and the headline and body of each news item are saved.
2. Keyword extraction: Building on the TF-IDF algorithm, multiple feature factors are introduced into the weight calculation, and the most significant words are output as keywords, improving the accuracy of keyword extraction.
3. Target corpus acquisition: The extracted keywords are translated into target-language query terms with online translation tools, in preparation for the subsequent retrieval. To make full use of available resources, a Vietnamese news website whose content is similar to that of the Ministry of Commerce website is identified, and the structural differences between the two are analyzed. A crawler is then written, according to the features of the Vietnamese news website, to acquire the target corpus.
4. Cross-language text similarity analysis: To improve the efficiency of analysis and retrieval, target documents that contain more of the query terms are output in a randomized manner to complete the retrieval. Machine translation is also adopted, and an LSI (latent semantic indexing) model is introduced to handle text semantics; a TF-IDF model is then used to compute monolingual similarity.

The significance of this work lies in making full use of existing web resources to acquire valuable information and to construct a Sino-Vietnamese bilingual corpus. Experiments show that the method helps improve keyword extraction and optimizes the similarity calculation, thereby improving the quality of the Sino-Vietnamese bilingual corpus. The work is based on research into, and functional extension of, the Python language and related open-source toolkits; it addresses limitations of Python in NLP for low-resource languages and provides a reference for the use of Python in constructing bilingual corpora for such languages.
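
As an illustration of the keyword extraction and monolingual similarity steps described above, the following is a minimal Python sketch, not the thesis's actual implementation. It assumes the documents are already tokenized, uses a smoothed IDF, and stands in a single hypothetical feature factor (a title-word boost, `title_boost`) for the multiple feature factors that the abstract mentions without specifying. The LSI step is omitted, and all function names are illustrative.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each token, how many documents contain it."""
    return Counter(tok for doc in docs for tok in set(doc))


def tfidf_vector(doc, df, n_docs):
    """Map each token of one tokenized document to its TF-IDF weight."""
    tf = Counter(doc)
    return {
        # Term frequency times a smoothed inverse document frequency.
        tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
        for tok, count in tf.items()
    }


def extract_keywords(doc, df, n_docs, title_tokens=(), title_boost=1.5, top_k=3):
    """Return the top_k tokens by TF-IDF weight, boosting title words.

    The title boost is a hypothetical example of a feature factor;
    the abstract does not say which factors the thesis uses.
    """
    weights = tfidf_vector(doc, df, n_docs)
    for tok in weights:
        if tok in title_tokens:
            weights[tok] *= title_boost
    # Sort by descending weight; break ties alphabetically for determinism.
    return sorted(weights, key=lambda t: (-weights[t], t))[:top_k]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse TF-IDF vectors (dicts)."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus of pre-tokenized documents (made up for illustration).
docs = [
    ["trade", "export", "tariff", "trade"],
    ["trade", "visit", "minister"],
    ["trade", "culture", "festival"],
]
df = document_frequencies(docs)
keywords = extract_keywords(docs[0], df, len(docs), title_tokens={"tariff"}, top_k=2)
similarity = cosine_similarity(
    tfidf_vector(docs[0], df, len(docs)),
    tfidf_vector(docs[1], df, len(docs)),
)
```

In a cross-language setting such as the one the abstract describes, the similarity function would be applied only after one side has been machine-translated, so that both vectors share a vocabulary.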
Keywords/Search Tags: Python, comparable corpus, TFIDF, text similarity analysis