Construction And Research Of A Chinese-Uyghur Bilingual Comparable Corpus Automatic Acquisition System Based On Machine Translation

Posted on: 2018-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: F Peng
Full Text: PDF
GTID: 2348330533456559
Subject: Computer technology
Abstract/Summary:
Comparable corpora have attracted the attention of many scholars in natural language processing, since they provide basic resources for machine translation, cross-language information retrieval, search engines and other applications. The growth of information on the Internet has made corpus collection far more convenient, and corpora keep growing in scale. Topics such as translation equivalence, the translation of technical terms and the translation of new words are also gradually attracting researchers' attention. Compared with parallel corpora, comparable corpora are large in scale, timely and broad in content, and they are widely used as an information resource. The construction of comparable corpora is therefore very important.

To meet the demand for Chinese-Uyghur comparable corpora, and based on an analysis of existing research, this thesis proposes an automatic acquisition system for Chinese-Uyghur comparable corpora. The system designs and implements functions at four levels. First, to handle the diversity of web content, a content acquisition program is designed for Chinese-Uyghur bilingual websites in Xinjiang, and a machine translation system is used to translate the Uyghur text. Second, the acquired web corpus is preprocessed with stop-word removal, word segmentation and similar steps; named entity recognition and part-of-speech analysis are then used to score the texts and filter out those with low scores. Third, a neural network model is used to classify the texts, which makes the subsequent similarity calculation more effective. Fourth, similarity calculation and indexing of the Chinese-Uyghur texts are carried out using keyword extraction and latent semantic analysis, which yields the Chinese-Uyghur comparable corpus.

The system can be used to obtain higher-quality Chinese-Uyghur comparable corpora from the web. Compared with the traditional method of calculating text similarity based on topic, the system has advantages in screening comparable documents, and the combination of keyword extraction and latent semantic analysis improves the quality of the resulting corpus. The proposed method shows good performance, good coverage and high quality, and is suitable for building cross-language comparable corpus acquisition systems.
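
The document-screening step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the Uyghur documents have already been machine-translated and tokenized into the same vocabulary as the Chinese ones, uses TF-IDF as a stand-in keyword extractor, and compares keyword vectors by cosine similarity. The latent semantic analysis step is omitted, and the function names, `k` and `threshold` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def top_keywords(vector, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def comparable_pairs(chinese_docs, translated_docs, k=5, threshold=0.3):
    """Return (i, j, score) for document pairs whose keyword similarity
    reaches the threshold. Both inputs are lists of token lists; the
    second list is assumed to be machine-translated Uyghur text."""
    vectors = tfidf_vectors(chinese_docs + translated_docs)
    split = len(chinese_docs)
    zh = [top_keywords(v, k) for v in vectors[:split]]
    ug = [top_keywords(v, k) for v in vectors[split:]]
    pairs = []
    for i, a in enumerate(zh):
        for j, b in enumerate(ug):
            score = cosine(a, b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    chinese = [["xinjiang", "tourism", "travel", "scenery"],
               ["football", "match", "goal", "team"]]
    translated = [["football", "team", "win", "match"],
                  ["tourism", "xinjiang", "hotel", "travel"]]
    for i, j, score in comparable_pairs(chinese, translated, k=3):
        print(i, j, round(score, 3))
```

Restricting each document to its top keywords before comparison keeps the vectors short and reduces the influence of incidental vocabulary. A fuller pipeline along the lines of the abstract would project these vectors into a latent semantic space before computing similarity.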
Keywords/Search Tags:Comparable corpus, Chinese-Uyghur bilingual Corpora construction, Named entity recognition, Document similarity