A Study On The Method Of Constructing Bilingual Corpus In Chinese And

Posted on: 2017-03-01; Degree: Master; Type: Thesis
Country: China; Candidate: X H Liu; Full Text: PDF
GTID: 2278330488465650; Subject: Computer technology
Abstract/Summary:
Comparable corpora have long been an active and difficult research topic in natural language processing. They are a fundamental resource for statistical machine translation and cross-language information retrieval. In recent years, as China has opened further to the outside world, and to Southeast Asia in particular, political, cultural, and economic exchanges between China and Cambodia have become increasingly frequent, while the language barrier has become an obstacle to further cooperation between the two countries. Under these circumstances, Khmer-Chinese language information processing is becoming more and more important. Research on Khmer, a language used by 14 million people, is still in its infancy, and very little work has been done on Khmer-Chinese bilingual processing. Because web content formats are diverse and complex and few bilingual Khmer-Chinese websites exist, it is difficult to obtain a large, high-quality parallel corpus from the web. Building a large-scale parallel corpus by hand, on the other hand, requires experts who know both Khmer and Chinese, which is time-consuming and costly. Bilingual comparable corpora, in contrast, are abundant, and named entities, bilingual terminology, parallel sentences, and other translation knowledge can be extracted from them. Nor do they lag behind parallel corpora in timeliness.
Bilingual corpora are an important basic resource with a wide range of applications, including cross-language information retrieval, cross-language sentiment analysis, statistical machine translation, and the extraction of translation knowledge for lexicography and related work. To address the problems above, and building on a study of existing research, this thesis investigates methods for constructing a Chinese-Khmer comparable corpus and makes the following contributions:
1) Keyword Extraction for Web News Documents Based on LM-BP Neural Network
To meet practical needs, the thesis proposes a keyword extraction method for web news documents that trains a BP (back-propagation) artificial neural network with an improved LM (Levenberg-Marquardt) algorithm. First, web news documents with a consistent HTML format are preprocessed: noise filtering, web content extraction, word segmentation, POS tagging, stop-word removal, and so on. Next, effective features such as term frequency (TF) and word position are selected according to the characteristics of news documents, and these features are used to build and train the BP neural network. Finally, keywords are extracted with the LM algorithm, which adjusts its parameters during training and addresses two weaknesses of standard BP, long training time and a tendency to get stuck in local minima, thereby improving convergence speed and keyword classification performance. In keyword extraction, the LM algorithm achieves better results and converges better than standard BP.
2) Khmer-Chinese Bilingual LDA Topic Model Based on Dictionary
Multilingual probabilistic topic models are widely used for topic mining in multilingual documents. This thesis proposes KCB-LDA (Khmer-Chinese Bilingual Latent Dirichlet Allocation), a method based on a bilingual dictionary. Using the bilingual entries in the dictionary, the method first maps words that express the same meaning to a concept abstraction layer and then groups the concepts into a shared topic space, so that documents in different languages share the same latent topics. By introducing the concept abstraction layer, the same topic can be represented jointly in Chinese and Khmer for a given bilingual corpus. Experimental results show that KCB-LDA mines topics much better than MixLDA, a monolingual LDA model trained on the concatenated documents of aligned document pairs.
3) Comparable Corpus Acquisition Method Based on Hierarchical Text Clustering
With the Khmer-Chinese bilingual LDA model, each text can be modeled to obtain its topic probability distribution. Traditional clustering methods rely entirely on word frequency and other surface features; by contrast, this method brings semantic relations into the clustering analysis by combining text modeling with text clustering. First, the JS distance between the topic probability distributions of texts is computed, and its reciprocal is fused with several other features to calculate text similarity. Then the bilingual texts are clustered with an improved hierarchical clustering algorithm, so that each resulting cluster contains texts with similar content and themes and high mutual similarity. Finally, a comparable corpus is obtained from each cluster. Experimental comparison with a method that calculates text similarity from a bilingual dictionary shows that the proposed method is effective.
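The similarity and clustering steps of contribution 3) can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several points are assumptions made only for illustration: the JS distance is taken as the square root of the base-2 Jensen-Shannon divergence, the "reciprocal" is written as 1 / (1 + d) so that identical texts do not cause a division by zero, only topic distributions are used (the additional fused features are not specified in the abstract), and plain average-linkage agglomerative clustering with a hypothetical stopping threshold stands in for the unspecified improved hierarchical algorithm. The document ids and topic distributions are invented toy values.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # clamp rounding noise below zero


def similarity(p, q):
    """Assumed reciprocal form: 1 / (1 + JS distance), a value in (0, 1]."""
    return 1.0 / (1.0 + js_distance(p, q))


def cluster(docs, threshold):
    """Average-linkage agglomerative clustering over topic distributions.

    docs maps a document id to its topic distribution. The two most similar
    clusters are merged repeatedly until the best average similarity falls
    below `threshold` (a hypothetical parameter).
    """
    clusters = [[doc_id] for doc_id in docs]

    def avg_sim(a, b):
        total = sum(similarity(docs[x], docs[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy topic distributions over three shared topics (invented values).
docs = {
    "zh_1": [0.8, 0.1, 0.1],
    "km_1": [0.7, 0.2, 0.1],
    "zh_2": [0.1, 0.1, 0.8],
    "km_2": [0.1, 0.2, 0.7],
}
clusters = cluster(docs, threshold=0.8)
print(clusters)
```

In this toy setting each resulting cluster groups a Chinese and a Khmer text with close topic distributions, which is the sense in which a cluster yields comparable documents. The quadratic search for the best pair is kept for readability and would need a more efficient linkage implementation on a real corpus.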
Keywords/Search Tags:Comparable Corpus, Keywords Extraction for Web News, Bilingual Latent Dirichlet Allocation, Cross-Language Text Similarity, Hierarchical Text Clustering