
The Research On Automatic Word Alignment Extraction For Greater China Region

Posted on: 2017-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: X F Xu
Full Text: PDF
GTID: 2348330485977092
Subject: Computer Science and Technology
Abstract/Summary:
Because of cultural and geographical differences, the written forms and expression habits of Chinese differ across mainland China, Hong Kong, and Taiwan, collectively known as the Greater China Region (GCR). Morphologically, Hong Kong and Taiwan use traditional characters, while mainland China uses simplified characters. Semantically, the same meaning is often expressed by different characters or phrases in different regions (allo-graphic synonyms). Automatically extracting allo-graphic synonyms in the GCR by computer can, on the one hand, enrich linguists' understanding of the differences in the Chinese language across the GCR and, on the other hand, help companies, governments, and other organizations better understand text from the GCR, thus providing an important basis for decision making.

At present, mainstream word alignment data and computational models are bilingual: they are primarily concerned with word alignment between two different languages, such as Chinese and English, Chinese and Japanese, or Japanese and English. Research on closely related languages, such as dialects or language varieties, is rare. This paper therefore studies the automatic extraction of word alignments among the similar language varieties of the GCR.

The author first crawled GCR parallel data from Wikipedia and a news website with simplified and traditional encodings, and used pre-processing to extract valid GCR parallel sentence pairs that contain differing characters. Two senior graduate students majoring in computational linguistics then manually annotated and aligned the data, with an annotation agreement of more than 95%, forming the standard corpus for the subsequent computational models of word alignment.

In addition, this paper proposes two computational models for the automatic extraction of GCR word alignments. The first is a two-phase GCR word alignment model that measures the cosine similarity between word2vec representations of GCR words and then applies post-processing techniques that integrate word mapping rules. The second is a model based on word alignment mapping rules: drawing on the expression features of GCR Chinese sentences, it first filters out a portion of the words using the longest common subsequence and then extracts the GCR word alignments with 1-1, 1-n, and m-n word mapping rules. Experiments with both methods on the annotated GCR corpus show that the two proposed models outperform the current GIZA++ and HMM-based models.

The contributions of this paper fall mainly into the following two parts.

1) First, the large-scale, highly consistent GCR word alignment corpus enriches the resources available for GCR word alignment. The corpus supplies abundant material not only for research on computational models of GCR word alignment, but also for linguistic research on GCR phrases, sentences, paragraphs, and texts.

2) Second, the two automatic word alignment extraction methods, based on word2vec and on mapping rules, take full account of the features of similar languages: they use the longest common subsequence to pre-filter word alignments, and employ word representations and the 1-1, 1-n, and m-n mapping rules to extract the GCR word alignments. Compared with previous methods (the HMM model, GIZA++, and their extensions), the proposed approach proved effective in experiments on the GCR corpus, with an improvement of 2%-3% in the recognition of GCR word alignments.

Overall, this paper presents a detailed study of the construction of the GCR word alignment corpus and of the computational models, proposes solutions to the relevant problems, and designs the corresponding algorithms and experiments. The experimental results indicate that the proposed approaches help improve GCR word alignment recognition and reduce the dependence on large-scale training corpora. They also lay a foundation for further research on GCR word alignment and offer a reference for similar studies.
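
The pre-filtering and similarity steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and it rests on several assumptions. Both sentences are assumed to be already word-segmented and converted to one script. The function names (lcs_pairs, cosine, align_residual), the 0.5 threshold, and the three-dimensional vectors are hypothetical, and the vectors are hand-made stand-ins rather than real word2vec output. Only 1-1 alignments are handled; the 1-n and m-n mapping rules are not modelled.

```python
import math


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def align_residual(src, tgt, vectors, threshold=0.5):
    """Drop words shared via the LCS, then pair the leftovers 1-1 by similarity."""
    matched = lcs_pairs(src, tgt)
    used_src = {i for i, _ in matched}
    used_tgt = {j for _, j in matched}
    rest_src = [w for i, w in enumerate(src) if i not in used_src]
    rest_tgt = [w for j, w in enumerate(tgt) if j not in used_tgt]
    alignments = []
    for s in rest_src:
        best, best_score = None, threshold
        for t in rest_tgt:
            if s in vectors and t in vectors:
                score = cosine(vectors[s], vectors[t])
                if score > best_score:
                    best, best_score = t, score
        if best is not None:
            alignments.append((s, best))
            rest_tgt.remove(best)  # each target word is used at most once
    return alignments


# Toy data: a mainland sentence and a Taiwan sentence (converted to simplified).
mainland = ["他", "坐", "出租车", "去", "买", "软件"]
taiwan = ["他", "坐", "计程车", "去", "买", "软体"]
vectors = {  # hypothetical embeddings, not trained word2vec vectors
    "出租车": [0.1, 0.9, 0.2],
    "计程车": [0.2, 0.8, 0.1],
    "软件": [0.9, 0.1, 0.3],
    "软体": [0.8, 0.2, 0.3],
}
print(align_residual(mainland, taiwan, vectors))
# prints [('出租车', '计程车'), ('软件', '软体')]
```

In this toy case the longest common subsequence removes the four words the two sentences share, so only the region-specific words reach the similarity step. With real data the vectors would come from a trained embedding model, and the greedy pairing would likely need the post-processing rules the abstract mentions.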
Keywords/Search Tags: The Greater China Region (GCR), Word alignment, word2vec, Longest common subsequence, Mapping rules, Parallel corpus