Multi-word Expression Extraction Based On Chinese-English Bilingual Corpus

Posted on: 2012-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Hu
Full Text: PDF
GTID: 2178330335960185
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
A multi-word expression (MWE) is a relatively complete semantic unit made up of several syntactically and semantically associated words; it extends beyond the boundary of a single word while carrying a relatively complete meaning of its own. As natural language processing has developed, MWEs have become a research hotspot in the field. However, the languages studied are mostly Indo-European. Because Chinese has no natural word delimiters and its word boundaries are vague, research on Chinese MWEs has concentrated on discovering word combinations with specific structures. Studying Chinese MWEs with an aligned bilingual corpus makes it possible to determine the boundaries of integral semantic units in Chinese more clearly, by using the natural word boundaries of an Indo-European language.

Based on these considerations, this thesis proposes a pattern-independent method for extracting Chinese MWEs from a Chinese-English bilingual corpus. The results show that the method achieves good extraction results even on a relatively small corpus.

The method consists of two phases. The first phase extracts Chinese MWE candidates from the bilingual corpus. Candidates are obtained from many-to-one correspondences from Chinese to English, which exploit the relatively clear and complete word-boundary information in English. Since this phase uses only the information of the corresponding lexemes, it imposes few constraints on the structure of the candidates. After the candidates are extracted, the second phase obtains the final Chinese MWEs by filtering them with a variety of techniques. First, rules based on adjacency relationships, word count and other information are used to filter out noise; then statistical measures, namely Mutual Information (MI), the t-value and the Log Likelihood Ratio (LLR), are applied for further filtering.
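
The two phases can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the alignment format (a list of `(chinese_index, english_index)` pairs) and the restriction to two-word candidates are assumptions made for illustration, and the scores use the standard textbook forms of MI, the t-value and LLR, which the abstract names but does not define.

```python
import math
from collections import defaultdict


def extract_candidates(zh_tokens, alignment):
    """Collect Chinese tokens that align many-to-one to an English token.

    alignment is a hypothetical format: (zh_index, en_index) pairs.
    Returns one tuple of Chinese tokens per English token that has
    two or more aligned Chinese tokens.
    """
    by_en = defaultdict(list)
    for zh_i, en_i in alignment:
        by_en[en_i].append(zh_i)
    candidates = []
    for en_i in sorted(by_en):
        zh_idx = sorted(set(by_en[en_i]))
        if len(zh_idx) >= 2:  # many Chinese tokens -> one English token
            candidates.append(tuple(zh_tokens[i] for i in zh_idx))
    return candidates


def association_scores(c12, c1, c2, n):
    """Return (MI, t-value, LLR) for a two-word candidate.

    c12: count of the word pair; c1, c2: counts of each word;
    n: total number of pairs observed in the corpus.
    """
    if c12 <= 0:
        raise ValueError("the pair must occur at least once")
    expected = c1 * c2 / n  # pair count expected under independence
    mi = math.log2(c12 / expected)
    t_value = (c12 - expected) / math.sqrt(c12)

    # LLR over the 2x2 contingency table of observed counts.
    rows = (c1, n - c1)
    cols = (c2, n - c2)
    observed = ((c12, c1 - c12), (c2 - c12, n - c1 - c2 + c12))
    llr = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                llr += o * math.log(o * n / (rows[i] * cols[j]))
    return mi, t_value, 2 * llr
```

In such a setup, candidates that survive the rule-based filter would be ranked or thresholded by these scores; the thresholds themselves are not given in the abstract and would have to be tuned on the corpus.
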
Keywords/Search Tags: Multi-word Expression, Chinese-English bilingual corpus, independent pattern, word alignment, statistical methods