Multiword Expressions: Extraction And Applications

Posted on: 2008-12-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Duan
GTID: 1118360242976146
Subject: Computer software and theory
Abstract/Summary:
The NLP community has become increasingly aware of the problems that multiword expressions (MWEs) pose. MWEs are expressions with special meanings that cannot be derived from their component words. A typical natural language system treats each word as a lexical unit, but this treatment breaks down for MWEs because their idiosyncratic interpretations cross word boundaries. The identification and application of MWEs have therefore been a major concern for researchers in this area, and MWEs are widely regarded as "a pain in the neck".

This dissertation focuses on MWE extraction and its applications. Targeting the features of monolingual and bilingual MWEs, the author proposes a set of approaches for extracting flexible MWEs. The approaches are inspired by gene sequence alignment in bioinformatics and combine the characteristics of natural language with machine learning methods. As one application, MWEs are used as knowledge resources to improve the efficiency of word sense disambiguation through the interaction between resources and algorithms. Another application is automatic conceptual graph indexing, which builds on term extraction techniques. The original contributions are as follows:

1. The author proposes Multiple Sequence Alignment (MSA) for MWE extraction, motivated by gene recognition, because textual sequences resemble gene sequences in pattern analysis. The MSA technique is combined with error-driven rules and is more efficient than traditional methods. First, it provides a guarantee on MWE recall. Second, it uses dynamic programming to keep the candidates from exploding combinatorially, and it yields a global solution to pattern extraction rather than redundant sub-patterns. Consequently, it measures flexible patterns accurately. These advantages are also confirmed experimentally.

2. The author implements a hybrid model for bilingual multiword expression extraction that employs both statistical and rule-based methods. Extraction proceeds in two phases. In the first phase, a large number of candidates are extracted from the corpus by statistical methods; the multiple sequence alignment algorithm is sensitive to flexible multiword expressions. In the second phase, error-driven rules and patterns are extracted from the corpus. To acquire high-quality instances, manual annotation with active learning is used for sample selection. The trained rules are then used to filter the candidates. Bilingual comparisons are also made on a parallel corpus, and some of the bilingual syntactic patterns are obtained from a bilingual phrase dictionary. Because the system has many parameters, a series of experiments is designed to find the best-performing configuration. Experimental results show that the approach performs well.

3. The author adopts a new word sense disambiguation method called Multi-engine Collaborative Bootstrapping (MCB), which uses collocations, a special kind of MWE, as its knowledge resource. The model combines different types of corpora and uses two languages for bootstrapping. MCB uses bilingual bootstrapping as its core algorithm, which leads to incremental knowledge acquisition. The EM model is applied to train the parameters of a base learner, and the feature translation model is improved by semantic correlation estimation. In addition, the author uses multi-engine selection to produce high-quality starting seeds from parallel and monolingual corpora. Although these seeds are generated by unsupervised machine learning approaches and are selected by a different mechanism from manually chosen seeds, they also ensure effective bootstrapping. Experimental results demonstrate the effectiveness of MCB. Because the EM algorithm is sensitive to starting values, the experiments examine several factors, including the feature space and the number of starting seeds. Resource limitations are also considered.

4. The author introduces Conceptual Graph (CG) based indexing for book summaries, which relies on term recognition technology. CG-based indexing is a kind of deep semantic indexing that integrates isolated keywords into a single unit of meaning. First, the CGs are indexed manually to gain experience with the task. The next step is to search for a suitable approach to automatic indexing. Term recognition and automatic relation extraction are the foundation of CG-based indexing. The experiments show some progress, but because the CG-based indexing task integrates many core language technologies, further research is still needed.
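To make the alignment idea in contribution 1 more concrete, the following is a minimal sketch, not the dissertation's method. It aligns only two token sequences (a pairwise simplification of multiple sequence alignment) using a longest-common-subsequence table filled by dynamic programming, then reads off the shared tokens with a gap marker where the sentences differ. The function name `align_pattern`, the `*` gap marker, and the example sentences are all assumptions made for illustration.

```python
def align_pattern(a, b, gap="*"):
    """Align two token lists and return their shared pattern.

    Illustrative pairwise stand-in for multiple sequence alignment:
    tokens common to both sequences are kept in order, and a gap
    marker is inserted wherever unaligned tokens sit between them.
    """
    n, m = len(a), len(b)
    # lcs[i][j] holds the length of the longest common subsequence
    # of the suffixes a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pattern = []
    i = j = 0
    skipped = False  # True if tokens were skipped since the last match
    while i < n and j < m:
        if a[i] == b[j]:
            # Mark a gap only between matched tokens, never at the edges.
            if skipped and pattern:
                pattern.append(gap)
            pattern.append(a[i])
            skipped = False
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
            skipped = True
        else:
            j += 1
            skipped = True
    return pattern


s1 = "he took the matter into account".split()
s2 = "she took everything into account".split()
print(align_pattern(s1, s2))  # ['took', '*', 'into', 'account']
```

The table has one cell per pair of token positions, so the cost grows with the product of the two sentence lengths instead of with the number of possible sub-patterns, which is the sense in which dynamic programming avoids combinatorial explosion. A real system would align many sentences at once and score gaps and mismatches, neither of which this sketch attempts.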
Keywords/Search Tags: Multiword expression, Multiple sequence alignment, Word sense disambiguation, Conceptual graph, Machine learning