
Research On Collocation Extraction And Its Application In Information Retrieval

Posted on: 2011-03-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J F Lin
Full Text: PDF
GTID: 1118360332956382
Subject: Computer application technology
Abstract/Summary:
As an important linguistic resource, collocations capture significant relations between words and are an indispensable kind of knowledge for further progress in natural language processing (NLP). Automatic monolingual and bilingual collocation extraction is important not only for linguistic ontology research but also for many NLP applications, such as machine translation, information retrieval (IR), and cross-language information retrieval (CLIR).

Information extraction (IE) is a fundamental technique for automatically obtaining information from text. Collocation extraction is a form of relation extraction and is therefore one task of IE. Many statistical machine learning models have been applied successfully in IE; this dissertation studies how to use these models to better extract monolingual and bilingual collocations. Term collocation relationships that are significant both linguistically and statistically are then applied to query expansion in IR and to query translation in CLIR.

This thesis is arranged as follows:

1. Collocation extraction combining multiple statistical association measures and classifiers. Traditional collocation extraction approaches rely on a single statistical measure and may be suboptimal because they cannot take advantage of multiple measures. In this thesis, we propose a logistic linear regression model that combines five classical lexical association measures: co-occurrence frequency, mutual information, the χ² test, the t-test, and the log-likelihood ratio. In addition, we submit the extracted candidate collocation pairs to the Google search engine and use the resulting page counts to simulate their frequencies in the corpora. For some special collocation relationships, we also propose a classifier fusion method. Three kinds of classifiers (support vector machines, conditional random fields, and maximum entropy) are combined using two different ensemble strategies: majority voting and weighted probability. Experiments show that the classifier fusion method based on meta-learning can take advantage of more useful evidence.

2. Bilingual collocation acquisition. Despite the great differences between English and Chinese, dependency relations correspond strongly in translation between the two languages. This thesis puts forward a new bilingual collocation translation model built on the statistical machine translation model. The statistical translation model and the target language model are trained separately, the former on word-aligned bilingual parallel corpora and the latter on monolingual corpora. Experiments show that the new collocation translation model can make full use of monolingual and bilingual corpora to reach an optimal compromise between precision and recall.

3. Query expansion based on collocation relationships. Unlike previous methods, which are based on either WordNet or co-occurrence relations, we select term collocation relationships that are significant both linguistically and statistically for query expansion. Another difference is that term collocation relationships are used to expand the query model rather than the document model. We also combine important term collocation relationships with local relevance feedback documents to further improve performance. Our experiments on three TREC collections show that this new type of collocation relation performs much better than traditional query expansion methods.

4. Query translation based on bilingual collocation translation. Bilingual dictionary-based approaches to query translation have been the mainstream methods in CLIR. However, they face two main problems: translation ambiguity and the incompleteness of the dictionary. This thesis presents two statistical models that focus on resolving query translation ambiguities. First, we extend the basic co-occurrence model by adding a decaying factor that decreases the mutual information as the distance between the terms increases. Second, we incorporate the bilingual collocation translation model, in which syntactic dependency relations (represented as triples) are integrated. For the translation of out-of-vocabulary (OOV) terms, we also propose an approach that uses web feedback information. We evaluate our methods on two TREC benchmark collections; the experiments show that each model obtains significant improvements over traditional dictionary-based approaches.
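
The five lexical association measures named in item 1 can be illustrated with a minimal Python sketch. It uses the standard textbook definitions computed from a 2×2 contingency table of corpus counts, and it is not taken from the dissertation. The function names `association_measures` and `logistic_score`, their parameters, and the use of the pointwise form of mutual information are assumptions made for illustration. The abstract does not give the weights of the logistic model, so they are left as inputs that would have to be learned from labeled collocation pairs.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Score a word pair (x, y) with five classical association measures.

    f_xy: co-occurrence count of the pair; f_x, f_y: counts of each word;
    n: total number of pairs in the corpus. All names are illustrative.
    """
    # Observed 2x2 contingency table.
    o11 = f_xy                    # x with y
    o12 = f_x - f_xy              # x without y
    o21 = f_y - f_xy              # y without x
    o22 = n - f_x - f_y + f_xy    # neither x nor y
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22

    # Expected co-occurrence count if x and y were independent.
    expected = f_x * f_y / n

    # Pointwise mutual information (base 2).
    pmi = math.log2(f_xy / expected) if f_xy > 0 else float("-inf")

    # t-score, using the common approximation variance ~ f_xy.
    t_score = (f_xy - expected) / math.sqrt(f_xy) if f_xy > 0 else 0.0

    # Pearson chi-square statistic for a 2x2 table.
    denom = row1 * row2 * col1 * col2
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

    # Log-likelihood ratio: 2 * sum(O * ln(O / E)), skipping empty cells.
    llr = 0.0
    for obs, row, col in ((o11, row1, col1), (o12, row1, col2),
                          (o21, row2, col1), (o22, row2, col2)):
        if obs > 0:
            llr += obs * math.log(obs * n / (row * col))
    llr *= 2.0

    return {"freq": f_xy, "pmi": pmi, "t": t_score, "chi2": chi2, "llr": llr}


def logistic_score(features, weights, bias=0.0):
    """Combine feature values into a probability with a logistic function.

    The weights and bias are hypothetical inputs; in practice they would be
    fit on labeled data, and the features would usually be rescaled first.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A caller would compute the five scores for each candidate pair, pass them in a fixed order to `logistic_score`, and keep pairs whose probability exceeds a chosen threshold. Page counts from a search engine could be substituted for the corpus counts, as the abstract describes, although the total `n` would then have to be estimated.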
Keywords/Search Tags: collocation, statistics fusion, classifiers fusion, statistical machine translation, query expansion, query translation