A Comparative Analysis Of Approaches To Automatic Collocation Extraction

Posted on: 2012-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhu
Full Text: PDF
GTID: 2155330335959527
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
Collocations are an important resource for second language learning and for many natural language processing tasks, but extracting them automatically from a corpus remains a well-known problem. Corpus-based extraction of collocations is typically carried out with a statistical measure that indicates whether or not two words occur together more often than would be expected by chance. In practice, however, linguists often pick one of these measures arbitrarily, without regard to the size or the category of the corpus, which can lead to poor extraction results. This study evaluates the extraction efficiency of four algorithms (mutual information, the chi-square test, the t-test and the log-likelihood ratio). Specifically, it addresses the following two questions:

(1) For corpora of the same size but of different categories, is there any difference in the extraction efficiency of the four algorithms?
(2) For corpora of the same category but of different sizes, is there any difference in the extraction efficiency of the four algorithms?

The results reveal that:

(1) For corpora of the same size (two million words), the overall best result was achieved by mutual information for the academic corpus and the press corpus, while for the fiction corpus the log-likelihood ratio performed best.
(2) For corpora of the same category, the overall best results were achieved by the log-likelihood ratio when the press corpus was smaller than one million words; for press corpora larger than one million words, the overall best results were achieved by mutual information.
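The abstract does not spell out how the four measures are computed, so the following is only a minimal sketch based on their common textbook formulations for a two-word candidate, not the thesis's own implementation. The function names and the argument convention (`c12` for the bigram count, `c1` and `c2` for the counts of the first and second word, `n` for the total number of bigrams) are assumptions made for illustration, as are the counts in the usage lines.

```python
import math


def expected_count(c1, c2, n):
    """Bigram count expected if the two words were independent."""
    return c1 * c2 / n


def mutual_information(c12, c1, c2, n):
    """Pointwise mutual information in bits; requires c12 > 0."""
    return math.log2(c12 / expected_count(c1, c2, n))


def t_score(c12, c1, c2, n):
    """t-test approximation (observed - expected) / sqrt(observed); requires c12 > 0."""
    return (c12 - expected_count(c1, c2, n)) / math.sqrt(c12)


def contingency(c12, c1, c2, n):
    """2x2 table: rows = first word present/absent, columns = second word present/absent."""
    return [[c12, c1 - c12], [c2 - c12, n - c1 - c2 + c12]]


def chi_square(c12, c1, c2, n):
    """Pearson chi-square statistic for the 2x2 table (no continuity correction)."""
    (o11, o12), (o21, o22) = contingency(c12, c1, c2, n)
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


def log_likelihood(c12, c1, c2, n):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    observed = contingency(c12, c1, c2, n)
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with zero observations contributes nothing
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2 * g2


# Illustrative counts only (not taken from the thesis).
counts = dict(c12=8, c1=15828, c2=4675, n=14307668)
for measure in (mutual_information, t_score, chi_square, log_likelihood):
    print(measure.__name__, measure(**counts))
```

Ranking all candidate pairs in a corpus by each score and comparing the top of each ranking is one plausible way to set up the kind of comparison the abstract describes; the evaluation procedure actually used in the thesis is not stated here.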
Keywords/Search Tags: collocation extraction, algorithms, corpus