Optimization and effectiveness of N-grams approach for indexing and retrieval in Arabic information retrieval systems

Posted on: 2003-04-11
Degree: Ph.D
Type: Dissertation
University: University of Pittsburgh
Candidate: AlShehri, Abdullah Mohammed
Full Text: PDF
GTID: 1468390011485011
Subject: Computer Science
Abstract/Summary:
This dissertation explores a valid alternative to the word method for indexing and retrieving Arabic documents based on a finite number of word fragments, called n-grams. Based on the statistical properties of n-grams in six Arabic corpora, an optimization function of the size of n-grams is developed in terms of the entropy and computational cost of n-grams. Two n-gram-based methods for indexing and retrieving Arabic documents are developed: optimal n-grams and extended optimal n-grams. The former uses the optimal n-gram size, and the latter eliminates noisy optimal n-grams and expands them to a higher order of n-grams. An experimental Arabic information retrieval system (IRS) was designed to conduct several retrieval experiments on two Arabic document collections: the SACS and the Al-Raya. The retrieval results of the n-gram-based methods are compared with those of the word method in terms of precision-recall scores. In addition, the study investigates the effects of centroid vectors on the retrieval effectiveness of the word-based indexing method and the n-gram-based indexing methods.

The dissertation concludes that the percentage of the actual occurrences of unique overlapping n-grams to the expected occurrences decreases sharply as the size of n increases. The overlapping n-grams are not distributed with equal probability but in a highly skewed fashion. The optimal n-grams are defined as 3-grams and the extended optimal n-grams are defined as e-3–5-grams. The retrieval results and statistical tests show that the 3-grams retrieval method is superior to both the word retrieval method and the e-3–5-grams retrieval method on the SACS collection, but it has problems with the number of irrelevant documents retrieved. On the Al-Raya collection, both the 3-grams and e-3–5-grams retrieval methods provide similar retrieval effectiveness, but the advantage of the latter is that it retrieves few irrelevant documents.
The centroid vectors of words, 3-grams, and e-3–5-grams reduce retrieval effectiveness on the SACS collection; on the Al-Raya collection they maintain retrieval effectiveness while reducing the average number of retrieved documents.
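
The statistical quantities mentioned in the abstract (overlapping n-grams, their entropy, and the share of possible n-grams that actually occur) can be illustrated with a minimal Python sketch. This is not the dissertation's code or its optimization function, which the abstract does not give. The function names, the toy word list, and the alphabet size of 28 are assumptions made for illustration, and "coverage" here is one plausible reading of "actual versus expected unique n-grams".

```python
import math
from collections import Counter


def overlapping_ngrams(word, n):
    """Return all overlapping character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_counts(words, n):
    """Count overlapping n-grams over a list of words."""
    counts = Counter()
    for word in words:
        counts.update(overlapping_ngrams(word, n))
    return counts


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def coverage(counts, alphabet_size, n):
    """Fraction of the alphabet_size**n possible n-grams actually observed."""
    return len(counts) / alphabet_size ** n


if __name__ == "__main__":
    # Toy word list (assumption); a real study would use full corpora.
    words = ["كتاب", "كاتب", "مكتبة", "مكتوب"]
    alphabet_size = 28  # assumed size of the basic Arabic alphabet
    for n in range(2, 6):
        counts = ngram_counts(words, n)
        print(n, len(counts), entropy_bits(counts), coverage(counts, alphabet_size, n))
```

On any real corpus the coverage value would be expected to fall quickly as n grows, since the number of possible n-grams grows exponentially while the number observed is bounded by the corpus size.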
Keywords/Search Tags: Retrieval, N-grams, Arabic, Indexing, Effectiveness, Documents, SACS, Method