Optimization and effectiveness of N-grams approach for indexing and retrieval in Arabic information retrieval systems | Posted on:2003-04-11 | Degree:Ph.D | Type:Dissertation | University:University of Pittsburgh | Candidate:AlShehri, Abdullah Mohammed | Full Text:PDF | GTID:1468390011485011 | Subject:Computer Science | Abstract/Summary: | This dissertation explores a viable alternative to the word method for indexing and retrieving Arabic documents, one based on a finite number of word fragments called n-grams. Based on the statistical properties of n-grams in six Arabic corpora, an optimization function for the size of n-grams is developed in terms of the entropy and computational cost of n-grams. Two n-gram-based methods for indexing and retrieving Arabic documents are developed: optimal n-grams and extended optimal n-grams. The former uses n-grams of the optimal size; the latter eliminates noisy optimal n-grams and expands them to higher-order n-grams. An experimental Arabic information retrieval system (IRS) was designed to conduct several retrieval experiments on two Arabic document collections: the SACS and the Al-Raya. The retrieval results of the n-gram-based methods are compared with those of the word method in terms of precision-recall scores. In addition, the study investigates the effects of centroid vectors on the retrieval effectiveness of the word-based indexing method and the n-gram-based indexing methods. The dissertation concludes that the ratio of the actual occurrences of unique overlapping n-grams to their expected occurrences decreases sharply as n increases. The overlapping n-grams are not distributed with equal probability but in a highly skewed fashion. The optimal n-grams are defined as 3-grams and the extended optimal n-grams are defined as e-3–5-grams.
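As a rough illustration of the quantities mentioned above, the following Python sketch extracts overlapping character n-grams from words, computes the Shannon entropy of their distribution, and measures what fraction of all possible n-grams actually occurs. It is only a sketch under stated assumptions: the function names (`overlapping_ngrams`, `ngram_entropy`, `ngram_coverage`) and the `alphabet_size` parameter are hypothetical, and the dissertation's actual optimization function, which also accounts for computational cost, is not given in this summary and is not reproduced here.

```python
import math
from collections import Counter


def overlapping_ngrams(word, n):
    """Return all overlapping character n-grams of `word`.

    A word shorter than n yields an empty list.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_entropy(words, n):
    """Shannon entropy (in bits) of the n-gram distribution over `words`."""
    counts = Counter(g for w in words for g in overlapping_ngrams(w, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_coverage(words, n, alphabet_size):
    """Fraction of the alphabet_size ** n possible n-grams that actually occur.

    `alphabet_size` is an assumption supplied by the caller; it is not a
    value taken from the dissertation.
    """
    unique = {g for w in words for g in overlapping_ngrams(w, n)}
    return len(unique) / alphabet_size ** n
```

For example, `overlapping_ngrams("kitab", 3)` yields the 3-grams `kit`, `ita`, and `tab`. Evaluating `ngram_coverage` for increasing `n` on a real corpus is one way to observe the sharp drop in the actual-to-expected ratio that the abstract reports.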
The retrieval results and statistical tests show that the 3-grams retrieval method outperforms the word retrieval method and the e-3–5-grams retrieval method on the SACS collection, but it has problems with the number of irrelevant documents retrieved. On the Al-Raya collection, the 3-grams and e-3–5-grams retrieval methods provide similar retrieval effectiveness, but the latter has the advantage of retrieving few irrelevant documents. The centroid vectors of words, 3-grams, and e-3–5-grams reduce retrieval effectiveness on the SACS collection, whereas on the Al-Raya collection they maintain retrieval effectiveness while reducing the average number of retrieved documents. | Keywords/Search Tags: | Retrieval, N-grams, Arabic, Indexing, Effectiveness, Documents, SACS, Method |