
Enhanced English-Arabic Cross-Language Information Retrieval

Posted on: 2009-08-09
Degree: D.Sc
Type: Dissertation
University: The George Washington University
Candidate: Amor-Tijani, Ghita
Full Text: PDF
GTID: 1448390002992184
Subject: Computer Science
Abstract/Summary:
One of the main issues facing Cross-Language Information Retrieval (CLIR) is untranslatable words, i.e., words not found in dictionaries, usually referred to as Out-Of-Vocabulary (OOV) words. Bilingual dictionaries generally do not cover most proper nouns (e.g., names of places, people, and countries), which constitute a large proportion of OOV words. As proper nouns are often primary keys in a query, their correct translation is often necessary to maintain good retrieval performance. Because proper nouns are spelling variants of each other in most languages, an approximate string matching technique against the target database index is usually used to find the target-language correspondents of the original query key. The n-gram technique has proven to be the most effective of the approximate string matching techniques. A more complicated issue arises when the languages involved have different alphabets. The usual approach is transliteration, which is applied based on phonetic similarities between the languages. However, transliteration by itself cannot guarantee the exact spelling of the transliterated words as found in the document collection: a transliterated word can be spelled in a variety of ways, despite any conventions that might exist. Because there is no single correct spelling of a transliterated word, a technique is needed that can generate the different spellings found in the document collection. In this study, we chose to combine transliteration and the n-gram technique in an English-Arabic CLIR system in which Arabic documents were searched using English queries. We evaluated the effectiveness of this approach and compared it with a statistical transliteration approach explored by researchers at the University of Massachusetts (UMass). We also explored two disambiguation approaches. In the first disambiguation approach, we tried to enhance our transliteration using part-of-speech (POS) disambiguation.
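As a rough illustration of the n-gram matching step described above, the following Python sketch scores a transliterated candidate against terms in a target-language index. It is not the dissertation's implementation: the abstract does not state the n-gram size or the similarity measure, so character bigrams and the Dice coefficient are assumptions, as are the function names, the threshold, and the small sample index. The candidate spelling is assumed to come from a separate transliteration step that is not shown.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams in a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over character n-gram sets (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(candidate, index_terms, threshold=0.5, n=2):
    """Return (term, score) pairs scoring at or above the threshold,
    best first. The threshold value is a hypothetical default."""
    scored = [(term, dice(candidate, term, n)) for term in index_terms]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical index terms; the first is a common Arabic spelling
    # of "Clinton", the others are unrelated words.
    index_terms = ["كلينتون", "لندن", "كتاب"]
    # Assumed output of a transliteration step for the query "Clinton".
    candidate = "كلنتون"
    for term, score in best_matches(candidate, index_terms):
        print(term, round(score, 3))
```

Index terms that clear the threshold can then replace or supplement the transliterated candidate in the final query, so that the spellings actually present in the collection are the ones searched.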
In the second disambiguation approach, thesaurus-based disambiguation was applied after the queries were expanded. WordNet was used to explore what benefits thesaurus-based disambiguation could bring to retrieval performance. This process is intended to filter out unrelated words and keep only those expanded terms that appear in the definitions of the original query terms, so that the final query includes only expanded terms directly related to the original query keys. Experimental results showed the great benefit of our transliteration approach and the improvement gained using disambiguation.
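The filtering idea in the second approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's method: a small hand-written dictionary of glosses stands in for WordNet, and the function names, the tokenization rule, and the sample definitions are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a gloss and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def filter_expansion(query_terms, expanded_terms, definitions):
    """Keep only the expanded terms that occur in a definition of
    at least one original query term."""
    allowed = set()
    for term in query_terms:
        for gloss in definitions.get(term, []):
            allowed |= tokenize(gloss)
    return [t for t in expanded_terms if t.lower() in allowed]


# Hypothetical glosses standing in for thesaurus definitions;
# these are not quoted from WordNet.
TOY_DEFINITIONS = {
    "bank": ["a financial institution that accepts deposits and lends money"],
    "loan": ["money lent to a borrower at interest"],
}

if __name__ == "__main__":
    query = ["bank", "loan"]
    expanded = ["money", "river", "deposits", "interest", "football"]
    print(filter_expansion(query, expanded, TOY_DEFINITIONS))
```

In this toy run, expansion terms such as "river" and "football" are dropped because they appear in none of the definitions of the original query terms.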
Keywords/Search Tags: Retrieval, Disambiguation, Original query, Transliteration, Words, Approach