Font Size: a A A

Improvement of the results' relevance of a web information retrieval system using automatic query expansion

Posted on:2009-09-18Degree:M.Sc.AType:Thesis
University:Ecole Polytechnique, Montreal (Canada)Candidate:Revello Giallorenzo, BeatrizFull Text:PDF
GTID:2448390002991163Subject:Engineering
Abstract/Summary:
Information retrieval is the technique used by search engines to retrieve documents. Query expansion is an information retrieval technique to ease the user's search. Given the user's query, the system will add some terms to the query in order to retrieve better results than the original user's query. PageRank is the algorithm behind Google's success. Taking advantage of the linked structure of the web, PageRank will associate a rank to each web page, measuring its popularity. A popular web page is a page referenced by many popular web pages. There is a technique named TF/IDF which is used in the vector model of an information retrieval system. It will rank the document's terms in order of rareness. The less a term appears in the document collection the higher the TF/IDF of that term.;The web information retrieval system we worked with is Google. The technique we used to improve Google's results is query expansion. We used the online encyclopedia Wikipedia to extract terms to expand the query. We extracted the Wikipedia links (set of terms), from the Wikipedia pages. We applied the technique of PageRank to the Wikipedia links to order the set of terms, instead of expanding the query adding the Wikipedia terms by order of appearance, we added them by order of page rank. Another technique we used is the one named QueryPagRank which will calculate the PageRank but of only those words related to the query.;We evaluated our approach in two different experiments. We first expanded the original queries adding Wikipedia links as set of terms by order of appearance, named Wikipedia phase. Second, ordering the Wikipedia terms by page rank, we expanded the original queries with the Wikipedia terms by page rank order, named PageRank phase. We also made an exploratory research, in which we calculated the TF/IDF of all the Wikipedia terms and expanded the original user's query by TF/IDF order, named TF/IDF phase.;To sumarize, we tested our system in two set tests, and contemplated one exploratory research. In the first test set we tried to improve the first ten Google results, we expanded 10 queries with 32 terms. The results were not very conclusive for any of the used techniques. In the second test set, we compare the results with the ten next pages of Google results. We expanded 30 queries with 32, 16 and 8 terms. Compared to Google next then results, we found an average of 11% in precision value. In the exploratory research, we wanted to test the quality of the terms that we added to the query to expand it, but manually, without sending the new query to Google. We ordered the Wikipedia terms by order of TF/IDF, per query. We tested for 30 queries, the results were not comparable between them, so we compared TF/IDF and PageRank expansion terms. We concluded that there is an inverse proportionality between TF/IDF and PageRank values.;The objective of this research project is to improve the result's relevance of a web information retrieval system using automatic query expansion.;An improvement of the first test set results would be to find the mathematical adjustment that would return those terms that we consider as suitable. To achieve this we might measure which is the range where we found the suitable terms by testing with some queries, and then find the mathematical adjustment to achieve it. Another improvement to perform might be to find the amount of terms to add to the query that return better results, maybe 32 terms is too much, and we are adding some noise to the query, alienating from the original user information request. We also question the TREC measures to establish whether a page has the right information for a given query. We could have induced bad results because of a bad query election. We also leaned in TREC to choose the queries, and we only chose proper nouns, which do not represent the all universe of user queries. An amelioration of the second test set would be to choose the more suitable number of terms depending on the used technique. To achieve this, we might run some more tests and see if the tested number of terms holds better results. For the exploratory research, we might suggest that the expected links are near the intersection between TF/IDF and PageRank values, which is around the value 10 (normalized).
Keywords/Search Tags:Query, Information retrieval, Results, TF/IDF, Terms, Used, Technique, Wikipedia
Related items