
Similarity Search Based On Textual Content

Posted on: 2011-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Clotilde Uwimana
Full Text: PDF
GTID: 2178360308468554
Subject: Computer Science and Technology
Abstract/Summary:
Today, search engines are the primary way for people to search for and locate information, and web searching has become one of the major tasks performed by a large population of web users. Although web searching has been a great success and an effective means of retrieving information, methods for retrieving different kinds of information are still needed for various applications. Understanding the goals behind web searches provides an outlook for future improvements to web search engines. Users need recommendation information to support decision making and to spare them the tedious work of browsing or reading through an entire collection when looking for objects similar to a query object. Our work therefore analyzes the content of pages to retrieve information about similar objects.
This thesis looks into the concept of web searching, focusing on similarity search technology. Similarity search refers to searching for objects similar to a query object. Given a user query, which is an object, the system searches the web to find similar objects that are relevant, meaning objects that share attributes or properties with the query object. Starting from the scenario of a user seeking information about similar places, a new approach is modeled and analyzed; the challenge is to determine the properties of places from a large collection of documents whose information is not well structured.
This thesis evaluates techniques suitable for finding results about places similar to the initial query. We propose an approach based on term extraction, in which we link the initial query place to its similar places through the terms that occur frequently in its search results. The extracted terms (the top-k terms) are deemed to be the common properties and are used as the subsequent query performed by the system to get the final results.
However, weighting the terms is not sufficient on its own to obtain results for similar places. We also need to check the results returned by the top-k terms query and eliminate documents that are more relevant to the initial query, since we are looking for results about similar places rather than results for the initial query. The evaluation shows that the approach responds to users' information needs: the method retrieves relevant properties and yields good precision. The analysis also revealed the importance of filtering out documents relevant to the initial query to improve relevancy. We also identify factors that affect performance: the nature of the query and the number of terms selected as properties of the initial query play an important role in the relevancy of the final results.
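The two steps described in the abstract, extracting the top-k terms from the initial query's search results and then filtering the second round of results, might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the abstract does not specify the term-weighting scheme or the relevance check, so the sketch substitutes plain document frequency for the weighting and a simple "mentions the query place" test for the filter. The function names, the stopword list, and the sample snippets are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not from the thesis).
STOPWORDS = {"a", "an", "and", "by", "for", "in", "is", "its", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_k_terms(documents, query, k):
    """Return the k terms that appear in the most documents.

    Stopwords and the words of the query itself are ignored.
    Document frequency stands in for the thesis's unspecified weighting.
    """
    excluded = STOPWORDS | set(tokenize(query))
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)) - excluded)
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def filter_initial_query(documents, query):
    """Drop documents that contain every word of the initial query.

    This is a crude stand-in for "more relevant to the initial query".
    """
    query_tokens = set(tokenize(query))
    return [doc for doc in documents if not query_tokens <= set(tokenize(doc))]


# Made-up search-result snippets for the initial query place.
initial_results = [
    "Venice is a city of canals and gondolas",
    "Canals and bridges make Venice romantic",
    "Venice canals attract tourists",
]
terms = top_k_terms(initial_results, "Venice", k=2)
print(terms)  # ['canals', 'attract']

# Made-up results of the follow-up query built from the extracted terms.
second_results = [
    "Amsterdam has canals and bridges",
    "Venice canals tour",
]
print(filter_initial_query(second_results, "Venice"))  # ['Amsterdam has canals and bridges']
```

In this toy run the term "canals" links Venice to Amsterdam, and the filter removes the snippet that is still about Venice. The choice of k matters here as well: with k=2 the second term is only an alphabetical tie-break among terms seen once, which echoes the abstract's remark that the number of selected terms affects the relevancy of the final results.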
Keywords/Search Tags: Information search and retrieval, web searching, similarity search, textual content