
The Study On Ranking And Similarity Calculation In Information Retrieval

Posted on: 2009-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: P Yan
GTID: 2178360245995010
Subject: Computer system architecture
Abstract/Summary:
As society becomes increasingly information-driven, people's information needs keep growing, and how to access useful information quickly and efficiently has become a central concern. Research on information retrieval helps people find the information they are interested in and obtain useful knowledge.

The core issue of information retrieval is predicting the relevance of documents and ranking documents by that relevance; in general, the document ranked first is considered the most relevant. The calculation of relevance and the ranking algorithm are therefore the main problems of information retrieval. Traditional information retrieval mainly uses the vector space model to calculate relevance, and the same model is also used in Web information retrieval. Compared with ordinary documents, however, Web pages have many distinctive features, such as the URL, HTML tags, anchor text, and in-degree. Web pages are also connected by hyperlinks, and analyzing these links can improve the ranking of search results. The Deep Web is a special kind of Web resource whose information is stored in databases that users can reach only through pages containing database forms. These pages contain little text and have few links between them, so relevance methods designed for general Web pages give very poor results.

This thesis focuses on Web and Deep Web information retrieval and covers the following aspects:

1. We built a full-text retrieval system based on the vector space model. On this system we tested how HTML tags, anchor text, and in-degree can be used to improve the calculation of relevance. We also analyzed the URL features of Web pages and developed a method for re-ranking search results. The system performed well in SEWM2007.

2. To exploit the links between Web pages, a topic-oriented PageRank algorithm is proposed. The new algorithm takes the following factors into account: the relevance of a page's content to the topic, a topic-based classification of the links between pages, and the importance of the pages themselves. Experiments show that, for two given topics, the new algorithm outperforms the PageRank algorithm in terms of P@10 and user acceptance.

3. Two methods of calculating the semantic relevance between Deep Web databases are proposed. The first method is based on the vector space model, but the semantic distance between two databases is calculated from both the distance between the text content of the HTML pages and the distance between the database forms embedded in those pages. Hierarchical fuzzy sets are used, and a unification step for database attributes is proposed in which attribute labels that are semantically close are replaced with a common representative. The second method is based on ontology theory and fuzzy sets: the database forms are translated from vectors into concept fuzzy sets, and the similarity between databases is calculated as the necessity degree of matching between the fuzzy sets. Classification and clustering algorithms, respectively, are used to test the new methods. Experiments show that the two new semantic methods perform better than traditional ones.
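The abstract does not give the formulas of the topic-oriented PageRank algorithm, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a precomputed per-page topic relevance score in [0, 1] and biases both the teleport step and the link-following step toward on-topic pages. The function name `topic_biased_pagerank`, its parameters, and the example graph are hypothetical.

```python
def topic_biased_pagerank(links, relevance, damping=0.85, iterations=50):
    """Sketch of a topic-biased PageRank (illustrative assumption only).

    links:     dict mapping a page to the list of pages it links to
    relevance: dict mapping a page to its topic relevance in [0, 1]
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})

    # Teleport distribution: proportional to topic relevance,
    # falling back to uniform when no page is on topic.
    total = sum(relevance.get(p, 0.0) for p in pages)
    if total > 0:
        teleport = {p: relevance.get(p, 0.0) / total for p in pages}
    else:
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1 - damping) * teleport[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            weights = [relevance.get(t, 0.0) for t in targets]
            weight_sum = sum(weights)
            if weight_sum == 0:
                # No outlinks, or none on topic: spread this page's
                # rank according to the teleport distribution.
                for q in pages:
                    new_rank[q] += damping * rank[p] * teleport[q]
            else:
                # Split rank among outlinks in proportion to how
                # relevant each target page is to the topic.
                for t, w in zip(targets, weights):
                    new_rank[t] += damping * rank[p] * w / weight_sum
        rank = new_rank
    return rank


# Hypothetical four-page graph with made-up relevance scores.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
example_relevance = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.0}
ranks = topic_biased_pagerank(example_links, example_relevance)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this sketch a page with zero relevance and no inbound links never receives any rank. A real system would probably smooth the relevance scores so that off-topic pages are down-weighted rather than excluded.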
Keywords/Search Tags: relevance calculation, link analysis, ranking, semantic similarity, degree of matching