Algorithm Research For Text Information Retrieval Based On Web

Posted on: 2005-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: M J Zhong
Full Text: PDF
GTID: 2168360125958543
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the number of documents on the Internet is increasing exponentially, and handling this large volume of online documents has become an important research topic. Text information retrieval is the task of finding the documents in a collection that are most relevant to a user query. This thesis studies algorithms for Web-based information retrieval.

First, the thesis briefly introduces the development and techniques of information retrieval and, on this basis, analyzes content-based, link-based, and fusion-based retrieval algorithms.

Second, to avoid the low recall of content-based retrieval and the topic drift of link-based retrieval, a new algorithm based on hyperlinks and anchors is proposed that combines content-based and link-based retrieval. In this algorithm, Hub and Authority values are first calculated from the links between web pages; the relevance weight of each page is then obtained by matching the link anchor text or the document content against the query; finally, the retrieval results are ranked. The experimental results show that the new algorithm achieves much higher precision and recall.

To improve precision and reduce retrieval time, the thesis also puts forward an information retrieval algorithm based on classification and key phrase extraction. Compared with the traditional vector space model, this algorithm reduces time complexity and improves precision, and the experimental results confirm that it works well. A new criterion named ranking error is then introduced, because traditional performance evaluation methods cannot effectively evaluate the ranking of the retrieved documents. The experimental results indicate that the proposed algorithm outperforms TF*IDF and interactive retrieval based on classification in terms of ranking error.

Finally, combining the proposed algorithms and techniques, a prototype of an English, domain-based full-text information retrieval system is implemented.
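
The hyperlink-and-anchor algorithm summarized above can be illustrated with a small sketch. The abstract does not give the thesis's actual formulas, so everything here is an assumption made for illustration: Hub and Authority values are computed with the standard HITS power iteration, the relevance weight is taken to be the fraction of query terms found in a page's content or incoming anchor text, and the final score is simply their product. The function names `hits`, `relevance`, and `rank` are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Standard HITS power iteration.

    links maps each page to the list of pages it links to.
    Returns (hub, authority) dictionaries keyed by page.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub of a page: sum of authority scores of pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


def relevance(query, text, anchors):
    """Assumed relevance weight: share of query terms that appear in the
    page content or in the anchor text of links pointing to the page."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    for anchor in anchors:
        words.update(anchor.lower().split())
    return len(terms & words) / len(terms)


def rank(query, links, texts, anchors_in):
    """Assumed combination: authority score times relevance weight."""
    _, auth = hits(links)
    scored = {
        p: auth[p] * relevance(query, texts.get(p, ""), anchors_in.get(p, []))
        for p in auth
    }
    # Highest score first; ties broken by page name for determinism.
    return sorted(scored, key=lambda p: (-scored[p], p))
```

Matching against anchor text as well as page content is what lets a page be retrieved even when its own text omits the query terms, which is one plausible reading of how the combined approach addresses low recall. A multiplicative score is only one possible fusion rule; a weighted sum would be an equally reasonable assumption.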
Keywords/Search Tags: Text Information Retrieval, Vector Space Model, Link, Anchor, Key Phrase Extraction, Recall, Precision