
Research On The Chinese Science And Technology Document Information Retrieval System Based On The Vector Space

Posted on: 2008-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: H L Lv
Full Text: PDF
GTID: 2178360215958225
Subject: Computer software and theory
Abstract/Summary:
Information retrieval systems in current use do not achieve high recall and precision, and their effectiveness varies from one document set to another. A system built to serve many kinds of document retrieval performs worse on each of them. It is therefore more effective to design separate retrieval systems for different kinds of documents than to make one system meet every retrieval need. This thesis analyzes the structure of the most frequently used type of scientific document and, based on its characteristics, aims to improve several aspects of information retrieval systems for Chinese scientific documents.
In this thesis, a scientific document is divided into five interdependent parts that reflect its contents: title, abstract, keywords, content and references. Different word segmentation methods, keyword extraction algorithms and document vector weights are applied to these parts accordingly.
The thesis first analyzes the keywords that index a document and, based on their characteristics, improves the segmentation dictionaries and stop lists so that they can be used to index Chinese scientific documents. Ambiguities can be recognized when both the forward and the reverse maximum match methods are applied. Because terms in the content are used repeatedly, leaving ambiguities unresolved there does not affect term frequency, so only the forward maximum match method is used for the content.
The retrieval system uses the vector space model as its retrieval model, with a location space added to it. The title, abstract, keywords, content and references are each treated as a separate space, so that keywords can be extracted and weighted separately in each space to form a location vector. The location vectors are then used to construct the document vector, and the document vectors make up the document space matrix.
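The forward maximum match step and the location-weighted document vector described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy dictionary, the `max_len` limit, the numeric values in `LOCATION_WEIGHTS`, and the rule of summing weighted term frequencies across locations are all assumptions made for the example, since the abstract does not state them.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical location weights; the abstract does not give the values used.
LOCATION_WEIGHTS = {
    "title": 3.0,
    "keyword": 3.0,
    "abstract": 2.0,
    "content": 1.0,
    "reference": 0.5,
}


def document_vector(parts, dictionary, stop_words=frozenset()):
    """Combine per-location term frequencies into one weighted term vector.

    Summing weight * term frequency over locations is an assumed combination rule.
    """
    vector = Counter()
    for location, text in parts.items():
        tokens = forward_max_match(text, dictionary)
        counts = Counter(t for t in tokens if t not in stop_words)
        for term, tf in counts.items():
            vector[term] += LOCATION_WEIGHTS[location] * tf
    return dict(vector)


# Toy dictionary and document, invented for illustration.
dictionary = {"信息", "检索", "信息检索", "系统", "向量空间", "模型"}
parts = {"title": "信息检索系统", "content": "向量空间模型的信息检索"}
vector = document_vector(parts, dictionary, stop_words={"的"})
```

Here a term that appears in the title contributes more to the document vector than the same term in the content, which is the effect the location space is meant to capture.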
The thesis normalizes the document matrix into a probability matrix, in the hope of reducing the perturbation of the document matrix and the effect of long documents on word frequency. Using the condition number, the thesis verifies that the perturbation of the probability matrix decreases to a great extent. Because recall and precision are always the measures used to assess information retrieval systems, the author also tries to propose a new method of assessing a retrieval system.
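The normalization and the condition-number check can be illustrated with a short NumPy sketch. It assumes that rows are documents and that each row is scaled to sum to 1; the abstract does not say which axis is normalized, and the matrix values below are made up. The function name `to_probability_matrix` is hypothetical.

```python
import numpy as np


def to_probability_matrix(doc_matrix):
    """Scale each document row so its entries sum to 1 (all-zero rows stay zero)."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    row_sums = doc_matrix.sum(axis=1, keepdims=True)
    return np.divide(
        doc_matrix,
        row_sums,
        out=np.zeros_like(doc_matrix),
        where=row_sums > 0,
    )


# Toy term-frequency matrix: rows are documents, columns are terms (invented values).
# The third row stands in for a much longer document.
raw = np.array([
    [12.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
    [40.0, 25.0, 90.0],
])
prob = to_probability_matrix(raw)

# The 2-norm condition number bounds how much relative perturbations can be amplified.
print("condition number, raw matrix:        ", np.linalg.cond(raw))
print("condition number, probability matrix:", np.linalg.cond(prob))
```

After normalization every document row carries the same total mass, so a long document no longer dominates through raw counts alone. Comparing the two printed condition numbers is one way to examine the sensitivity claim on a given matrix.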
Keywords/Search Tags:Information retrieval, Vector space model, Chinese word segmentation, Matrix perturbation, Offset distance