Research On The Chinese Science And Technology Document Information Retrieval System Based On The Vector Space

Posted on:2008-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:H L Lv

Full Text:PDF

GTID:2178360215958225

Subject:Computer software and theory

Abstract/Summary:

Existing information retrieval systems achieve limited recall and precision, and their effectiveness varies from one document collection to another. A system designed to serve every kind of document retrieval tends to perform worse on each. It is therefore more effective to design separate retrieval systems for particular kinds of documents than to make one system meet all retrieval needs. This thesis analyzes the structure of the most commonly used kind of scientific document and, based on its characteristics, aims to improve several aspects of information retrieval for Chinese scientific documents.

The thesis divides a scientific document into five interdependent parts that reflect its content: title, abstract, keywords, body text, and references. Different word segmentation methods, keyword extraction algorithms, and document vector weights are applied to each part accordingly.

The thesis first analyzes the keywords that index a document and, based on their characteristics, improves the segmentation dictionary and stop list so that they can be used to index Chinese scientific documents. Segmentation ambiguities can be detected when both forward and reverse maximum matching are applied. In the body text, terms are repeated many times, so leaving ambiguities unresolved does not materially affect term frequency; only forward maximum matching is therefore used for the body text.

The retrieval model is the vector space model. Within it, the thesis defines a location space: the title, abstract, keywords, body text, and references are each treated as a separate space. Keyword weights are calculated separately in each space to form location vectors. The location vectors are then used to construct the document vector, and the document vectors make up the document space matrix.
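
The forward and reverse maximum matching mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the tiny dictionary, the `max_len` limit, and the rule of flagging an ambiguity whenever the two segmentations differ are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment left to right, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, dictionary, max_len=4):
    """Segment right to left, taking the longest dictionary word ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def is_ambiguous(text, dictionary, max_len=4):
    """Assumed rule: the text is ambiguous if the two directions disagree."""
    return (forward_max_match(text, dictionary, max_len)
            != reverse_max_match(text, dictionary, max_len))


# Hypothetical dictionary and input, chosen to produce a disagreement.
words = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
print(forward_max_match(sentence, words))  # ['研究生', '命', '起源']
print(reverse_max_match(sentence, words))  # ['研究', '生命', '起源']
print(is_ambiguous(sentence, words))       # True
```

In this sketch, a part of the document where ambiguity matters could run both directions and compare them, while the body text would call only `forward_max_match`, in line with the choice described above.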
The thesis normalizes the document matrix into a probability matrix in order to reduce the perturbation of the document matrix and the effect of document length on term frequency. Using the condition number, it shows that the perturbation of the probability matrix is greatly reduced. Because recall and precision are the measures usually used to evaluate an information retrieval system, the author also attempts to propose a new method for evaluating retrieval systems.
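
The effect of the normalization on document length can be shown with a small Python sketch. The abstract does not say how the probability matrix is built, so this sketch assumes that each row is one document and that each row is scaled to sum to 1; the function name and the sample counts are hypothetical. The condition number analysis is not reproduced here.

```python
def to_probability_matrix(doc_term):
    """Scale each document row so that its term weights sum to 1.

    Assumption: rows are documents and columns are terms.
    Rows that contain only zeros are returned as zeros.
    """
    result = []
    for row in doc_term:
        total = sum(row)
        if total:
            result.append([weight / total for weight in row])
        else:
            result.append([0.0] * len(row))
    return result


# Hypothetical term counts: the second document repeats the first ten times.
counts = [
    [2, 1, 1],
    [20, 10, 10],
]
probabilities = to_probability_matrix(counts)
print(probabilities[0])  # [0.5, 0.25, 0.25]
print(probabilities[1])  # [0.5, 0.25, 0.25]
```

Under this assumption, a long document and a short document with the same term proportions map to the same row, which is one way the normalization can remove the influence of document length on term frequency.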

Keywords/Search Tags:

Information retrieval, Vector space model, Chinese word segmentation, Matrix perturbation, Offset distance

Related items

1	Improved Vector Space Model And Its Application To Document Classification System
2	Research And Implementation On Intelligent Information Retrieval Based On Classification
3	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
4	Research And Implementation Of Text Categorization System Based On VSM
5	The Design And Implementation Of A Chinese Organization Names Retrieval System
6	The Chinese Web Page Filtering System Based On Content Security
7	Html Tags And Chinese Segmentation-based Web Index And Implementation
8	Research On Chinese Word Segmentation For Large Scale Information Retrieval
9	Research And Implementation Of Chinese Word Segmentation System For Enterprise Information Retrieval
10	Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery