Research On Information Retrieval Technology

Posted on: 2008-03-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S M Wang
Full Text: PDF
GTID: 1118360215998555
Subject: Computer applications
Abstract/Summary:
With the spread and rapid development of the Internet, the amount of online information has grown enormously, and organizing and processing it has become a challenge. Research on text classification and information retrieval helps people efficiently find the information they are interested in, that is, what they truly need within an ever-growing body of information. This dissertation discusses three aspects of information retrieval technology.

The first part discusses text classification techniques. We (1) propose the notion of semantic category and construct a dictionary with a graph structure, along with an algorithm that operates on this structure. Used as heuristic knowledge for text classification, the dictionary improves the system's ability to simulate inference and to process open corpora; (2) propose an algorithm that imitates human behavior. On the one hand, the algorithm rests on the observation that the content of a document can be told from its title, so it increases the weight of the title when the feature vector is processed; on the other hand, a weight parameter vector ω is designed to simulate human skimming and skipping behavior when calculating the center of a document cluster, and the weight of features that have more positive examples than negative ones is increased. Experiments show that the algorithm greatly improves the performance of a text classification system.

The second part discusses problems concerning Web pages, including: (1) A key technique for weighting index terms in information retrieval. Because search engines are designed to find the Web pages that the user needs, we exploit the features of pages written in HTML to weight the index. Experiments demonstrate that, compared with the traditional method (tf-idf), precision is improved when recall is low. (2) A new concept, the "Topic Keywords Set" (TKS). In online information retrieval the objects searched are Web pages, which are often small and present just one subject. TKS, together with an exploration of the relationships between words, is used to re-sort the result list by calculating the distance between the user's query and the TKS. (3) Query expansion is an effective way to improve information retrieval quality, and the selection of expansion words is its crucial and difficult step. By analyzing word co-occurrence, we propose a new method for evaluating the relevance between words. Experiments show that with this method the selected expansion words are relevant to the whole query, capable of representing the theme of the query, and effective in improving performance.

Finally, content-based multimedia information retrieval is discussed, on the basis of several descriptors defined in the MPEG-7 standard. We (1) propose a method that uses the MPEG-7 dominant color descriptor to extract key frames from scenes, along with an experiment; (2) present an experiment in key frame retrieval that takes advantage of the different search areas of the dominant color descriptor and the homogeneous texture descriptor; (3) apply the two results above to the material base of a "CG (Computer Graphics) production project management system".
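
The abstract does not give the formula used to evaluate word relevance for query expansion. As an illustration only, the idea of choosing expansion words that co-occur with the whole query rather than with a single query term might be sketched as follows. The Dice coefficient, the product-based score, and the names `cooccurrence_counts`, `relevance`, and `expand_query` are assumptions made for this sketch and are not taken from the dissertation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Return (pair_counts, doc_freq) computed at document level.

    pair_counts[(a, b)] is the number of documents containing both words
    (with a < b); doc_freq[w] is the number of documents containing w.
    """
    pair_counts = Counter()
    doc_freq = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        doc_freq.update(words)
        for a, b in combinations(sorted(words), 2):
            pair_counts[(a, b)] += 1
    return pair_counts, doc_freq


def relevance(a, b, pair_counts, doc_freq):
    """Dice coefficient of two words (an assumed stand-in measure)."""
    if a == b:
        return 1.0
    key = (a, b) if a < b else (b, a)
    both = pair_counts.get(key, 0)
    if both == 0:
        return 0.0
    return 2.0 * both / (doc_freq[a] + doc_freq[b])


def expand_query(query_terms, docs, k=2):
    """Pick up to k expansion words related to every query term.

    A candidate's score is the product of its relevance to each query
    term, so a word that never co-occurs with one of the terms scores
    zero and is dropped.
    """
    pair_counts, doc_freq = cooccurrence_counts(docs)
    query = [t.lower() for t in query_terms]
    scores = {}
    for cand in doc_freq:
        if cand in query:
            continue
        score = 1.0
        for q in query:
            score *= relevance(cand, q, pair_counts, doc_freq)
        if score > 0:
            scores[cand] = score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

With a toy corpus in which "jaguar" appears in both car-related and animal-related documents, calling `expand_query(["jaguar", "car"], docs)` drops words such as "cat" that co-occur with only one query term, which is the behavior the abstract describes as being relevant to the whole query.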
Keywords/Search Tags: Information Retrieval, Text Categorization, Words' Relevance, Search Engine, MPEG-7, Content-based Image Retrieval, Query Expansion, Machine Learning