
Study On Massive Information Retrieval Model Based On Quotient Space Theory

Posted on: 2011-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S B Chen
Full Text: PDF
GTID: 1118360305972944
Subject: Computer application technology
Abstract/Summary:
With the widespread application of computers and the rapid development of the Internet, the volume of information is growing exponentially. Massive information retrieval faces two problems: first, how to extract useful information accurately from large-scale information resources without manually sifting through the many results a retrieval system returns; second, how to build an efficient retrieval method that can search massive information rapidly. Against this background, massive information retrieval has attracted great interest and has become one of the major topics in the field of information retrieval.

Inspired by human intelligence, which observes and analyzes problems at multiple levels and granularities, quotient space theory unifies the structure of multi-granularity objects using the mathematical concepts of sets and spaces, and establishes an object model for solving complex problems in practical engineering. Observing and analyzing a problem at a coarser granularity often makes the problem simpler and faster to solve, especially for large, complex problems. This dissertation takes massive information as its object and quotient space theory as its tool to study the problems of massive information retrieval. The main contents and innovations are as follows:

(1) Based on quotient space theory, a hierarchical structure for information resources and a hierarchical information retrieval method are proposed, and the time complexity of hierarchical retrieval is analyzed. The information resource structure is changed from the traditional single-layer structure to a hierarchical tree, with a feature value defined for each node.
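The abstract gives no code or concrete data structures; the following is a minimal sketch of the idea of a feature-valued tree searched by stepwise refinement. The cosine similarity, the beam width, and the use of centroid-style feature vectors are all assumptions for illustration, not details taken from the dissertation.

```python
import numpy as np

class Node:
    """A node in a hierarchical information tree.

    Inner nodes hold a feature vector summarising their subtree
    (here assumed to be something like a centroid of descendant
    document vectors); leaves hold individual documents."""
    def __init__(self, feature, children=None, doc_id=None):
        self.feature = np.asarray(feature, dtype=float)
        self.children = children or []
        self.doc_id = doc_id  # set only on leaves

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def hierarchical_search(root, query, beam=2):
    """Stepwise refinement: at each level keep only the `beam`
    children most similar to the query, so the retrieved field
    shrinks level by level instead of scanning the whole space."""
    query = np.asarray(query, dtype=float)
    frontier, results = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:              # leaf: score the document
                results.append((node.doc_id, cosine(node.feature, query)))
            else:                              # inner node: refine downward
                ranked = sorted(node.children,
                                key=lambda c: cosine(c.feature, query),
                                reverse=True)
                next_frontier.extend(ranked[:beam])
        frontier = next_frontier
    return sorted(results, key=lambda r: r[1], reverse=True)

# A toy two-cluster collection with illustrative 2-d feature vectors:
leaf = lambda f, d: Node(f, doc_id=d)
root = Node([0.5, 0.5], children=[
    Node([0.95, 0.05], children=[leaf([1.0, 0.0], "a"), leaf([0.9, 0.1], "b")]),
    Node([0.05, 0.95], children=[leaf([0.0, 1.0], "c"), leaf([0.1, 0.9], "d")]),
])
results = hierarchical_search(root, [1.0, 0.0], beam=2)
```

With a beam of 1, only one subtree per level survives, which is the extreme form of reducing the retrieved field; a wider beam trades speed for recall.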
This structure reveals the class characteristics of information resources at different levels, allows rapid transformation between granularities, and makes it easy to compare and compute between nodes, or between a node and the query vector. Traditional methods increase retrieval speed by adding processors; the hierarchical retrieval algorithm instead narrows the search to a retrieved field much smaller than the whole information space, using hierarchical stepwise refinement. By greatly reducing the retrieval field, hierarchical information retrieval handles massive information resources effectively.

(2) To establish the hierarchical structure of information resources, methods for creating the hierarchy and an algorithm for multi-granularity document granulation are investigated. Using agent technology and clustering technology, two methods for creating the hierarchy are proposed, together with an ontology-based representation and storage method. Then, based on this hierarchy, an equivalence relation and equivalence classes are defined to establish the quotient space of the information resource and derive the document granulation algorithm. Because granulation is strictly based on equivalence relations and equivalence classes when creating the quotient space, the hierarchical information database satisfies the falsity-preserving ("Guaranteed False") principle of quotient space theory, which lays the data foundation for hierarchical information retrieval.

(3) To solve the classification problem for massive documents, a classification method for massive multi-class documents is investigated from two aspects: training speed and multi-class classification. First, based on an analysis of traditional multi-class SVMs, a genetic algorithm is used to solve the coding problem of the ECC-SVM, and a new, efficient genetic-algorithm-based ECC-SVM is proposed.
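The granulation step above rests on a standard construction: an equivalence relation partitions a set into equivalence classes, and those classes form the quotient set at that granularity. As a rough sketch (the key functions and the two-level topic labels are invented for illustration; the dissertation's actual relations come from clustering and ontology structure):

```python
from collections import defaultdict

def quotient_partition(items, eq_key):
    """Granulate `items` into equivalence classes: items mapping to
    the same key under eq_key are equivalent, and the resulting
    classes form the quotient set at this granularity."""
    classes = defaultdict(list)
    for item in items:
        classes[eq_key(item)].append(item)
    return dict(classes)

# Illustrative documents tagged with a two-level topic path.
docs = [("d1", ("sports", "football")),
        ("d2", ("sports", "tennis")),
        ("d3", ("science", "physics"))]

# A coarser key function yields a coarser quotient space:
coarse = quotient_partition(docs, lambda d: d[1][0])  # top-level topic
fine   = quotient_partition(docs, lambda d: d[1])     # full topic path
```

The falsity-preserving property then says, informally: if no class in the coarse partition can contain an answer, no document in the finer partitions beneath it can either, which is what licenses pruning whole subtrees during hierarchical retrieval.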
Then, to speed up SVM training on large-scale sample sets, a method for reducing the SVM training samples in the original sample space is proposed. A new distance measure, the distance of k-nearest neighbors (k-DNN), is presented; between-class and within-class distances are defined from it, methods for noise identification and sample-importance evaluation are proposed, and an algorithm for reducing the SVM training samples is given. Because it takes the average distance between a sample and its k nearest samples, k-DNN is a more general form of the traditional distance: it avoids limitations such as contingency, noise sensitivity and distribution sensitivity, and makes the between-class and within-class distances more reasonable.

(4) Methods for personalized hierarchical information retrieval and dynamic multi-level user-interest extraction are investigated. To deliver different information to different users in different contexts, a model of personalized hierarchical information retrieval is proposed. Then, exploiting the hierarchical character of websites, an algorithm for extracting dynamic multi-level user interest based on ant colony optimization is presented. The algorithm readily provides more information about user interest at higher levels, and effectively overcomes the limitation of traditional mining, which captures only long-term interest and cannot track dynamic user interest, especially in the complex, dynamic Internet environment.
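The k-DNN measure as described (average distance from a sample to its k nearest samples) can be sketched directly; the Euclidean metric, the choice of k, and the within/between-class usage below are assumptions for illustration, since the abstract does not fix them.

```python
import numpy as np

def k_dnn(x, samples, k=3):
    """Distance of k-nearest neighbors (k-DNN): the average Euclidean
    distance from point x to its k nearest points in `samples`.

    Averaging over k neighbours makes the measure less sensitive to a
    single accidental or noisy nearest point than the plain
    nearest-neighbour distance."""
    x = np.asarray(x, dtype=float)
    dists = np.sort(np.linalg.norm(np.asarray(samples, dtype=float) - x, axis=1))
    return float(dists[:k].mean())

# Toy within-class vs between-class comparison for one point:
class_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2]]
class_b = [[5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
x = [0.05, 0.05]
within  = k_dnn(x, class_a, k=2)   # small: x sits inside class A
between = k_dnn(x, class_b, k=2)   # large: class B is far away
```

A sample whose within-class k-DNN is unusually large (an outlier) or whose between-class k-DNN is unusually small (near the boundary, hence informative for the SVM margin) can then be flagged by the noise-identification and importance-evaluation steps the dissertation describes.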
Keywords/Search Tags: massive information, hierarchical retrieval, quotient space, document classification, user interest mining