The Design and Implementation of a Fast Index File Structure and a Full-Text Information Retrieval System

Posted on: 2010-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wang
GTID: 2178360302466155
Subject: Software engineering
Abstract/Summary:
The emergence and development of the World Wide Web (the Web) have brought people abundant information, and with it many opportunities and challenges for the field of Web information retrieval. The rapid development of the Internet has brought people into an age filled with information of every kind. A large quantity of digital information is being generated, of which text is the most basic and most frequently used form. People urgently need an efficient retrieval tool to find the items they need in massive amounts of text, so how to store and search unstructured text data efficiently is a problem worth researching. Within this area, full-text information retrieval technology has become a focus of research for scholars in China and abroad. The main content of this paper is the design and implementation of an index file structure for full-text information retrieval.

Full-text information retrieval is a retrieval method in which an indexing program scans the text and builds an index entry for every word, recording how many times and at which positions the word occurs, sometimes together with a score. When a user submits a query, the retrieval program searches the index built in advance and returns the results to the user. The process is similar to looking up a character through the index table of a dictionary.

Full-text information retrieval consists mainly of character-based retrieval and word-based retrieval. Character-based retrieval builds an index for every character of the text and, at query time, decomposes each word into a combination of characters. What a character means differs from language to language. In English, for example, a character and a word are effectively the same, whereas in Chinese they are entirely different. Word-based retrieval builds an index over words, that is, semantic units, and retrieves by word, including synonyms.
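The indexing idea described above (every token mapped to the documents and positions where it occurs) can be illustrated with a minimal sketch. This is not the index file structure of the thesis, which the abstract does not specify; the function names `build_positional_index`, `word_tokens`, `char_tokens`, and `phrase_search_chars` are hypothetical, and the whole index is held in memory for simplicity.

```python
from collections import defaultdict


def build_positional_index(docs, tokenize):
    """Map each token to {doc_id: [positions]} over the given documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token][doc_id].append(position)
    return index


def word_tokens(text):
    # Word-based indexing for whitespace-delimited languages such as English.
    return text.lower().split()


def char_tokens(text):
    # Character-based indexing: every non-space character is its own token.
    return [ch for ch in text if not ch.isspace()]


def phrase_search_chars(index, query):
    """Return ids of documents in which the query's characters are adjacent."""
    chars = char_tokens(query)
    if not chars or any(ch not in index for ch in chars):
        return set()
    hits = set()
    for doc_id, starts in index[chars[0]].items():
        for start in starts:
            # Character i of the query must sit at position start + i.
            if all(start + i in index[ch].get(doc_id, [])
                   for i, ch in enumerate(chars)):
                hits.add(doc_id)
                break
    return hits


english = build_positional_index(
    {1: "full text retrieval", 2: "text index and text search"}, word_tokens)
chinese = build_positional_index(
    {1: "全文检索系统", 2: "索引文件检查"}, char_tokens)
```

With the word-based index, the number of occurrences of a term in a document is simply the length of its position list, for example `len(english["text"][2])`. With the character-based index, a Chinese query such as "检索" is decomposed into its characters and matched through their adjacent positions, so no word segmentation is needed at indexing time.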
Word processing for Western languages, including English, is similar to character processing, and adding synonyms is easy. For Eastern languages, however, characters and words must be distinguished for retrieval. This remains a difficult point for current full-text retrieval technology, especially for Chinese full-text retrieval.

The key to full-text information retrieval is the design and implementation of the index file structure. A simple and efficient index file is the foundation for improving retrieval efficiency and quality. From the traditional information retrieval (IR) technology used for book search to the full-text retrieval technology applied in large-scale search engines, the field has undergone a great change from searching structured data to searching unstructured data. Along the way, the principal problems have been:

I. What data structure should the index file use?
II. How can an index be built for billions of web pages?
III. How can the index files be searched and the requested information returned to users in a timely manner?

Commercial search engines such as Google and Baidu have invested large amounts of manpower and material resources in developing more efficient index file structures. In current applications, many search engines use the inverted index because of its high efficiency. How to further improve indexing quality and efficiency on the basis of the existing inverted index file structure remains an important research subject in the search engine field, both in China and abroad.

Starting from the concepts and models of full-text information retrieval, this paper describes in detail the working principle of a search engine's indexing system and the index file structure it uses. By introducing the traditional inverted index file and inverted list, the paper points out their shortcomings and the urgent need to fundamentally improve indexing and retrieval speed.
To address this problem, the system improves on the traditional inverted index structure and implements its own inverted index file structure, yielding a more efficient index file system.

This paper focuses on the analysis and design of a full-text information retrieval system. Information on the Web, especially text, is currently expanding at high speed. It offers people more and more resources but also makes it much harder to find the information they need. Providing users with the information they need quickly and accurately is the basic service of a modern search engine, and the full-text information retrieval system, together with its index file structure, is the foundation of that service. This paper therefore proposes an index file structure aimed at plain text documents and HTML web pages, the document types most commonly used on the Web at present. The dynamic balanced-tree merging strategy and incremental indexing algorithm of this index file structure greatly increase indexing speed and satisfy the requirements of indexing massive data.

On the practical side, this paper then develops a full-text information retrieval system. The system is a Windows console application developed with Visual Studio 2005. It follows an object-oriented design (OOD) approach, and coupling among its components is low. Modules in both the application layer and the core layer can be modified, upgraded, and replaced individually. The whole system is divided into an index subsystem, a retrieval subsystem, a memory subsystem, and a plug-in management subsystem.

The result of indexing is an index file, which the memory subsystem stores so that it can be queried during retrieval. Plug-in management extends the system's functions and increases its flexibility.
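The abstract does not describe how the dynamic balanced-tree merging strategy or the incremental indexing algorithm work, so the following is not the thesis's method. As a rough illustration of the general idea of incremental indexing with balanced merges, here is a sketch of logarithmic merging, a well-known technique in which new documents enter small segments and two segments are merged only when they are the same size. The names `merge_segments` and `IncrementalIndex` are hypothetical, and segments are kept in memory rather than in files.

```python
def merge_segments(a, b):
    """Merge two segments (term -> sorted doc-id list) into a new segment."""
    merged = {term: list(postings) for term, postings in a.items()}
    for term, postings in b.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(postings))
    return merged


class IncrementalIndex:
    """Keeps at most one segment per level; level k covers 2**k documents."""

    def __init__(self):
        self.levels = {}  # level number -> segment

    def add_document(self, doc_id, text):
        # A new document starts as its own one-document segment.
        segment = {term: [doc_id] for term in set(text.lower().split())}
        level = 0
        # Carry upward like binary addition: equal-sized segments merge.
        while level in self.levels:
            segment = merge_segments(self.levels.pop(level), segment)
            level += 1
        self.levels[level] = segment

    def search(self, term):
        # A query has to consult every live segment.
        hits = set()
        for segment in self.levels.values():
            hits.update(segment.get(term, []))
        return sorted(hits)


index = IncrementalIndex()
for doc_id, text in enumerate(
        ["full text index", "inverted index file", "text retrieval",
         "web pages", "text search"], start=1):
    index.add_document(doc_id, text)
```

Because a segment is merged only with another of the same size, each document takes part in a number of merges that grows logarithmically with the collection size, and new documents become searchable without rebuilding the whole index.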
The system applies the index file structure designed above to improve retrieval efficiency, providing adequate back-end support for search engines that serve users.

The system is a Chinese segmentation system with fairly complete functionality, but it still has some shortcomings. First, the scoring module needs improvement. The current scoring scheme is very complex, and the scoring criteria need further consideration to make them fairer and more effective. Second, the compression principles and methods used by other retrieval files should be studied further and optimized upon. Compressing inverted files helps increase query throughput, because reading and decompressing a compressed inverted index can often take less I/O time than reading an uncompressed one. Finally, the system's index merging algorithm should be further studied and improved to raise its indexing efficiency.
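To make the point about inverted file compression concrete, here is a minimal sketch of one common scheme: storing a sorted posting list as gaps between consecutive document ids and encoding each gap with variable-byte codes. The abstract does not say which compression method the author has in mind, so this choice is an assumption, and the function names are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # the low-order group is written last and ends the number
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, value = [], 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(value)
            value = 0
    return numbers


def compress_postings(doc_ids):
    """Store a sorted posting list as variable-byte encoded gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 300, 100000, 100005]
packed = compress_postings(postings)
```

In this example the five document ids, which would occupy 20 bytes as fixed 4-byte integers, are stored in 8 bytes. Fewer bytes read from disk is the source of the I/O saving mentioned above, at the cost of a decoding pass.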
Keywords/Search Tags: Index file structure, Information retrieval, Full-text index, Memory optimization