Research On The Key Technology Of Information Retrieval In Content Aware Network Storage System

Posted on: 2013-11-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Liu
Full Text: PDF
GTID: 1228330392957274
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, heterogeneous resources have grown explosively in recent years, and useful data is often lost in the resulting ocean of information. Traditional query tools alone make it hard to locate the required information quickly. Because modern information retrieval can access information in large-scale systems accurately and efficiently, it is considered the best way to address this problem. Current research focuses on improving the intelligence of storage systems, enhancing search over heterogeneous information, and increasing query accuracy. However, because the storage and retrieval components are relatively independent of each other, it is hard for a storage system to understand data content and optimize data layout for fast retrieval. To bring information retrieval technology into storage research and maximize query efficiency, this dissertation considers a new mechanism for information organization, indexing, and retrieval, and provides an overall scheme for integrating retrieval capability into the storage system.

To address the lack of content-aware capability in storage systems, an information extension mechanism is proposed to transmit information through the storage stack. Upper-layer semantic information is extracted and stored as extended information. An advanced metadata I/O channel, built on the traditional data I/O channels, then transfers the extended data to the lower storage system. By analyzing this additional information, the storage system can recognize and use the upper-layer semantic information to optimize overall system performance. Based on the information extension mechanism, a content-aware network storage prototype system is implemented.

To use the semantic information and the duplicate-block information to deliver efficient query service to users, a two-phase retrieval strategy is introduced. Query requests in a storage system come from two sources: metadata retrieval issued by administrators and ordinary keyword queries issued by users. An indexing structure can improve query performance efficiently, but the de-duplication and block similarity detection functions of a content-aware storage system are not otherwise used to improve query processing. The proposed strategy combines metadata/keyword queries with block similarity queries and uses a ranking coefficient to evaluate the similarity among query results. The retrieval algorithm thereby improves retrieval recall.

An index partition mechanism and a query cost model based on tiered storage are proposed. The index space grows as the number of files in the storage system increases significantly. However, not all index entries are accessed equally often, and some are never retrieved after being generated. The index is therefore segmented by access frequency, and inactive index segments are stored on low-speed storage devices to save cost. The mechanism takes index partition performance, index space cost, and query precision into account.

A correlation graph construction method based on content hashes is proposed to satisfy users' complex query requests. Storage systems typically organize and manage data in a hierarchical structure, in which specific information is passed from one layer to another through a standard interface. This hides from each layer the information that does not concern it, but it also constrains the free flow of information between levels. To establish a special hyperlink data structure in the storage system and generate a global feature that meets users' complex query requests, the barriers between levels are broken down and a stable correlation graph is established.

Users cannot get the desired results when the submitted query terms are too broad or too narrow, so the ranking algorithm in the two-phase query mechanism needs to be extended. The enhanced algorithm adapts the information retrieval method to measure the similarity of query results in the storage system. The correlation graph and the block similarity algorithm based on de-duplication technology are used to sort the query results. This solution better reflects the characteristics of the internal data structure, reduces the query failure rate, and improves the recall rate.

Guided by the above research methods, and through prototype modeling, algorithm design, theoretical analysis, and experimental verification, content awareness and information retrieval technology are integrated into the storage system. Experiments indicate that storage intelligence and information retrieval capabilities are enhanced in the content-aware network storage system.
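
The two-phase retrieval strategy described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the abstract does not define its ranking coefficient, so the sketch assumes Jaccard similarity over sets of deduplicated block hashes, and every name in it (`StoredFile`, `block_similarity`, `two_phase_query`, `min_similarity`, the sample data) is hypothetical. Phase 1 matches keywords or metadata terms; phase 2 adds files that share enough blocks with a phase-1 hit.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class StoredFile:
    name: str
    keywords: FrozenSet[str]      # metadata/keyword terms indexed for the file
    block_hashes: FrozenSet[str]  # content hashes of its deduplicated blocks


def block_similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two block-hash sets (assumed ranking coefficient)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_phase_query(
    files: Iterable[StoredFile],
    terms: Iterable[str],
    min_similarity: float = 0.3,  # hypothetical cut-off for phase 2
) -> List[Tuple[str, float]]:
    files = list(files)
    wanted = frozenset(terms)

    # Phase 1: direct keyword/metadata hits, scored 1.0.
    hits = [f for f in files if wanted & f.keywords]
    scores = {f.name: 1.0 for f in hits}

    # Phase 2: add files that share enough blocks with any phase-1 hit.
    for candidate in files:
        if candidate.name in scores:
            continue
        best = max(
            (block_similarity(candidate.block_hashes, h.block_hashes) for h in hits),
            default=0.0,
        )
        if best >= min_similarity:
            scores[candidate.name] = best

    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data: report_v2 has no matching keyword but shares
# three of five distinct blocks with report_v1.
FILES = [
    StoredFile("report_v1", frozenset({"budget", "2013"}),
               frozenset({"h1", "h2", "h3", "h4"})),
    StoredFile("report_v2", frozenset({"draft"}),
               frozenset({"h1", "h2", "h3", "h5"})),
    StoredFile("photo", frozenset({"holiday"}), frozenset({"h9"})),
]

print(two_phase_query(FILES, ["budget"]))
# [('report_v1', 1.0), ('report_v2', 0.6)]
```

In this sketch the second phase is what raises recall: `report_v2` is returned for the query "budget" only because it shares blocks with a direct hit, which is the kind of result a keyword index alone would miss.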
Keywords/Search Tags: Content Aware Storage System, Information Retrieval, Indexing Partition, Deduplication, Information Lifecycle Management, Similarity Retrieval, Correlation Graph