Research On Data Retrieval Technology Based On Hybrid Index Structure In The DRC Of DOA

Posted on: 2016-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Xie
GTID: 2308330461456235
Subject: Software engineering
Abstract/Summary:
In the era of big data, data is both the foundation and the core of a system. Building an architecture around data can solve many architectural problems, such as system integration, system expansion, and data management. The DOA architecture was designed to meet this need. In a DOA system, all kinds of big data are stored and managed through XML documents that record metadata and are kept in the Data Registration Center (DRC). As the number of XML metadata documents grows rapidly, searching them quickly has become the primary task of the DRC in the DOA, and it is the topic of this thesis.

XML is a markup language with semantic structure. Owing to the advantages of its markup, it has become a de facto standard for transmitting, exchanging, and storing all kinds of data and information. An XML document can store both numeric and text data, so XML has become the main form of metadata storage in the DRC. In this thesis, the traditional vector space model, which is used for content keyword queries, is improved so that structural keywords can also be queried against this semi-structured data. To let a query match a fragment of an XML document by similarity, weights and weight vectors are designed for structural keywords, each of which is a fragment of an XML metadata document, so that accurate retrieval is achieved. However, the structure of XML metadata documents in the DRC is variable, heterogeneous, and complex.
How to construct an index for searching XML metadata, and how to use the hierarchical structural information of XML metadata efficiently during retrieval, are the two major problems that must be resolved first in this thesis.

To achieve XML metadata retrieval in the registry, the following work has been done.

(1) Research on the node coding mode and the index of XML documents in the DRC

This part studies XML-based metadata storage and retrieval techniques for searching XML documents in the DRC. To search the keywords of XML metadata documents efficiently, an index structure must be established for them, and the coding mode and index structure of XML nodes are the basis of XML document retrieval. An effective node coding method is obtained by studying coding schemes for XML nodes. Based on this coding mode, an effective index structure scheme that supports both simple keyword retrieval and structured queries is studied. All of this work serves the goal of quickly retrieving massive metadata information in the DRC.

(2) Research on the vector space model and the retrieval processing algorithm

Owing to the diverse structure of XML documents, there are three kinds of XML metadata documents: data-centric, structured documents; text-centric, semi-structured documents; and documents that fall between the two. To improve the retrieval efficiency of XML documents, retrieval models and methods suited to XML documents in the DRC need to be studied. The study extends traditional content-based word retrieval to support keyword retrieval with structural and path constraints, with the following aims.
These aims all build on the traditional vector space model: to realize an extended vector space with subordinate relations, to use it for scoring and ranking matches in XML information retrieval, and to support information matching and retrieval of XML documents in the DRC. Moreover, the retrieval model supports different kinds of retrieval, such as content word retrieval, label word retrieval, and structure word retrieval.

The innovations and achievements are as follows:

(1) A hybrid coding scheme and index structure suited to different kinds of information retrieval over XML documents in the DRC.

Based on a study of various coding schemes, the DADG (Dewey And DataGuides) coding scheme is proposed, which overcomes some shortcomings of prior coding schemes and supports both content retrieval and structure retrieval of XML documents in the DRC. Based on this scheme, a hybrid index that combines a DataGuides index, a text index, and an element index is constructed. In this way, storage space is saved and retrieval speed is increased. Most importantly, the scheme effectively supports information retrieval of XML documents in the DRC.

(2) A vector space model and a corresponding processing algorithm, obtained by improving a function in the existing model.

The vector space model is extended by expanding the content keywords of the traditional model into constrained, structural keywords. The extension is supported by the aforementioned coding mode and is realized as follows. Firstly, the keyword query of the traditional vector space model, together with the exact membership of content keywords in the XML document, is transformed into a structure and path matching function whose value lies between 0 and 1, so that a structural keyword query can match fragments of an XML document.
Secondly, valid fragments of XML documents are returned after scoring and ranking by weight. Lastly, the processing algorithm for structural queries is realized on the basis of the extended retrieval model.
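
The ideas summarized above (Dewey-style node labels, a path matching function valued between 0 and 1, and ranking of fragments by weight) can be illustrated with a minimal Python sketch. It is not the DADG scheme or the scoring function of the thesis, whose details the abstract does not give. As assumptions, the sketch labels nodes with plain Dewey numbers, uses a context-resemblance measure of the kind found in the XML retrieval literature as the path matching function, and weights content by normalized term frequency. All function names and the sample document are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def dewey_fragments(root):
    """Yield (dewey_label, tag_path, text_terms) for every element.

    Assumption: plain Dewey labels (root = 1, its k-th child = 1.k, ...).
    """
    stack = [(root, (1,), (root.tag,))]
    while stack:
        node, label, path = stack.pop()
        yield label, path, (node.text or "").lower().split()
        for i, child in enumerate(node, start=1):
            stack.append((child, label + (i,), path + (child.tag,)))


def is_ancestor(a, b):
    """With Dewey labels, a is an ancestor of b iff a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_subsequence(short, long):
    """True if the tags of `short` occur in `long` in the same order."""
    it = iter(long)
    return all(tag in it for tag in short)


def context_resemblance(query_path, doc_path):
    """Path matching score in [0, 1] (assumed, not the thesis's function).

    Returns (1 + |q|) / (1 + |d|) when the query path can be turned into
    the document path by inserting tags, and 0.0 otherwise.
    """
    if is_subsequence(query_path, doc_path):
        return (1 + len(query_path)) / (1 + len(doc_path))
    return 0.0


def rank(root, query_path, query_terms):
    """Return (score, dewey_label, path) tuples, best match first."""
    q = Counter(t.lower() for t in query_terms)
    results = []
    for label, path, terms in dewey_fragments(root):
        d = Counter(terms)
        norm = math.sqrt(sum(v * v for v in d.values()))
        if norm == 0:
            continue  # element has no text of its own
        content = sum(q[t] * d[t] for t in q) / norm
        score = context_resemblance(query_path, path) * content
        if score > 0:
            results.append(
                (score, ".".join(map(str, label)), "/".join(path))
            )
    return sorted(results, reverse=True)


if __name__ == "__main__":
    # Hypothetical metadata document, for illustration only.
    sample = (
        "<dataset><title>river flow</title>"
        "<owner><name>river bureau</name></owner></dataset>"
    )
    for hit in rank(ET.fromstring(sample), ("dataset",), ["river"]):
        print(hit)
```

In this sketch, a fragment whose path is closer to the query path receives a higher structural score, so two elements with the same text can still be ranked differently. A hybrid index such as the one described above would replace the full traversal in `rank` with index lookups.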
Keywords/Search Tags: Data Registration Center, Hybrid Index, XML information retrieval