XML Keyword Search Based On Result Type Grouping

Posted on:2012-04-15

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Wang

Full Text:PDF

GTID:2218330338473212

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Internet applications, the web has become an enormous information space. Faced with such vast and complex information resources, people cannot find valuable information unaided and must rely on external tools. Web search engines emerged to meet this need and play a very important role in helping people obtain the information they need from the Internet. However, as the amount and variety of information increase sharply, existing web search engines can no longer meet the growing demand. Currently, the most popular search engines are keyword-based, and their query results are entire HTML pages. Such a result contains not only the information needed but also a good deal of worthless content, such as advertising. In contrast, the query results of information retrieval over XML data contain only the information needed, in the form of thousands of pieces of data relevant to the query target. Information retrieval over XML data has made some progress in structured querying, for example with XQuery. Compared with structured query languages, however, the main advantage of XML keyword search is that users need neither learn a complex query language nor understand the underlying data structure of the XML documents: the user only enters some keywords of interest, and the XML keyword search engine completes the query. XML keyword search has therefore become an active research topic.
This paper argues that the logical structure of a complete information retrieval model can be divided into two parts: how to obtain the query results, and how to rank the search results by similarity. Both parts also depend on some common infrastructure.
First, because of the distinctive structure of XML documents, each node in a document must be assigned a code that both identifies the node uniquely and reflects the structural relationships between nodes. The paper therefore adopts Dewey encoding for XML documents and also uses it to carry out some simple operations between nodes. Second, the search engine uses certain information together with the corresponding node data, which is called the inverted index, so a suitable container is needed to store the index. Because an embedded database lets the inverted index link seamlessly with the application process, we implement it with the embedded database Berkeley DB. The inverted index then runs in the same address space as the application, which eliminates the overhead of client/server configuration: the application does not need to establish a network connection to a database service beforehand, and instead saves, queries, modifies, and deletes data through the Berkeley DB library embedded in the program. As a result, the time spent accessing the inverted index can be ignored during the experiments, which reduces the negative impact of the inverted index on the main experiment.
For obtaining query results, the paper introduces several important semantics for XML keyword search and their corresponding query processing algorithms. By comparing the query results produced under these semantics and summarizing their problems, the paper puts forward three rules that high-quality search results must satisfy. Based on these three rules, we propose a new concept. First, at the macro level, the entropy weighting method is used to determine the type of the query results, so that the results match the basic search intent. Then, at the micro level, all the XML nodes are grouped so that each group contains complete information.
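The Dewey encoding and inverted index described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names are hypothetical, labeling the root as 1 is an assumption, and a plain in-memory dictionary stands in for Berkeley DB. In Dewey encoding each node's label is its parent's label plus its position among its siblings, so ancestor tests and lowest common ancestors reduce to prefix operations on the labels.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def dewey_labels(root):
    """Yield (label, element) pairs; a label is a tuple such as (1, 2, 1).

    Assumption: the root is labeled (1,) and children are numbered from 1.
    """
    stack = [((1,), root)]
    while stack:
        label, node = stack.pop()
        yield label, node
        for position, child in enumerate(node, start=1):
            stack.append((label + (position,), child))


def is_ancestor(a, b):
    """True if the node labeled a is a proper ancestor of the node labeled b."""
    return len(a) < len(b) and b[:len(a)] == a


def lowest_common_ancestor(a, b):
    """The LCA label is the longest common prefix of the two labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def build_inverted_index(root):
    """Map each lowercase keyword to the sorted Dewey labels of the
    nodes whose own text contains it (a dict stands in for Berkeley DB)."""
    index = defaultdict(list)
    for label, node in dewey_labels(root):
        for word in set((node.text or "").lower().split()):
            index[word].append(label)
    for labels in index.values():
        labels.sort()
    return dict(index)


doc = ET.fromstring(
    "<bib>"
    "<book><title>XML search</title><author>Wang</author></book>"
    "<book><title>Keyword search</title><author>Li</author></book>"
    "</bib>"
)
index = build_inverted_index(doc)
```

In this toy document the keyword "search" maps to the two title nodes, (1, 1, 1) and (1, 2, 1), and the lowest common ancestor of the nodes matching "xml" and "wang" is the first book, (1, 1).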
In addition, we designed a set of comparative experiments covering both search quality and the efficiency and stability of the result-obtaining algorithm. The experimental data show that the three rules, together with the entropy weighting method for determining the type of the query results, are highly feasible.
For relevance ranking, the paper introduces the traditional similarity measure based on flat documents, which is the foundation of similarity measure research, and the most recently proposed XML-based similarity measure, which takes the structural characteristics of XML documents into account but, because the algorithm is recursive, has certain defects in efficiency and stability. In view of this, the paper designs a new XML-based similarity measure that builds on the traditional flat-document similarity measure and the concept of virtual groups. The method not only takes the structural characteristics of XML documents into account but also keeps the scale of the computation within a controlled range. To demonstrate the validity of the algorithm, the paper conducts comparative experiments on ranking quality and on the efficiency and stability of the ranking algorithm. The experimental data show that the method significantly improves efficiency and stability.
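As background for the flat-document similarity measure mentioned above, the following minimal sketch ranks documents by TF-IDF cosine similarity. The abstract does not say which flat-document measure the thesis builds on, so the choice of TF-IDF with cosine similarity, the smoothed IDF formula, and all function names are assumptions for illustration only. The virtual-group extension is not shown, because the abstract does not define it.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency, log(1 + N / df), per term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def tf_idf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection get weight 0."""
    tf = Counter(text.lower().split())
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    idf = idf_table(docs)
    query_vec = tf_idf_vector(query, idf)
    scores = [(cosine(query_vec, tf_idf_vector(doc, idf)), i)
              for i, doc in enumerate(docs)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = ["xml keyword search", "web search engine", "berkeley db index"]
order = rank("xml search", docs)
```

Here the first document shares both query terms and ranks first, the second shares only "search", and the third shares none, so `order` is `[0, 1, 2]`.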

Keywords/Search Tags:

entropy weighting method, result type, virtual group, XML

Related items

1	The Research Of Text Feature Weighting Method Based On Information Entropy
2	Research On Chi-square Statistic Feature Selection Method And TF-IDF Feature Weighting Method For Chinese Text Classification
3	Exploring Entropy-based Term Weighting Schemes In Latent Dirichlet Allocation
4	Study Of Search Method Based On Group Characteristics
5	Simple Type Eos System Design And Its Key Technology
6	Research And Implementation On Variable Weighting In K-means Type Clustering
7	Research On Algorithm Of Feature Selection And Weighting In Text Classification
8	Research On Credibility Assessment Methods Of Complex Simulation Experiment Results
9	Study Of Weighting Fuzzy Clustering Algorithm Based On Generalized Entropy
10	Random Weighting Network Based On Cross Entropy