
Research On Document Retrieval Technology Based On Graph

Posted on: 2017-10-08  Degree: Master  Type: Thesis
Country: China  Candidate: L N Wang
GTID: 2348330518970793  Subject: Software engineering
Abstract/Summary:
With the development of computer technology and the Internet, information retrieval has become an essential part of daily life. Graph data is now in wide use: as the Internet and big data grow, more and more applications produce graph data, and research on graph data has become an active trend.

The main task of document retrieval is to compute the similarity between the user's query and each document, and to return the documents to the user ranked by that similarity. The vector space model (VSM) is the basic model in information retrieval, and in document retrieval as well; many popular document retrieval systems are still based on it. VSM treats terms as independent items, which severs the relationships between terms during retrieval, although terms in actual text are related to one another. As a result, a VSM-based system may assign a high similarity to a document that is not sufficiently relevant to the query or, worse, that conveys the opposite meaning. Graphs are widely applied because they can express relationships directly through nodes and edges.

To address this problem, the paper proposes a graph-based document retrieval method. Queries and documents are represented as graphs, and the similarity between a query and a document is quantified by computing the similarity between the query graph and the document graph. First, drawing on research on dependency parsing and part-of-speech tagging in natural language processing, the paper proposes a graph model for text, based on dependency parsing, that represents texts as graphs. To limit the overhead of graph computation, the paper also introduces the concept of the document semantic unit; the size of the semantic unit serves as the granularity for graph construction.
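The term-independence limitation of VSM described above can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and whitespace tokenization, which are simplifications chosen for illustration and not the weighting used in the thesis; the function name `cosine_similarity` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a bag-of-words model.

    Assumption: raw term counts and whitespace tokenization (no TF-IDF,
    no stemming), purely to illustrate term independence.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two sentences contain the same terms but state opposite events.
query = "dog bites man"
document = "man bites dog"
score = cosine_similarity(query, document)
print(score)  # approximately 1.0: word order and term relations are ignored
```

Because the vectors record only which terms occur, the two sentences are indistinguishable to this model even though their meanings are reversed.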
Unlike conventional IR systems, which treat queries and documents as equivalent entities, the proposed method places them on unequal levels. Next, drawing on graph theory, the paper proposes a graph similarity measure based on the general maximum common subgraph, which is used to calculate the similarity between the query graph model and the text graph model. Then, using the similarities between the query and the semantic units, and taking into account that semantic units at different positions may differ in importance, the paper proposes scoring methods for calculating the similarity between the document and the query. This score is the basis for ranking and retrieving documents. Finally, experiments on Chinese and English document collections evaluate the quality of the results under the different scoring methods and compare the method with existing methods and technology. The experimental results show that the proposed method yields document retrieval results of better quality.
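The graph similarity and position-aware scoring steps described above can be sketched under strong simplifying assumptions. This is not the thesis's algorithm: it assumes each term is a uniquely labeled node, so the maximum common subgraph reduces to the shared nodes and shared labeled edges; it normalizes by the query graph size to reflect the unequal treatment of query and document; and the position weights (`first_weight`, `other_weight`) and the hand-written dependency edges are hypothetical values chosen for illustration.

```python
def nodes_of(edges):
    """Collect the node labels that appear in a set of (head, relation, dependent) edges."""
    return {node for head, _rel, dep in edges for node in (head, dep)}


def common_subgraph_similarity(query_edges, unit_edges):
    """Similarity of a query graph to a semantic-unit graph.

    Assumption: nodes are uniquely labeled by term, so the common subgraph
    is simply the shared nodes plus the shared labeled edges. The result is
    normalized by the size of the query graph only (asymmetric by design).
    """
    query_nodes = nodes_of(query_edges)
    size = len(query_nodes) + len(query_edges)
    if size == 0:
        return 0.0
    common_nodes = query_nodes & nodes_of(unit_edges)
    common_edges = query_edges & unit_edges
    return (len(common_nodes) + len(common_edges)) / size


def document_score(query_edges, unit_graphs, first_weight=1.0, other_weight=0.8):
    """Score a document as the best position-weighted unit similarity.

    Assumption: the first semantic unit counts more than later ones; both
    weights are hypothetical, not values taken from the thesis.
    """
    return max(
        (
            (first_weight if i == 0 else other_weight)
            * common_subgraph_similarity(query_edges, unit)
            for i, unit in enumerate(unit_graphs)
        ),
        default=0.0,
    )


# Hand-written dependency-style edges: (head, relation, dependent).
query = {("bites", "nsubj", "dog"), ("bites", "obj", "man")}          # "dog bites man"
same = {("bites", "nsubj", "dog"), ("bites", "obj", "man")}           # same meaning
reversed_unit = {("bites", "nsubj", "man"), ("bites", "obj", "dog")}  # "man bites dog"

print(common_subgraph_similarity(query, same))
print(common_subgraph_similarity(query, reversed_unit))
print(document_score(query, [reversed_unit, same]))
```

In this toy setting the reversed sentence shares all three terms with the query but none of its labeled edges, so it scores lower than the unit that preserves the relations, which is the distinction a bag-of-words model cannot make.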
Keywords/Search Tags: document retrieval, graph, text representation model, similarity