
Research on Key Technologies for Scientific Literature Search Based on Similarity Graphs

Posted on: 2012-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Zhu
GTID: 2218330368494605
Subject: Computer software and theory
Abstract/Summary:
Science and technology build on what came before: no researcher can make real progress without drawing on earlier experience and results. In recent years, publication cycles in computer science, biology, chemistry, medicine, and other fields have been getting shorter, and the volume of scientific literature has been growing at an accelerating rate. More than 70 million documents can currently be found in CNKI alone, with 28,000 documents published per day. Given this growth, a pressing question is how to grasp the major current research results, draw analogies with and build on work from earlier or neighboring research areas to promote scientific discovery and technological innovation, and search for similar literature quickly and accurately.

With the wide application of bioinformatics, cheminformatics, and social network analysis, graphs have become increasingly important for modeling complex structures such as protein structures and neural networks. Many real-world problems in technology, commerce, economics, biology, and other fields can be abstracted as similarity search problems on a graph. Based on this idea, this dissertation proposes a document topology model, comprising an undirected document topology and a directed document topology, and converts the document similarity search problem into a graph search problem. The dissertation makes the following two contributions.

First, based on the undirected document topology, a new method for assessing document similarity is proposed. It combines document content with analysis of the references between documents and applies the inclusion-exclusion principle to compute the similarity between documents. A document similarity search algorithm, Hub-N, is also proposed, based on the theory of Erdős.
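The abstract does not give the formula behind the similarity measure, so the following is only a minimal sketch of one possible reading, not the dissertation's method. It assumes each document is reduced to a set of content terms and a set of cited references, scores each pair of sets by Jaccard overlap (where the union size comes from inclusion-exclusion), and merges the two scores with the rule s = c + r - c*r. The function names and the merge rule are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets, in [0, 1]."""
    inter = len(a & b)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0


def combined_similarity(terms_a: set, terms_b: set,
                        refs_a: set, refs_b: set) -> float:
    """Merge content overlap and reference overlap into one score.

    Assumption (not from the source): the two scores are merged like
    two overlapping events, again by inclusion-exclusion.
    """
    content = jaccard(terms_a, terms_b)
    citation = jaccard(refs_a, refs_b)
    return content + citation - content * citation


score = combined_similarity(
    {"graph", "search", "similarity"}, {"graph", "search", "index"},
    {"p1", "p2"}, {"p2", "p3"},
)
print(round(score, 3))
```

Under this merge rule the combined score is never lower than either input score and stays within [0, 1], so a pair of documents that is strong on either content or references still ranks highly.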
The Hub-N algorithm combines breadth-first search with a pruning strategy that narrows the range of documents scanned, which improves search efficiency; experiments demonstrate its effectiveness and feasibility. Hub-N is also applicable to similarity search in other fields.

Second, the feasibility, advantages, and disadvantages of applying the PageRank algorithm to similarity search over scientific documents are analyzed, and an improved PageRank algorithm, IPR, is proposed to address PageRank's shortcomings. IPR is based on the directed document topology. It combines document content with analysis of the references between documents, addresses both relevance needs and authority needs from the perspectives of content analysis and citation analysis, and integrates the two when computing the similarity between documents to improve the accuracy of search results. Finally, experiments demonstrate the effectiveness and feasibility of the IPR algorithm.
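For context on the baseline that IPR modifies, here is a minimal sketch of standard PageRank computed by power iteration over a directed citation graph. It is not the IPR algorithm, whose details the abstract does not give. The function name, the input format (a dict mapping each document to the documents it cites), and the damping factor of 0.85 are illustrative assumptions.

```python
def pagerank(cites: dict[str, list[str]],
             damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Standard PageRank by power iteration (baseline, not IPR)."""
    # Every document that cites or is cited is a node.
    nodes = set(cites) | {d for targets in cites.values() for d in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank held by documents that cite nothing is spread evenly.
        dangling = sum(rank[v] for v in nodes if not cites.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in cites.items():
            if targets:
                # Each citing document splits its rank among its citations.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        rank = new
    return rank


ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
print(max(ranks, key=ranks.get))  # the most-cited document, "C"
```

The score depends only on the citation structure, so a heavily cited document ranks highly regardless of whether its content matches the query. This is consistent with the abstract's point that relevance and authority both need to be taken into account.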
Keywords/Search Tags: Similarity, Similarity Search, Document topology, Hub-N Algorithm, IPR Algorithm