Research on Key Technologies for Similarity Graph-Based Scientific Literature Search

Posted on:2012-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:G Zhu

Full Text:PDF

GTID:2218330368494605

Subject:Computer software and theory

Abstract/Summary:

Science and technology are cumulative endeavors: no researcher can make real progress without building on earlier experience and results. In recent years, publication cycles in computer science, biology, chemistry, medicine and other fields have been getting shorter, and the volume of scientific literature has been growing at an accelerating rate. Currently, more than 70 million documents can be found in CNKI alone, with 28,000 documents published per day. Faced with these growing resources, it has become a pressing question how to grasp the major current research achievements, how to draw analogies with and think innovatively about earlier work and the work of others in order to promote scientific discovery and technological innovation, and how to search for similar literature quickly and accurately.

With the wide application of bioinformatics, chemical informatics, and social network analysis, graphs are becoming increasingly important for modeling complex structures such as protein structures and neural networks. Many real-world problems in technology, commerce, economics, biology and other fields can be abstracted as similarity search problems on a graph. Based on this idea, this dissertation proposes a document topology model, comprising an undirected document topology and a directed document topology, and converts the document similarity search problem into a graph search problem. The dissertation covers the following two aspects.

Firstly, based on the undirected document topology, a new method of assessing document similarity is proposed. It combines the contents of documents with an analysis of the references between documents, and applies the principle of inclusion and exclusion to calculate the similarity between documents. A document similarity search algorithm, Hub-N, is also proposed, based on the theory of Erdős. The algorithm combines breadth-first search with a pruning strategy, reducing the range of scanned documents to improve search efficiency, and experiments demonstrate its effectiveness and feasibility. The Hub-N algorithm is also applicable to similarity search in other fields.

Secondly, the feasibility, advantages and disadvantages of applying the PageRank algorithm to similarity search over scientific documents are analyzed, and an improved PageRank algorithm, IPR, is proposed to address the disadvantages of PageRank. The IPR algorithm is based on the directed document topology and combines the contents of documents with an analysis of the references between documents. It addresses both relevance and authority requirements from the perspectives of content analysis and citation analysis, and computes an integrated similarity between documents to improve the accuracy of search results. Finally, experiments demonstrate the effectiveness and feasibility of the IPR algorithm.
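The abstract gives no formulas or pseudocode for Hub-N, so the following is only a rough Python sketch of the general idea behind the first contribution: a similarity score that combines content and reference analysis, used inside a breadth-first search with pruning over an undirected document graph. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names, the use of Jaccard overlap for both the content and the reference scores, the inclusion-exclusion combination `c + r - c * r`, the hop limit `max_hops`, and the pruning `threshold`.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(content_sim, citation_sim):
    """Combine two scores in [0, 1] in inclusion-exclusion style.

    Assumption: the two scores are merged like the probability of a
    union of two events. The thesis may define this differently.
    """
    return content_sim + citation_sim - content_sim * citation_sim


def bounded_bfs_search(graph, terms, query, max_hops=2, threshold=0.3):
    """Breadth-first similarity search with pruning (hypothetical sketch).

    graph: dict mapping a document id to the set of its neighbours
           (undirected reference links).
    terms: dict mapping a document id to its set of keywords.
    Only documents within `max_hops` links of `query` are scanned, and a
    document scoring below `threshold` is neither reported nor expanded.
    Returns (document id, score) pairs sorted by descending score.
    """
    seen = {query}
    frontier = deque([(query, 0)])
    results = []
    while frontier:
        doc, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop limit reached: do not expand further
        for nxt in sorted(graph.get(doc, ())):
            if nxt in seen:
                continue
            seen.add(nxt)
            content = jaccard(terms.get(query, ()), terms.get(nxt, ()))
            citation = jaccard(graph.get(query, ()), graph.get(nxt, ()))
            score = combined_similarity(content, citation)
            if score < threshold:
                continue  # prune this branch
            results.append((nxt, score))
            frontier.append((nxt, hops + 1))
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results
```

Calling `bounded_bfs_search(graph, terms, "A")` scans only the documents reachable from `"A"` within two links, which is how this sketch narrows the range of scanned documents. Raising `threshold` or lowering `max_hops` trades recall for speed.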

Related items

1	Research On Semantic Similarity Computation And Applications
2	Web Document Automatic Classification Based On Keywords
3	Research On Improvement Of Search Algorithm Based On Web Similarity
4	Study On Similarity Search For Textual And Spatial Data
5	Subjective And Objective Combination Of Semantic Similarity Algorithm And Its Application
6	Research On Similarity Search Technique For Big Data
7	Design And Implement Of Duplicate Document Detection Based On Similarity Estimation
8	Weighted Slope One Algorithm Optimization Based On User Similarity And Item Similarity
9	Occlusion Element Matching Algorithm Based On The Best Similarity Point Pair
10	Research On Traffic Terminology Similarity Matchment Based On Topic Vertical Search Engine