
Algorithm Research on Network Data Collection for Copyright Services

Posted on: 2014-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: J L Xia
Full Text: PDF
GTID: 2248330395998326
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, network communication has become fast and inexpensive, and digital works are easily transmitted and spread online, which has brought unprecedented challenges to digital rights management. Digital works reproduced on the Internet without authorization seriously damage both the moral and the economic interests of their owners. How to effectively detect digital works published on the network without the author's authorization is therefore an important task in network monitoring for copyright protection. Because general-purpose search engines collect data on a very large scale, their results are often duplicated, and many returned pages are invalid or irrelevant; research on copyright-oriented network data collection algorithms therefore has practical significance.

This thesis first introduces the composition and working principles of general search engines, and then describes key technologies such as Web crawlers and information extraction. It discusses URL filtering and crawl search strategies in detail, covering hash-based filtering, the embedded database Berkeley DB, and search strategies based on page content and on URL link analysis; the advantages and disadvantages of these algorithms are compared and analyzed. The Bloom Filter consumes little memory and runs fast, while Berkeley DB-based URL filtering has the advantage of stability. Exploiting the shallow link depth and relatively stable page format of digital music works, the thesis designs a new URL filtering algorithm: when the user specifies target websites it applies Bloom Filter filtering, and otherwise it uses Berkeley DB with MD5-compressed URL addresses, which reduces the required storage space.
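The hybrid URL filter described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the Bloom filter parameters are arbitrary, and a plain Python dict stands in for Berkeley DB (whose bindings are not in the standard library); `HybridURLFilter` and `seen_before` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, url):
        # Derive k probe positions from salted MD5 digests of the URL.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._probes(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(url))


class HybridURLFilter:
    """Bloom filter for user-specified sites; otherwise a persistent
    key-value store keyed by the 16-byte MD5 digest of the URL
    (a dict stands in here for Berkeley DB)."""

    def __init__(self, preferred_hosts):
        self.preferred = set(preferred_hosts)
        self.bloom = BloomFilter()
        self.store = {}  # stand-in for Berkeley DB

    def seen_before(self, url):
        # Crude host extraction for the sketch; a real crawler would
        # use urllib.parse.urlsplit.
        host = url.split("/")[2] if "//" in url else url
        if host in self.preferred:
            if url in self.bloom:
                return True
            self.bloom.add(url)
            return False
        # MD5 compresses the URL to a fixed 16-byte key, saving storage.
        key = hashlib.md5(url.encode()).digest()
        if key in self.store:
            return True
        self.store[key] = True
        return False
```

The design trade-off mirrors the abstract: the Bloom filter is fast and compact but admits false positives, which is acceptable for a bounded set of user-specified sites, while the MD5-keyed store gives exact deduplication at a fixed 16 bytes per URL.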
To address the "myopia" problem of content-based evaluation algorithms and the "topic drift" problem of link-based evaluation algorithms, the thesis combines the strengths of the Shark Search algorithm and the HITS algorithm so that the two methods reinforce each other, taking both topical content and link structure into account, and on this basis proposes a new focused crawling strategy. Building on the open-source Heritrix framework, the thesis implements a vertical search engine that uses the proposed URL filtering algorithm and search strategy; experimental results show that they improve both the speed and the accuracy of data acquisition. The innovations of the thesis are a new URL filtering algorithm and a new search strategy based on the combination of content and link analysis, together with experimental evaluation of the algorithms' efficiency.
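The idea of blending a content score with a link score to rank the crawl frontier can be sketched as below. The weight `alpha`, the linear blend, and the toy term-overlap relevance are illustrative assumptions, not values or formulas from the thesis (Shark Search uses a term-vector similarity and HITS computes authority scores iteratively).

```python
import heapq


def content_relevance(page_terms, topic_terms):
    """Toy content score: fraction of topic terms present on the page
    (a stand-in for Shark Search's vector-space similarity)."""
    topic = set(topic_terms)
    if not topic:
        return 0.0
    return len(set(page_terms) & topic) / len(topic)


def combined_priority(content_score, authority_score, alpha=0.6):
    """Blend content relevance with a HITS-style authority score;
    alpha is an illustrative weight balancing the two signals."""
    return alpha * content_score + (1 - alpha) * authority_score


class Frontier:
    """Max-priority URL frontier ordered by the combined score."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # Negate the score because heapq is a min-heap.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

Ranking on the blended score lets link authority pull the crawler out of locally relevant but dead-end regions (the content-only "myopia"), while the content term keeps authoritative off-topic hubs from dominating the queue (the link-only "drift").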
Keywords/Search Tags: search engine, web crawler, Heritrix, search strategy, URL filter