
Algorithm Research on Network Data Collection for Copyright Services

Posted on: 2014-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: J L Xia
Full Text: PDF
GTID: 2248330395998326
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, network communication has become fast and inexpensive, and digital works are easily transmitted and spread online, which has brought unprecedented challenges to digital rights management. Digital works reproduced on the Internet without authorization seriously damage both the moral and the economic interests of their owners. How to effectively detect digital works published on the network without the author's authorization is therefore an important task in network monitoring for copyright protection. Because general-purpose search engines collect data on a very large scale, their results are often duplicated, and many returned pages are invalid or irrelevant; research on copyright-oriented network data collection algorithms therefore has practical significance.

This thesis first introduces the composition and working principles of general search engines, and then describes key technologies such as Web crawlers and information extraction. It discusses URL filtering and crawl search strategies in detail, covering hash-based filtering, the embedded database Berkeley DB, and search strategies based on page content and on URL link analysis; the advantages and disadvantages of these algorithms are compared and analyzed. The Bloom Filter consumes little memory and runs fast, while Berkeley DB-based URL filtering has the advantage of stability. Exploiting the shallow link depth and relatively stable page format of digital music works, the thesis designs a new URL filtering algorithm: when the user specifies target websites it applies Bloom Filter filtering, and otherwise it uses Berkeley DB with MD5-compressed URL addresses, which reduces the required storage space.
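The hybrid URL filter described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the Bloom filter parameters are arbitrary, and a plain Python dict stands in for Berkeley DB (whose bindings are not in the standard library); `HybridURLFilter` and `seen_before` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, url):
        # Derive k probe positions from salted MD5 digests of the URL.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._probes(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(url))


class HybridURLFilter:
    """Bloom filter for user-specified sites; otherwise a persistent
    key-value store keyed by the 16-byte MD5 digest of the URL
    (a dict stands in here for Berkeley DB)."""

    def __init__(self, preferred_hosts):
        self.preferred = set(preferred_hosts)
        self.bloom = BloomFilter()
        self.store = {}  # stand-in for Berkeley DB

    def seen_before(self, url):
        # Crude host extraction for the sketch; a real crawler would
        # use urllib.parse.urlsplit.
        host = url.split("/")[2] if "//" in url else url
        if host in self.preferred:
            if url in self.bloom:
                return True
            self.bloom.add(url)
            return False
        # MD5 compresses the URL to a fixed 16-byte key, saving storage.
        key = hashlib.md5(url.encode()).digest()
        if key in self.store:
            return True
        self.store[key] = True
        return False
```

The design trade-off mirrors the abstract: the Bloom filter is fast and compact but admits false positives, which is acceptable for a bounded set of user-specified sites, while the MD5-keyed store gives exact deduplication at a fixed 16 bytes per URL.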
To address the "myopia" problem of content-based evaluation algorithms and the "topic drift" problem of link-based evaluation algorithms, the thesis combines the strengths of the Shark Search algorithm and the HITS algorithm so that the two methods reinforce each other, taking both topical content and link structure into account, and on this basis proposes a new focused crawling strategy. Building on the open-source Heritrix framework, the thesis implements a vertical search engine that uses the proposed URL filtering algorithm and search strategy; experimental results show that they improve both the speed and the accuracy of data acquisition. The innovations of the thesis are a new URL filtering algorithm and a new search strategy based on the combination of content and link analysis, together with experimental evaluation of the algorithms' efficiency.
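The idea of blending a content score with a link score to rank the crawl frontier can be sketched as below. The weight `alpha`, the linear blend, and the toy term-overlap relevance are illustrative assumptions, not values or formulas from the thesis (Shark Search uses a term-vector similarity and HITS computes authority scores iteratively).

```python
import heapq


def content_relevance(page_terms, topic_terms):
    """Toy content score: fraction of topic terms present on the page
    (a stand-in for Shark Search's vector-space similarity)."""
    topic = set(topic_terms)
    if not topic:
        return 0.0
    return len(set(page_terms) & topic) / len(topic)


def combined_priority(content_score, authority_score, alpha=0.6):
    """Blend content relevance with a HITS-style authority score;
    alpha is an illustrative weight balancing the two signals."""
    return alpha * content_score + (1 - alpha) * authority_score


class Frontier:
    """Max-priority URL frontier ordered by the combined score."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # Negate the score because heapq is a min-heap.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

Ranking on the blended score lets link authority pull the crawler out of locally relevant but dead-end regions (the content-only "myopia"), while the content term keeps authoritative off-topic hubs from dominating the queue (the link-only "drift").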
Keywords/Search Tags: search engine, web crawler, Heritrix, search strategy, URL filter