
The Research And Implementation Of Similarity Algorithm For Web Pages

Posted on: 2006-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhang
GTID: 2168360155453047
Subject: Computer software and theory
Abstract/Summary:
The Web contains many similar pages and documents. Finding similar web pages efficiently and accurately is important for improving a search engine's quality of service, and it is also important for topic crawlers. Similarity detection can also be used for plagiarism detection and near-replica detection; in this paper, we discuss similarity detection for topic crawlers.

A crawler, also called a robot, is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages so that users can get search results from a local database. However, search results often contain a great deal of repetition: similar documents, copies, and different editions or versions of the same work. Similarity detection is used to find these documents, and a few methods exist so far. One is the sif tool, created by Udi Manber at the University of Arizona in 1993, which finds similar files in a large file system. Brin and Garcia-Molina of Stanford University proposed the text copy detection mechanism COPS (copy protection system). Garcia-Molina and Shivakumar also proposed the SCAM (Stanford copy analysis method) experimental prototype for finding intellectual property violations, which is an improved version of COPS. SCAM is based on the VSM (vector space model, a popular model in the IR domain) and uses word occurrence frequencies to identify similar documents. Our method performs better than the above approaches at web page similarity detection.

Web pages contain "noise", which makes it difficult to extract useful information from them. We propose a noise elimination technique and extract feature information from web pages; this information includes the hyperlinks and HTML tags, etc. We evaluate chunks based on web page structure...
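The vector space model idea mentioned above, comparing documents through their word occurrence frequencies, can be illustrated with a small sketch. This is not the SCAM measure or the thesis's own method; it is plain cosine similarity over raw term counts, and the function names and the tokenization rule are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count how often each word occurs.

    The tokenizer (runs of ASCII letters and digits) is an assumption
    made for this sketch, not something taken from the thesis.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Dot product over the words of text_a; a Counter returns 0 for
    # words that do not occur in text_b.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two pages whose score exceeds some chosen threshold would then be treated as candidates for similar or near-replica documents; the threshold itself would have to be tuned on real data.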
Keywords/Search Tags:Implementation