
Research on Duplicate Removal in Large-Scale Text Collections

Posted on: 2010-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: B Han
Full Text: PDF
GTID: 2178360272470156
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, network information sharing brings people not only great convenience but also a large amount of duplicated information. Removing duplicated web pages improves the precision of search engines, reduces the storage space needed for mass data, and thereby improves the user experience. Detecting duplicated information also helps curb the copying and reposting of articles and protects authors' originality. In short, detecting repeated information in web pages is a significant research subject.

This thesis surveys the traditional text-based methods of duplicate removal in detail and analyses the efficiency, precision, and recall of the different approaches.

Removing duplicated web pages and detecting copied scientific papers have become prominent problems. Existing methods are based mainly on keywords and semantic fingerprints; they require heavy processing over many documents and take little account of web page noise, so in practice they produce many false judgements on real web pages. Moreover, because a search engine must remove duplicates over a very large collection of web pages, improving efficiency is also an essential research problem.

The thesis proposes one method for each of the two types of duplication above, and studies duplicate elimination and its application systematically from both theoretical and practical angles, as follows:

(1) Large-scale duplicate web page removal. Duplicated pages are mainly reposted copies, and the bottleneck in processing replicated web collections is reducing the effect of page noise and improving efficiency. Because network noise varies widely and rule-based noise removal does not extend well, the thesis discusses a general noise removal method based on repeated nodes. To improve the efficiency of processing replicated web collections, an extraction algorithm for website structure based on the largest blocks is proposed and compared with the traditional method. Finally, the efficiency of large-scale duplicate removal is verified by combining Bloom-filter lookup of page characteristic codes with a B-Tree algorithm (a simplified sketch of this combination is given after the abstract).

(2) Copy detection for scientific papers. Plagiarism in scientific papers is serious, yet there is little academic research on detecting it automatically, so this thesis makes an attempt. Because a scientific paper is rarely duplicated in full, signature-based methods do not work well. The thesis groups scientific papers by topic words expanded through bootstrapping, then proposes a weighted similarity calculation over a sliding window within chapter divisions, and presents the results with intuitive similarity curves, achieving good results (a simplified sketch of the sliding-window comparison is also given after the abstract).
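To make the large-scale deduplication step in (1) concrete, the following is a minimal Python sketch of the general idea: each page is reduced to a characteristic code (here simply an MD5 of its normalized text, an assumption made for illustration), a Bloom filter answers "definitely new" quickly, and an exact in-memory set stands in for the B-Tree index used to confirm candidate duplicates. All names and parameters are illustrative, not the thesis's actual implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over page fingerprints (illustrative only)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def fingerprint(page_text: str) -> str:
    """Characteristic code of a page: an MD5 of its normalized text (assumption)."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.md5(normalized.encode()).hexdigest()


def deduplicate(pages):
    """Keep only pages whose fingerprint has not been seen before.

    The Bloom filter gives a fast "definitely new" answer; the exact set
    (standing in for a B-Tree index) confirms candidate hits.
    """
    bloom, seen, unique = BloomFilter(), set(), []
    for page in pages:
        code = fingerprint(page)
        if bloom.might_contain(code) and code in seen:
            continue  # confirmed duplicate
        bloom.add(code)
        seen.add(code)
        unique.append(page)
    return unique
```

In a real system the filter size and hash count would be tuned to the collection size and acceptable false-positive rate, and the confirming index would be disk-backed rather than an in-memory set.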
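For the paper copy detection in (2), the sketch below illustrates one plausible reading of the chapter-wise, sliding-window, weighted similarity calculation: each chapter is split into overlapping word windows, windows are compared with cosine similarity, the best match per chapter forms a similarity curve, and the chapters are combined with weights. The uniform weights and window sizes are assumptions; the abstract does not specify the thesis's actual values.

```python
from collections import Counter


def window_vectors(chapter_words, window=200, step=100):
    """Split a chapter's word list into overlapping sliding windows (sizes are illustrative)."""
    if len(chapter_words) <= window:
        return [Counter(chapter_words)]
    return [Counter(chapter_words[i:i + window])
            for i in range(0, len(chapter_words) - window + 1, step)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    denom = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / denom if denom else 0.0


def chapter_similarity(query_chapters, candidate_chapters, chapter_weights=None):
    """Weighted, chapter-by-chapter similarity between two papers.

    Each chapter of the query paper is compared window-by-window against the
    corresponding candidate chapter; the best window match per chapter forms a
    similarity curve, and the chapters are combined with weights (uniform here,
    as an assumption).
    """
    n = min(len(query_chapters), len(candidate_chapters))
    if n == 0:
        return 0.0, []
    weights = chapter_weights or [1.0 / n] * n
    curve = []
    for q, c in zip(query_chapters[:n], candidate_chapters[:n]):
        q_windows, c_windows = window_vectors(q), window_vectors(c)
        best = max((cosine(qw, cw) for qw in q_windows for cw in c_windows), default=0.0)
        curve.append(best)
    overall = sum(w * s for w, s in zip(weights, curve))
    return overall, curve
```

Plotting the per-chapter scores in `curve` gives the kind of similarity curve the abstract describes for presenting the comparison results intuitively.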
Keywords/Search Tags: duplicate detection, noise removal, copy detection