Research Of Deleting Duplicate Web Pages On Campus Search Engine

Posted on: 2013-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhang
GTID: 2248330392454330
Subject: Computer application technology
Abstract/Summary:
With the rapid development of campus network construction, campus network information resources are growing quickly. This makes it difficult for teachers and students to locate valuable information quickly and wastes their time. Because of the campus network's own characteristics, mature general-purpose search engines are not fully applicable to it, and retrieval results contain too many duplicate web pages because messages are copied from other websites. By analyzing the characteristics of campus network pages and the existing duplicate detection technology, this thesis addresses the problem of retrieval results containing too many duplicate pages. For different types of duplicate pages, index-time and real-time (search-time) duplicate detection strategies are used to build a campus network search engine. The work is as follows:

Firstly, web page duplicate detection algorithms are researched and analyzed. The causes, definition, and types of web page noise are analyzed first, and a related-content merging technique is used to remove noise and extract the main content of the web pages. Then, after a study of Chinese word segmentation technologies, the Paoding word segmenter is integrated into Nutch as secondary development: the Nutch source is modified so that Chinese word segmentation is supported.

Secondly, the duplicate detection algorithm used in the index phase is researched and improved. After an analysis of existing algorithms, a duplicate detection algorithm based on the longest paragraphs and fingerprints is proposed for fully or partially duplicated web pages. First, duplicate pages are removed from the whole document set. Second, each remaining document, once filtered, is segmented into paragraphs. The paragraphs are sorted, and the first N paragraphs are fingerprinted; these fingerprint signatures serve as the features of the document. When the number of identical paragraphs shared by two documents exceeds a given threshold, the two documents are judged to be duplicates. Because only the first N paragraphs are extracted and the fingerprints are sorted, the computational complexity is greatly reduced. Experiments show that this method has higher duplicate detection accuracy.

Thirdly, duplicate pages that differ only slightly from the original pages are handled with an optimized Fourier transform duplicate detection algorithm. The algorithm maps each word of a document to a numerical fingerprint, so that each document can be expressed as a discrete sequence of numbers. The Fourier transform is applied to this sequence to obtain its Fourier coefficients, and the similarity of two sequences is obtained by comparing their first several coefficients. Experiments show that, for web pages that have been modified, the optimized Fourier transform algorithm can account for both recall rate and duplicate detection rate.

Nutch is used as the system development tool. By modifying the Nutch source code, duplicate detection is performed when web pages are indexed; at retrieval time, duplicate detection is implemented as a plug-in. The campus network search engine is designed and implemented on the basis of Nutch, and its development process and methods are described in detail. Finally, using pages crawled from the campus network by Nutch as the experimental data set, the proposed duplicate detection strategy is tested experimentally. The results show that combining the two algorithms improves both the accuracy of search results and duplicate detection, and that the campus network search engine runs effectively and stably.
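
As a rough illustration of the index-phase idea described in the abstract (fingerprint the first N sorted paragraphs, then count shared fingerprints against a threshold), a minimal Python sketch might look like the following. The abstract does not give the details, so several points here are assumptions: paragraphs are split on line breaks, "sorted" is taken to mean longest first, MD5 stands in for the fingerprint function, and the names `paragraph_fingerprints` and `is_duplicate` and the default values of `n` and `threshold` are hypothetical.

```python
import hashlib


def paragraph_fingerprints(text, n=3):
    """Return the fingerprints of the n longest paragraphs of a document."""
    # Assumption: paragraphs are separated by line breaks.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    # Assumption: sort longest first; ties break alphabetically so the
    # selection is deterministic.
    paragraphs.sort(key=lambda p: (-len(p), p))
    # Assumption: MD5 stands in for the thesis's fingerprint function.
    return {hashlib.md5(p.encode("utf-8")).hexdigest() for p in paragraphs[:n]}


def is_duplicate(doc_a, doc_b, n=3, threshold=1):
    """Judge two documents duplicates when the number of shared
    paragraph fingerprints exceeds the threshold."""
    shared = paragraph_fingerprints(doc_a, n) & paragraph_fingerprints(doc_b, n)
    return len(shared) > threshold


body = (
    "The library opens at eight on weekdays.\n"
    "Exam results are published on the campus portal.\n"
    "Contact the office for details about enrollment."
)
original = "Campus news\n" + body
reposted = "Reposted notice\n" + body
unrelated = (
    "The cafeteria menu changes every Monday morning.\n"
    "Bus schedules are posted near the main gate."
)

print(is_duplicate(original, reposted))
print(is_duplicate(original, unrelated))
```

Only N fingerprints are kept per document, so comparing two documents is a small set intersection rather than a full text comparison, which is consistent with the reduced computational cost the abstract reports.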
Keywords/Search Tags: Duplicate Detection Algorithm, Fingerprint Signature, Fourier Transform, Campus Network, Search Engine, Nutch
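
The Fourier transform detection described in the abstract (map each word to a number, transform the sequence, compare the first several coefficients) could be sketched in Python roughly as follows. This is not the thesis's optimized algorithm, whose details the abstract does not give. CRC32 scaled to [0, 1) as the word fingerprint, normalization of coefficients by sequence length, Euclidean distance as the comparison, and the function names and the default `k` are all assumptions made for illustration.

```python
import cmath
import zlib


def word_sequence(text):
    """Map each word to a number in [0, 1)."""
    # Assumption: CRC32 stands in for the thesis's word fingerprint.
    return [zlib.crc32(w.encode("utf-8")) / 2**32 for w in text.lower().split()]


def first_coefficients(seq, k=4):
    """Compute the first k discrete Fourier coefficients of seq."""
    n = len(seq)
    if n == 0:
        return [0j] * k
    # Assumption: dividing by n keeps documents of different
    # lengths on a comparable scale.
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq)) / n
        for f in range(k)
    ]


def fourier_distance(doc_a, doc_b, k=4):
    """Euclidean distance between the first k coefficients of two documents."""
    ca = first_coefficients(word_sequence(doc_a), k)
    cb = first_coefficients(word_sequence(doc_b), k)
    return sum(abs(a - b) ** 2 for a, b in zip(ca, cb)) ** 0.5


base = "the campus library opens at nine and closes at five on every weekday"
edited = base.replace("nine", "noon")

print(fourier_distance(base, base))
print(fourier_distance(base, edited))
```

In this sketch, changing a single word in a document of n words moves each coefficient by less than 1/n, so a lightly edited copy stays close to the original. A real system would choose `k` and a distance threshold experimentally.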