
Research and Implementation of Web Page Deduplication Techniques

Posted on: 2013-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Qi
Full Text: PDF
GTID: 2248330374486465
Subject: Software engineering
Abstract/Summary:
With the development and widespread use of the Internet, the amount of information on the Web has grown explosively, and the Internet has become people's main source of information. Search engine technology emerged to help users find the information they need quickly; it makes searching convenient, saves users' time, and has become a popular online service.

However, according to a CNNIC report, search results containing too many duplicate web pages are the main problem users encounter when using search engines. Statistics suggest that about 30% of the web pages on the Internet are duplicates, most of them produced by reposting. Duplicate web pages harm search engines in several ways: they waste storage space, increase processing time, and lower retrieval quality because result lists contain many near-identical pages. Duplicate web-page detection is therefore an indispensable task for a search engine.

This thesis studies the origin and current state of duplicate web-page detection. The research work is as follows:

(1) High-quality duplicate detection must operate on the textual content of a page. After analyzing the internal structure of web pages, a text-extraction algorithm based on the DOM structure is proposed. The page text is extracted through blocking, merging, and filtering, and serves as the object of duplicate detection. Experiments show that the algorithm is highly accurate.

(2) An online duplicate web-page detection system is implemented. It provides two detection modes, digest detection and full-text detection, and improves the retrieval quality of search-engine results.

(3) Two duplicate web-page detection algorithms are proposed: one based on word frequency and one based on segmentation.

(4) The word-frequency algorithm extracts the word-frequency information of the page text as the main characteristic string and its additional information as a subsidiary characteristic string. The characteristic strings are compared with an edit-distance tree, which reduces the number of string comparisons, so the algorithm is more efficient than the conventional approach.

(5) The segmentation-based algorithm splits the page text into segments by paragraph, extracts the longest sentence of each segment as its characteristic string, and then compares these characteristic strings with a hash algorithm. The algorithm is efficient and its accuracy is high (an illustrative sketch of this approach is given below).

(6) The two algorithms above are compared with a punctuation-based algorithm in terms of efficiency, accuracy, and recall, and the advantages and disadvantages of each are analyzed.
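As an illustration of the segmentation-based approach in item (5), the following Python sketch splits extracted page text into paragraphs, keeps the longest sentence of each paragraph as that segment's characteristic string, hashes the characteristic strings, and flags two pages as duplicates when their fingerprints overlap sufficiently. The function names, the sentence-splitting rule, and the 0.8 overlap threshold are illustrative assumptions, not the exact design of the thesis.

```python
import hashlib
import re


def characteristic_strings(page_text):
    """Split the text into paragraphs and keep the longest sentence of each
    paragraph as that segment's characteristic string."""
    strings = []
    for paragraph in page_text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Split the paragraph into sentences on common end-of-sentence marks
        # (ASCII and CJK punctuation); this splitting rule is an assumption.
        sentences = [s.strip() for s in re.split(r"[.!?。！？]", paragraph) if s.strip()]
        if sentences:
            strings.append(max(sentences, key=len))
    return strings


def fingerprint(page_text):
    """Hash each characteristic string so pages can be compared by set
    overlap instead of full-text comparison."""
    return {hashlib.md5(s.encode("utf-8")).hexdigest()
            for s in characteristic_strings(page_text)}


def is_duplicate(text_a, text_b, threshold=0.8):
    """Treat two pages as duplicates when the Jaccard overlap of their
    fingerprints reaches the threshold; 0.8 is an assumed value."""
    fp_a, fp_b = fingerprint(text_a), fingerprint(text_b)
    if not fp_a or not fp_b:
        return False
    overlap = len(fp_a & fp_b) / len(fp_a | fp_b)
    return overlap >= threshold
```

Because only one sentence per paragraph is hashed, reposted pages with minor edits outside those sentences still collide on most fingerprints, which is what makes this style of detection both fast and reasonably accurate.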
Keywords/Search Tags: duplicate web-page detection, word frequency, edit distance, paragraph segmentation, characteristic string