Research On Duplicate Webpage Detection Technology In Search Engine

Posted on: 2012-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: R Tang
Full Text: PDF
GTID: 2218330344450972
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, everyone can access a vast amount of information. Retrieving the correct, intended information in a timely manner has become a critical problem, and search engines are the tool that provides useful query results. In practice, however, heavily duplicated web pages routinely appear in query results. This duplication not only lowers the query efficiency of a search engine and wastes the memory used to buffer query results, but also significantly degrades the user experience. Duplicate detection is therefore a desirable technique for improving the query efficiency and service quality of search engines, and it is the main topic of this thesis.

The thesis takes the detection of duplicate web pages in search engines as its research background. It first gives a detailed survey of prior research, summarizing and comparing different algorithms, and then proposes a semantics-based duplicate web page detection algorithm. The algorithm mainly optimizes both feature extraction and feature comparison. More specifically, because words have rich semantic relations, text preprocessing adds the combination of synonyms and "liquid words", and feature extraction relies on the semantic context of each word. Compared with conventional algorithms, weighting factors for the position and length of keywords are added. Furthermore, a binary tree sort is used in feature comparison, which greatly improves efficiency over the conventional pairwise comparison algorithm. Two methods are also proposed for ordering the extracted keywords: SORTw(Kd), which sorts keywords by weight, and SORTa(Kd), which sorts them alphabetically.

To verify the efficacy and efficiency of the proposed algorithm, a simulation environment is built on Windows. Simulation results show that the proposed algorithm offers relatively high precision, a high recall rate, and low time and space complexity. It is promising for practical application after further improvement.
Keywords/Search Tags: Duplication Detection, Feature Extraction, Feature Comparisons, MD5, Binary Tree
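
The abstract does not give the algorithm's details, so the following Python is only a rough sketch of the general idea it describes, not the thesis's implementation. It extracts weighted keywords, orders them by weight or by letter, hashes the ordered list with MD5, and looks fingerprints up in a binary search tree instead of comparing pages pairwise. Everything concrete here is an assumption made for illustration: the weighting formula, the choice of five keywords, the use of MD5 as a fingerprint of the keyword list (inferred only from the keyword tags), and all function and class names. Synonym handling is omitted.

```python
import hashlib
import re
from collections import Counter


def extract_keywords(text, k=5):
    """Return the k highest-weighted words as (word, weight) pairs.

    Assumed weighting: frequency, scaled up for an early first
    occurrence (position) and for longer words (length).
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    weights = {}
    for pos, word in enumerate(words):
        if word not in weights:
            position_factor = 1.0 - pos / len(words)
            weights[word] = counts[word] * (1.0 + position_factor) * len(word)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


def sort_by_weight(keywords):
    """Order keywords by descending weight (in the spirit of SORTw)."""
    return [w for w, _ in sorted(keywords, key=lambda kv: (-kv[1], kv[0]))]


def sort_by_letter(keywords):
    """Order keywords alphabetically (in the spirit of SORTa)."""
    return sorted(w for w, _ in keywords)


def fingerprint(ordered_words):
    """MD5 hex digest of the ordered keyword list."""
    return hashlib.md5(" ".join(ordered_words).encode("utf-8")).hexdigest()


class FingerprintTree:
    """Unbalanced binary search tree keyed on fingerprint strings."""

    def __init__(self):
        self.root = None

    def insert(self, key):
        """Insert key; return True if it was already in the tree."""
        if self.root is None:
            self.root = [key, None, None]  # [key, left, right]
            return False
        node = self.root
        while True:
            if key == node[0]:
                return True
            side = 1 if key < node[0] else 2
            if node[side] is None:
                node[side] = [key, None, None]
                return False
            node = node[side]


def find_duplicates(docs, sorter=sort_by_letter):
    """Return indices of documents whose fingerprint was seen earlier."""
    tree = FingerprintTree()
    duplicates = []
    for i, text in enumerate(docs):
        key = fingerprint(sorter(extract_keywords(text)))
        if tree.insert(key):
            duplicates.append(i)
    return duplicates
```

With a tree lookup, each new page costs one descent through the tree rather than a comparison against every earlier page, which is the kind of saving over pairwise comparison that the abstract claims. An unbalanced tree like this one can still degrade to linear depth on unlucky insertion orders.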