
Research on a Keyword-Based Algorithm for Removing Duplicated Webpages in Search Engines

Posted on: 2016-01-01    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y He    Full Text: PDF
GTID: 2308330464969788    Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet, the volume of data on the Internet has grown exponentially. Compared with traditional media such as newspapers, television, and radio, the Internet is more efficient, real-time, intuitive, and open, and has become the new generation of mass media. The information on the Internet, however, is vast and complex, and it contains a large number of near-duplicate web pages. On the one hand, these near-duplicate pages cause considerable inconvenience to Internet users and make information search more difficult; on the other hand, they reduce the efficiency of competitive intelligence systems and search engines when collecting and analyzing Web information. Removing near-duplicate web pages is therefore a highly practical research topic.

Building on an analysis of current near-duplicate removal techniques, this thesis describes the entire removal process in detail, including webpage preprocessing, webpage feature extraction, and similarity judgment. Webpage preprocessing consists of format normalization and extraction of the page's main content. Before the main content can be extracted, a DOM document tree is built; noise nodes, such as image nodes, form nodes, and script nodes, are then removed; finally, candidate subtree nodes are located and the noise index of each node is calculated.

The near-duplicate removal algorithm in this thesis is based on the SimHash algorithm with several improvements. To characterize the main content of a page more accurately, after word segmentation and removal of insignificant words, the algorithm uses word sequences generated by a single-step forward mechanism as page features, so that the relative positions of words are taken into account. To reduce the time and space complexity of the algorithm, an appropriate number of keywords is extracted from each page during feature-weight calculation and used as terms in an inverted index; a set of candidate documents is then retrieved through the inverted index, which reduces the number of page-fingerprint comparisons.

Finally, the thesis uses the open-source project Nutch as a platform. By modifying its source code and adding plugins, a Chinese word segmentation module and a near-duplicate removal module are integrated into Nutch, and the algorithm is evaluated on this platform. Experimental results show that the new algorithm improves precision and recall over standard SimHash, and that using the inverted index to reduce the number of fingerprint comparisons improves the operational stability of the new algorithm.
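The preprocessing step described above can be illustrated with a short sketch. The thesis does not name a particular HTML parser; the sketch below assumes the jsoup library purely for convenience, and the set of noise-node selectors is an illustrative simplification rather than the thesis's exact noise-index method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of the preprocessing step: build a DOM document tree and strip
// noise nodes (images, forms, scripts) before main-content extraction.
// jsoup is an assumed choice of parser; the thesis does not specify one.
public class NoiseStripper {
    public static String stripNoise(String html) {
        Document doc = Jsoup.parse(html);                 // build the DOM tree
        doc.select("img, form, script, style").remove();  // drop noise nodes
        return doc.body().text();                         // text left for content extraction
    }
}
```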
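For the fingerprinting step, the following is a minimal sketch of how weighted features are folded into a 64-bit SimHash fingerprint. The hash function (FNV-1a) and the interpretation of the "single-step forward" features as adjacent word pairs are assumptions for illustration, not the thesis's exact definitions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal SimHash sketch: folds weighted features into a 64-bit fingerprint.
public class SimHashSketch {

    // Combine weighted features into one 64-bit fingerprint.
    public static long fingerprint(Map<String, Integer> weightedFeatures) {
        int[] v = new int[64];
        for (Map.Entry<String, Integer> e : weightedFeatures.entrySet()) {
            long h = hash64(e.getKey());
            int w = e.getValue();
            for (int i = 0; i < 64; i++) {
                // Add the weight if bit i of the feature hash is set, else subtract it.
                if (((h >>> i) & 1L) == 1L) v[i] += w; else v[i] -= w;
            }
        }
        long fp = 0L;
        for (int i = 0; i < 64; i++) {
            if (v[i] > 0) fp |= (1L << i);
        }
        return fp;
    }

    // Hamming distance between fingerprints; a small distance signals near-duplicates.
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Simple 64-bit FNV-1a string hash (illustrative; any stable 64-bit hash works).
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Illustrative position-aware features: adjacent word pairs, one possible
    // reading of the "single-step forward" mechanism described in the abstract.
    public static Map<String, Integer> pairFeatures(List<String> words) {
        Map<String, Integer> features = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            features.merge(words.get(i) + "_" + words.get(i + 1), 1, Integer::sum);
        }
        return features;
    }
}
```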
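The inverted-index optimization can likewise be sketched: a new page's fingerprint is compared only against pages that share at least one top keyword with it, instead of against the whole collection. The Hamming-distance threshold and the keyword-selection policy here are assumed parameters, not values from the thesis.

```java
import java.util.*;

// Sketch of keyword-based candidate filtering: only pages sharing a top
// keyword with the new page receive a full fingerprint comparison.
public class CandidateFilter {
    // keyword -> ids of pages that list it among their top keywords
    private final Map<String, Set<Integer>> invertedIndex = new HashMap<>();
    private final Map<Integer, Long> fingerprints = new HashMap<>();

    public void addPage(int pageId, long fingerprint, List<String> topKeywords) {
        fingerprints.put(pageId, fingerprint);
        for (String kw : topKeywords) {
            invertedIndex.computeIfAbsent(kw, k -> new HashSet<>()).add(pageId);
        }
    }

    // Returns ids of likely near-duplicates; threshold is an assumed
    // Hamming-distance cutoff (e.g. 3 for 64-bit fingerprints).
    public List<Integer> findNearDuplicates(long fingerprint,
                                            List<String> topKeywords,
                                            int threshold) {
        Set<Integer> candidates = new HashSet<>();
        for (String kw : topKeywords) {
            candidates.addAll(invertedIndex.getOrDefault(kw, Collections.emptySet()));
        }
        List<Integer> duplicates = new ArrayList<>();
        for (int id : candidates) {
            if (Long.bitCount(fingerprints.get(id) ^ fingerprint) <= threshold) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }
}
```

Because each page contributes only a handful of index terms, the candidate set stays small relative to the whole collection, which is what reduces the fingerprint-comparison count described in the abstract.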
Keywords/Search Tags: Duplicated Webpages, Search Engine, Webpage Keywords, Nutch