Research And Implementation On Removing Duplicated WebPages Algorithm Of Search Engine

Posted on: 2010-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: X Lv
GTID: 2178360302966525
Subject: Computer application technology
Abstract/Summary:
Web mining addresses the need to gather information from large-scale knowledge sources. For search engines, web mining technology plays an important role in the development of third-generation search engines, and it pushes network information acquisition toward higher precision and greater intelligence. How to obtain useful information from vast amounts of content quickly and accurately is a problem for everyone who enjoys the convenience of the Internet. This thesis focuses on making good use of that information and giving users a more effective and efficient way to search, both of which are actively discussed problems in the field of search engine technology.

The thesis centers on the key technologies of a Chinese search engine system. It addresses the following points:

(1) An algorithm for eliminating duplicated web pages based on extracting the key words of each page is presented, building on an analysis of traditional duplicate-detection algorithms. Experiments indicate that the improved algorithm is better than the traditional ones in both processing speed and recall rate.

(2) The advantages and disadvantages of traditional ranking algorithms, namely the PageRank algorithm and the WTPR algorithm, are analyzed. Because Lucene does not take into account the relevance between documents or temporal factors, a WTPR algorithm module has been designed and implemented.

(3) A prototype experimental search engine system has been built on the Lucene toolkit, with the K-CC algorithm modules embedded. Experimental results show that the improved ranking algorithm and the algorithm for removing duplicated web pages outperform the original ones.
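The keyword-based duplicate elimination described in point (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual procedure, so everything below is an assumption made for illustration: the function names (`extract_keywords`, `jaccard`, `remove_duplicates`), the frequency-based keyword selection, the Jaccard comparison, the 0.8 threshold, and the small stop-word list are all hypothetical. The tokenizer handles only ASCII text; a Chinese search engine would need a word segmenter in its place.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}


def extract_keywords(text, top_k=5):
    """Return the top_k most frequent non-stop-word tokens as a frozenset."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return frozenset(token for token, _ in ranked[:top_k])


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_duplicates(pages, threshold=0.8, top_k=5):
    """Keep the first page of every group of near-duplicate pages.

    pages is an iterable of (page_id, text) pairs. A page is dropped when
    its keyword set reaches the similarity threshold against any page
    that has already been kept.
    """
    kept = []  # (page_id, keyword_set) for each page retained so far
    for page_id, text in pages:
        keywords = extract_keywords(text, top_k)
        if all(jaccard(keywords, other) < threshold for _, other in kept):
            kept.append((page_id, keywords))
    return [page_id for page_id, _ in kept]
```

Comparing short keyword sets rather than full page text keeps each comparison cheap, which is one plausible reason a keyword-based approach could be faster than full-text comparison. This sketch still compares every new page against every kept page, so a real system would likely index the signatures instead.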
Keywords/Search Tags: PageRank, Search Engine, Hyperlink, Duplicate Removal Algorithm, Web Page, Lucene