
Research And Realization Of Web Crawler And Results Clustering In Search Engine

Posted on: 2012-08-18   Degree: Master   Type: Thesis
Country: China   Candidate: P Liang   Full Text: PDF
GTID: 2178330338992006   Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of the Internet and of search technology, more and more people obtain all kinds of information through the Internet and search engines. A search engine needs its spider to crawl the network regularly, fetching new web pages and indexing their content for subsequent retrieval, so the spider's crawling effectiveness directly influences the quality of the search results: the more pages that are indexed and the shorter the update cycle, the higher the chance that the results are relevant to the user's query, and the more complete the returned result set will be. On the other hand, as search engine services have diversified in recent years, the network has become one of the major channels for daily news, so there is a growing demand for online clustering of Chinese short texts such as hot news keywords, as well as for semantic clustering of words.

The main research work of this paper focuses on the web crawler and on online clustering of short texts from search results. On the crawler side, the paper improves two core modules, page parsing and duplicate URL detection, to raise crawler performance.

(1) Page parsing. Instead of extracting content by matching HTML tags, as the open source crawler Weblech does, we propose converting the semi-structured document into XML and representing it as a DOM object from which the page content is extracted. This approach exploits the advantages of structured-document extraction and improves the implementation by reusing mature open source libraries such as DOM4J and JDOM to read and write XML.

(2) Duplicate URL detection. Efficiently eliminating duplicate links in a crawler is a hard problem. The Bloom filter, a classical probabilistic algorithm, is very space efficient for URL detection, but it suffers from false positives, and the false positive rate grows with the crawling scale. We therefore adopt a partitioned hashing method based on the Bloom filter to improve the crawler's duplicate URL elimination module.

The improved methods are implemented on Weblech, and the experimental results show that the improved crawler not only fetches about twice as many effective URLs as Weblech and the open source crawler Larbin, but also achieves higher throughput than both.

For online clustering of short texts from search results, existing online text clustering algorithms do not achieve satisfactory results on Chinese short texts, and clustering of words at the semantic or conceptual level is also needed. This paper therefore presents a Chinese online short-text clustering algorithm, in which an improved edit distance method measures the similarity between short texts and a method based on search engine results measures the semantic similarity between words.
As is well known, the open source Carrot2 framework uses the search results clustering algorithm Lingo, which not only clusters text effectively but also tries to discover latent semantic relationships between words. Our methods are therefore compared with Lingo on extracted hot search terms, and the results show that our clustering algorithm achieves a higher F-Measure than Lingo, which verifies the effectiveness of our methods. The above research has been applied in the Chinese national 863 program project "The automatic detection, analysis and assessment of video service website", where it solved the problems of web crawling and hot news clustering in the project.
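To make the page parsing idea in (1) concrete, the following is a minimal Java sketch of DOM-based content extraction with DOM4J. It assumes the fetched HTML has already been converted into well-formed XML (for example by an HTML tidying step); the class name and the XPath expression are illustrative assumptions, not the extraction rules used in the thesis.

    import org.dom4j.Document;
    import org.dom4j.DocumentHelper;
    import org.dom4j.Node;

    import java.util.List;

    // Parse a page (already tidied into well-formed XML) as a DOM tree
    // and pull out the text of the title and paragraph nodes.
    public class PageContentExtractor {

        public static String extractText(String wellFormedXml) throws Exception {
            Document doc = DocumentHelper.parseText(wellFormedXml);
            // XPath selection in DOM4J requires the jaxen library on the classpath.
            List<Node> nodes = doc.selectNodes("//title | //p");
            StringBuilder content = new StringBuilder();
            for (Node node : nodes) {
                content.append(node.getText().trim()).append('\n');
            }
            return content.toString();
        }

        public static void main(String[] args) throws Exception {
            String xml = "<html><head><title>Example</title></head>"
                       + "<body><p>Hello, crawler.</p></body></html>";
            System.out.print(extractText(xml));
        }
    }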
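For the duplicate URL elimination in (2), below is a minimal sketch of a partitioned Bloom filter in Java. The bit array is split into k equal slices and each hash function addresses only its own slice; the slice size and the double-hashing scheme used here are illustrative assumptions rather than the parameters chosen in the thesis.

    import java.util.BitSet;

    // Partitioned Bloom filter: k hash functions, each writing into its own
    // slice of the bit array, used to remember URLs that were already crawled.
    public class PartitionedBloomFilter {

        private final int k;          // number of hash functions / slices
        private final int sliceBits;  // bits per slice
        private final BitSet bits;

        public PartitionedBloomFilter(int k, int sliceBits) {
            this.k = k;
            this.sliceBits = sliceBits;
            this.bits = new BitSet(k * sliceBits);
        }

        // Returns true if the URL was (probably) seen before, false if it is new;
        // in both cases the URL's bits are set afterwards.
        public boolean addIfAbsent(String url) {
            boolean seen = true;
            int h1 = url.hashCode();
            int h2 = fnv1a(url);
            for (int i = 0; i < k; i++) {
                // Double hashing, then map the position into slice i only.
                int inSlice = Math.floorMod(h1 + i * h2, sliceBits);
                int pos = i * sliceBits + inSlice;
                if (!bits.get(pos)) {
                    seen = false;
                    bits.set(pos);
                }
            }
            return seen;
        }

        private static int fnv1a(String s) {
            int hash = 0x811c9dc5;
            for (int i = 0; i < s.length(); i++) {
                hash ^= s.charAt(i);
                hash *= 0x01000193;
            }
            return hash;
        }

        public static void main(String[] args) {
            PartitionedBloomFilter filter = new PartitionedBloomFilter(8, 1 << 20);
            System.out.println(filter.addIfAbsent("http://example.com/a")); // false: new URL
            System.out.println(filter.addIfAbsent("http://example.com/a")); // true: duplicate
        }
    }

One common motivation for the partitioned layout is that no two hash functions can ever collide on the same bit, which simplifies the analysis of the filter, at the cost of a slightly different false-positive behaviour than a single unpartitioned filter of the same total size.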
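The abstract mentions an improved edit distance for short-text similarity and a search-engine-based measure for word-level semantic similarity, but does not give their exact formulations, so the sketch below only shows the baseline it builds on: plain character-level Levenshtein distance, normalized into a similarity score in [0, 1].

    // Baseline edit-distance similarity for short texts (standard Levenshtein,
    // not the improved variant proposed in the thesis).
    public class EditDistanceSimilarity {

        public static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Similarity in [0, 1]: 1 for identical strings, 0 for completely different ones.
        public static double similarity(String a, String b) {
            if (a.isEmpty() && b.isEmpty()) return 1.0;
            int dist = levenshtein(a, b);
            return 1.0 - (double) dist / Math.max(a.length(), b.length());
        }

        public static void main(String[] args) {
            // The two phrases share the prefix "搜索引擎"; prints about 0.67.
            System.out.println(similarity("搜索引擎优化", "搜索引擎技术"));
        }
    }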
Keywords/Search Tags: Search Engine, Web Crawler, Online Short Texts Clustering, Edit Distance, Semantic Similarity Measure