
Research And Realization Of Web Crawler And Results Clustering In Search Engine

Posted on: 2012-08-18   Degree: Master   Type: Thesis
Country: China   Candidate: P Liang   Full Text: PDF
GTID: 2178330338992006   Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of the Internet and of search technology, more and more people obtain all kinds of information through the Internet and search engines. A search engine needs its spider to crawl the network regularly, fetching new web pages and indexing their content for subsequent retrieval, so the spider's crawling effectiveness directly influences the quality of the search results: the more pages that are indexed and the shorter the update cycle, the higher the chance that the results are relevant to the user's query, and the more complete the returned result set will be. On the other hand, as search engine services have diversified in recent years, the network has become one of the major channels for daily news, so there is a growing demand for online clustering of Chinese short texts such as hot news keywords, as well as for semantic clustering of words.

The main research work of this paper focuses on the web crawler and on online clustering of short texts from search results. On the crawler side, the paper improves two core modules, page parsing and duplicate URL detection, to raise crawler performance.

(1) Page parsing. Instead of extracting content by matching HTML tags, as the open source crawler Weblech does, we propose converting the semi-structured document into XML and representing it as a DOM object from which the page content is extracted. This approach exploits the advantages of structured-document extraction and improves the implementation by reusing mature open source libraries such as DOM4J and JDOM to read and write XML.

(2) Duplicate URL detection. Efficiently eliminating duplicate links in a crawler is a hard problem. The Bloom filter, a classical probabilistic algorithm, is very space efficient for URL detection, but it suffers from false positives, and the false positive rate grows with the crawling scale. We therefore adopt a partitioned hashing method based on the Bloom filter to improve the crawler's duplicate URL elimination module.

The improved methods are implemented on Weblech, and the experimental results show that the improved crawler not only fetches about twice as many effective URLs as Weblech and the open source crawler Larbin, but also achieves higher throughput than both.

For online clustering of short texts from search results, existing online text clustering algorithms do not achieve satisfactory results on Chinese short texts, and clustering of words at the semantic or conceptual level is also needed. This paper therefore presents a Chinese online short-text clustering algorithm, in which an improved edit distance method measures the similarity between short texts and a method based on search engine results measures the semantic similarity between words.
As is well known, the open source Carrot2 framework uses the search results clustering algorithm Lingo, which not only clusters text effectively but also tries to discover latent semantic relationships between words. Our methods are therefore compared with Lingo on extracted hot search terms, and the results show that our clustering algorithm achieves a higher F-Measure than Lingo, which verifies the effectiveness of our methods. The above research has been applied in the Chinese national 863 program project "The automatic detection, analysis and assessment of video service website", where it solved the problems of web crawling and hot news clustering in the project.
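To make the page parsing idea in (1) concrete, the following is a minimal Java sketch of DOM-based content extraction with DOM4J. It assumes the fetched HTML has already been converted into well-formed XML (for example by an HTML tidying step); the class name and the XPath expression are illustrative assumptions, not the extraction rules used in the thesis.

    import org.dom4j.Document;
    import org.dom4j.DocumentHelper;
    import org.dom4j.Node;

    import java.util.List;

    // Parse a page (already tidied into well-formed XML) as a DOM tree
    // and pull out the text of the title and paragraph nodes.
    public class PageContentExtractor {

        public static String extractText(String wellFormedXml) throws Exception {
            Document doc = DocumentHelper.parseText(wellFormedXml);
            // XPath selection in DOM4J requires the jaxen library on the classpath.
            List<Node> nodes = doc.selectNodes("//title | //p");
            StringBuilder content = new StringBuilder();
            for (Node node : nodes) {
                content.append(node.getText().trim()).append('\n');
            }
            return content.toString();
        }

        public static void main(String[] args) throws Exception {
            String xml = "<html><head><title>Example</title></head>"
                       + "<body><p>Hello, crawler.</p></body></html>";
            System.out.print(extractText(xml));
        }
    }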
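For the duplicate URL elimination in (2), below is a minimal sketch of a partitioned Bloom filter in Java. The bit array is split into k equal slices and each hash function addresses only its own slice; the slice size and the double-hashing scheme used here are illustrative assumptions rather than the parameters chosen in the thesis.

    import java.util.BitSet;

    // Partitioned Bloom filter: k hash functions, each writing into its own
    // slice of the bit array, used to remember URLs that were already crawled.
    public class PartitionedBloomFilter {

        private final int k;          // number of hash functions / slices
        private final int sliceBits;  // bits per slice
        private final BitSet bits;

        public PartitionedBloomFilter(int k, int sliceBits) {
            this.k = k;
            this.sliceBits = sliceBits;
            this.bits = new BitSet(k * sliceBits);
        }

        // Returns true if the URL was (probably) seen before, false if it is new;
        // in both cases the URL's bits are set afterwards.
        public boolean addIfAbsent(String url) {
            boolean seen = true;
            int h1 = url.hashCode();
            int h2 = fnv1a(url);
            for (int i = 0; i < k; i++) {
                // Double hashing, then map the position into slice i only.
                int inSlice = Math.floorMod(h1 + i * h2, sliceBits);
                int pos = i * sliceBits + inSlice;
                if (!bits.get(pos)) {
                    seen = false;
                    bits.set(pos);
                }
            }
            return seen;
        }

        private static int fnv1a(String s) {
            int hash = 0x811c9dc5;
            for (int i = 0; i < s.length(); i++) {
                hash ^= s.charAt(i);
                hash *= 0x01000193;
            }
            return hash;
        }

        public static void main(String[] args) {
            PartitionedBloomFilter filter = new PartitionedBloomFilter(8, 1 << 20);
            System.out.println(filter.addIfAbsent("http://example.com/a")); // false: new URL
            System.out.println(filter.addIfAbsent("http://example.com/a")); // true: duplicate
        }
    }

One common motivation for the partitioned layout is that no two hash functions can ever collide on the same bit, which simplifies the analysis of the filter, at the cost of a slightly different false-positive behaviour than a single unpartitioned filter of the same total size.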
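The abstract mentions an improved edit distance for short-text similarity and a search-engine-based measure for word-level semantic similarity, but does not give their exact formulations, so the sketch below only shows the baseline it builds on: plain character-level Levenshtein distance, normalized into a similarity score in [0, 1].

    // Baseline edit-distance similarity for short texts (standard Levenshtein,
    // not the improved variant proposed in the thesis).
    public class EditDistanceSimilarity {

        public static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Similarity in [0, 1]: 1 for identical strings, 0 for completely different ones.
        public static double similarity(String a, String b) {
            if (a.isEmpty() && b.isEmpty()) return 1.0;
            int dist = levenshtein(a, b);
            return 1.0 - (double) dist / Math.max(a.length(), b.length());
        }

        public static void main(String[] args) {
            // The two phrases share the prefix "搜索引擎"; prints about 0.67.
            System.out.println(similarity("搜索引擎优化", "搜索引擎技术"));
        }
    }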
Keywords/Search Tags: Search Engine, Web Crawler, Online Short Texts Clustering, Edit Distance, Semantic Similarity Measure