
Focused Crawling

Posted on: 2009-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Xue
Full Text: PDF
GTID: 2208360245461401
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the web, traditional general-purpose search engines have exposed a series of problems, such as low coverage, heavy resource consumption, long update cycles, and poor relevance of results. To tackle these problems and to satisfy users' subject-oriented queries, topic-focused crawling technology came into being. Focused crawling builds on general-purpose search engine technology and applies machine learning and other intelligent methods to download more relevant pages at low cost. With its high degree of specialization and targeting, topic-focused crawling has held a place in the development of search engines since the 1990s. Currently, research on topic-focused crawling concentrates on two hotspots: document categorization and crawling strategy.
Document categorization is studied in this thesis. Topic-focused crawlers usually classify documents using the TF-IDF or SVM algorithms. However, the TF-IDF algorithm considers only term frequency (TF) and document frequency (DF). It neglects a term's positional information, even though terms in different positions differ in importance for classifying the document. To solve this problem, "An algorithm of term weighting based on information of term position" is proposed in this thesis: different weighting factors are assigned to terms in different positions, so that a term's weight objectively reflects its importance. This algorithm improves the accuracy of document categorization, and the weighting factors can be adjusted to obtain good results.
To address the disadvantages of Best-First search, and drawing on the information people use when judging whether a hyperlink is useful to them, "A crawling strategy based on comprehensive information of URL" is proposed in this thesis.
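The position-based term weighting idea described so far can be illustrated with a minimal Python sketch. It is not the thesis's actual algorithm: the position names (`title`, `heading`, `body`), the weighting factors in `POSITION_WEIGHTS`, and the plain logarithmic IDF are all assumptions chosen for illustration, since the abstract gives no concrete values.

```python
import math
from collections import defaultdict

# Hypothetical weighting factors per term position; the abstract only says
# that such factors exist and can be tuned, not what their values are.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def position_weighted_tf(doc):
    """Sum position weights per term.

    `doc` maps a position name to the list of tokens found at that position.
    Unknown positions fall back to a weight of 1.0.
    """
    tf = defaultdict(float)
    for position, tokens in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            tf[token] += weight
    return dict(tf)


def position_weighted_tfidf(docs):
    """Return one {term: weight} dict per document.

    IDF here is the plain log(N / df) form, used as an assumed stand-in.
    """
    n_docs = len(docs)
    tfs = [position_weighted_tf(doc) for doc in docs]
    df = defaultdict(int)
    for tf in tfs:
        for term in tf:
            df[term] += 1
    return [
        {term: w * math.log(n_docs / df[term]) for term, w in tf.items()}
        for tf in tfs
    ]


docs = [
    {"title": ["crawler"], "body": ["crawler", "web"]},
    {"title": ["search"], "body": ["web", "engine"]},
]
weights = position_weighted_tfidf(docs)
```

In this toy input, "crawler" appears once in a title and once in the body of the first document, so its position-weighted frequency is 3.0 + 1.0 before the IDF factor is applied, whereas a position-blind TF would count it as 2.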
The URL-based strategy calculates a predicted relevance for each URL by analyzing the similarity of the current web page, the URL's directory information, and the anchor text of the hyperlink, then places the URL in one of several priority crawling queues according to that predicted relevance. A URL with a low relevance value is not discarded but placed in a waiting queue. When the other crawling queues are empty, the system crawls these URLs to discover new on-topic web pages. This crawling strategy is simple, yet it improves both efficiency and recall.
Finally, based on "An algorithm of term weighting based on information of term position" and "A crawling strategy based on comprehensive information of URL", the design and implementation of the topic-focused crawler, including the system structure and methods, are detailed in this thesis.
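The queueing behavior of the crawling strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: the topic vocabulary `TOPIC_TERMS`, the linear combination in `SIGNAL_WEIGHTS`, the two thresholds, and the `Frontier` class are hypothetical, and the parent page's similarity is taken as an already computed number between 0 and 1.

```python
from collections import deque
from urllib.parse import urlparse

TOPIC_TERMS = {"crawler", "search"}   # hypothetical topic vocabulary
SIGNAL_WEIGHTS = (0.4, 0.2, 0.4)      # assumed: page, directory, anchor
HIGH_THRESHOLD = 0.6                  # assumed cut-off for the high queue
LOW_THRESHOLD = 0.3                   # below this, the URL only waits


def topic_overlap(tokens):
    """Fraction of tokens that are topic terms (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in TOPIC_TERMS) / len(tokens)


def predicted_relevance(parent_similarity, url, anchor_text):
    """Combine the three signals named in the abstract into one score."""
    dir_tokens = [p.lower() for p in urlparse(url).path.split("/") if p]
    anchor_tokens = anchor_text.lower().split()
    w_page, w_dir, w_anchor = SIGNAL_WEIGHTS
    return (w_page * parent_similarity
            + w_dir * topic_overlap(dir_tokens)
            + w_anchor * topic_overlap(anchor_tokens))


class Frontier:
    """Priority queues plus a waiting queue for low-relevance URLs."""

    def __init__(self):
        self.high = deque()
        self.medium = deque()
        self.waiting = deque()

    def push(self, url, score):
        if score >= HIGH_THRESHOLD:
            self.high.append(url)
        elif score >= LOW_THRESHOLD:
            self.medium.append(url)
        else:
            self.waiting.append(url)  # kept for later, never discarded

    def pop(self):
        # The waiting queue is served only when the other queues are empty.
        for queue in (self.high, self.medium, self.waiting):
            if queue:
                return queue.popleft()
        return None


frontier = Frontier()
low_url = "http://example.com/misc/page.html"
high_url = "http://example.com/search/crawler/"
frontier.push(low_url, predicted_relevance(0.2, low_url, "contact us"))
frontier.push(high_url, predicted_relevance(0.9, high_url, "focused crawler"))
first, second, third = frontier.pop(), frontier.pop(), frontier.pop()
```

Although the low-scoring URL is pushed first, it is popped only after the high-priority queue has been drained, which mirrors the idea of keeping weak candidates around as a source of new topic pages instead of dropping them.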
Keywords/Search Tags: topic focused crawler, term position, directory layer of URL, anchor text