Research Of Focused Crawler About Group Of University Website Based On RSS

Posted on: 2013-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: R H Zhang
Full Text: PDF
GTID: 2218330374963962
Subject: Computer application technology
Abstract/Summary:
The Internet is growing rapidly and the number of web pages keeps increasing, so people who want specific information must read through a large number of pages. This wastes time and energy, and it still makes it hard to obtain the latest and most complete information. Information publishers, for their part, want more users to read their content in real time. Much research addresses this demand, including search engines backed by web crawlers and RSS information push technology, but both have limitations. Suppose, for example, that we want the latest notices from all the sites of a university by category, such as the latest research notices. A typical search engine cannot return a satisfactory result. RSS can push the latest information by category, but only from websites that provide an RSS feed, so it does not work for sites without one, such as a university website group. This study therefore investigates an RSS-based focused crawler, applies it to the problem above, and extends it to the university website group, where it achieves good results. The principle is to use a focused web crawler to crawl, analyze, and process the data of the site group and then offer the result as an RSS feed. In this way, people can use an RSS reader to subscribe to the latest categorized information even from websites that have no RSS feed. This reduces the time spent flipping through pages to find the latest information and lowers the chance of overlooking information.

The main contents of the study are as follows:
(1) An RSS-based focused crawler is proposed, so that users can subscribe to and read, in an RSS reader, the latest information from sites that do not provide an RSS feed. It filters out unwanted ads and spam and removes the trouble of searching for information.
(2) The TF-IDF algorithm is used to classify page text, and the extraction of category feature vectors is improved based on the characteristics of web pages, which improves the accuracy of the feature vectors and makes classification more accurate.
(3) The crawler's incremental crawling is improved. A new update-prediction algorithm, based on the traditional incremental algorithm, is proposed; it brings the predicted update time closer to the actual one, reducing system overhead and improving efficiency.
(4) The RSS-based focused crawler is applied to the university website group, and the PageRank algorithm is improved based on the characteristics of the university website group to raise the crawler's recall rate.
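
The classify-then-publish principle described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses plain TF-IDF with cosine similarity rather than the improved feature-vector extraction, and the function names, category sample texts, and example.edu URLs are all hypothetical. Crawled pages are assumed to already be available as dictionaries with `title`, `url`, and `text` keys.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain, smoothed TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(pages, category_samples):
    """Label each page with the category whose sample text it most resembles."""
    names = list(category_samples)
    texts = [category_samples[n] for n in names]
    texts += [p["title"] + " " + p["text"] for p in pages]
    vecs = tfidf_vectors(texts)
    cat_vecs, page_vecs = vecs[:len(names)], vecs[len(names):]
    labels = []
    for pv in page_vecs:
        scores = [cosine(pv, cv) for cv in cat_vecs]
        labels.append(names[scores.index(max(scores))])
    return labels


def build_rss(title, link, pages):
    """Serialize pages as a minimal RSS 2.0 channel and return it as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Latest notices: " + title
    for p in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = p["title"]
        ET.SubElement(item, "link").text = p["url"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical category samples and crawled pages, for illustration only.
category_samples = {
    "research": "research project grant funding application laboratory",
    "teaching": "course exam timetable lecture semester students",
}
pages = [
    {"title": "Call for research grant application",
     "url": "http://example.edu/a", "text": "funding for laboratory project"},
    {"title": "Final exam timetable",
     "url": "http://example.edu/b", "text": "semester course exam schedule for students"},
]

labels = classify(pages, category_samples)
research_pages = [p for p, label in zip(pages, labels) if label == "research"]
feed = build_rss("Research notices", "http://example.edu/", research_pages)
print(feed)
```

An RSS reader pointed at the generated feed would then show only the notices labeled as research, which is the subscription-by-category behavior the abstract describes. A real system would also need a proper tokenizer (especially for Chinese text), publication dates on each item, and the incremental re-crawl scheduling mentioned in item (3).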
Keywords/Search Tags: Focused web crawler, RSS, PageRank algorithm, TF-IDF algorithm, Incremental crawl