A Study Of Web 2.0 Community Oriented Crawling Techniques

Posted on: 2012-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Gao
GTID: 2178330332976022
Subject: Computer application technology
Abstract/Summary:
Web 2.0 communities are among the most popular Internet applications today. Social networking, microblogging, online Q&A sites, and post bars are typical representatives. These websites are characterized by user participation in content creation and editing. In addition, they make extensive use of Ajax and other rich-client technologies to enhance the user experience.

Because the information sources of Web 2.0 communities are diverse, their unpredictable publishing patterns, the timeliness of information, uneven content quality, and heavy use of dynamic scripting all become prominent problems. These issues prevent traditional search engines from retrieving information effectively, so existing crawling techniques need improvement in both real-time search and the indexing of client-side dynamic content to adapt to this new wave of Internet evolution.

For real-time crawling, we focus on optimizing crawl scheduling based on predicted publishing patterns. We refine the local index quality metrics by introducing a new system for evaluating community content weight and combining it with delay metrics. We schedule the crawler to minimize the weighted delay, and derive an optimized solution using historical publishing data from a specific community.

We also add Ajax crawling capability. Since a single Ajax page contains multiple states, we model Ajax sites with a classic state transition graph. By introducing heuristic inspection of invalid elements and XMLHttpRequest monitoring, we improve both crawling performance and recall.

Finally, we propose a prototype crawler oriented to Web 2.0 communities and apply it successfully in a campus news search engine, which demonstrates the effectiveness of our approach in a practical application.
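
The weighted-delay scheduling objective mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes that each predicted post has a publish time and a content weight, that a crawl collects every post published since the previous crawl, and that delay is the gap between publish time and the crawl that collects the post. The function name `schedule_crawls` and the input format are hypothetical. Under these assumptions, an optimal schedule only ever crawls at predicted publish times, so the problem reduces to splitting the time-sorted posts into contiguous groups, which dynamic programming solves exactly.

```python
from typing import List, Tuple


def schedule_crawls(posts: List[Tuple[float, float]],
                    budget: int) -> Tuple[float, List[float]]:
    """Pick at most `budget` crawl times minimizing total weighted delay.

    posts  -- (predicted_publish_time, weight) pairs (hypothetical input format)
    budget -- number of crawls allowed within the planning horizon
    Returns (minimum total weighted delay, sorted crawl times).
    """
    if not posts:
        return 0.0, []
    if budget < 1:
        raise ValueError("budget must be at least 1")
    posts = sorted(posts)
    n = len(posts)
    k = min(budget, n)  # more crawls than posts cannot help

    # Prefix sums of weights and of weight * time, for O(1) group costs.
    w_sum = [0.0] * (n + 1)
    wt_sum = [0.0] * (n + 1)
    for idx, (t, w) in enumerate(posts):
        w_sum[idx + 1] = w_sum[idx] + w
        wt_sum[idx + 1] = wt_sum[idx] + w * t

    def group_cost(i: int, j: int) -> float:
        # Posts i..j-1 are all collected by one crawl at the time of post j-1.
        crawl_time = posts[j - 1][0]
        return crawl_time * (w_sum[j] - w_sum[i]) - (wt_sum[j] - wt_sum[i])

    inf = float("inf")
    # best[c][j]: minimum cost of covering the first j posts with c crawls.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):
                if best[c - 1][i] == inf:
                    continue
                candidate = best[c - 1][i] + group_cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Walk the recorded cut points backwards to recover the crawl times.
    times: List[float] = []
    j = n
    for c in range(k, 0, -1):
        times.append(posts[j - 1][0])
        j = cut[c][j]
    times.reverse()
    return best[k][n], times


if __name__ == "__main__":
    predicted = [(1.0, 1.0), (2.0, 1.0), (10.0, 5.0), (11.0, 1.0)]
    print(schedule_crawls(predicted, budget=2))
```

With two crawls, the sketch places one crawl after the early pair of posts and one at the end of the horizon, accepting a delay on the heavily weighted post only because no third crawl is available. A real scheduler would replace the fixed list of predicted posts with a publish-pattern model fitted to the community's history.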
Keywords/Search Tags: Web 2.0 Community, Publish Pattern, Crawl Scheduling, Transition Graph Model, Ajax Crawler