Research On Topical Web Crawling

Posted on: 2015-05-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Yang
GTID: 1108330470467810
Subject: Computer Science and Technology
Abstract/Summary:
The World Wide Web (WWW) has become an important source of knowledge as its data volume grows rapidly. Many systems (e.g. search engines and enterprise intelligence systems) collect information from the WWW. In fact, research on web crawlers has been ongoing since the early days of the Internet. As Web 2.0, social networks and the mobile Web become ubiquitous, user-generated content is increasingly common, which makes the data volume grow even faster. Hence, web crawling technologies have remained a hot research topic for decades.

Web crawling technologies are widely used in general search engines (e.g. Google, Baidu and Bing), which have the most mature and powerful web crawlers. But most small and medium-sized enterprises do not have the computing resources of the big companies to crawl the whole Web, and general-purpose crawlers lack deep content understanding and domain-specific results. At the other end of the spectrum, most people just want to collect web pages on a specific topic, which led to the invention of the focused crawler (or topical crawler). Collecting as many web pages related to a specific topic as possible with the least computing resources has become a major challenge.

In this dissertation, web crawling technologies are thoroughly studied, key problems are identified and solutions are provided. The major contributions of this dissertation are summarized below:

1. A web crawler framework centered on a topic knowledge base is proposed to address the three core problems of topical crawlers: topic requirement expression, hierarchical topic relatedness, and topical page cluster discovery. The framework provides a synthesized topic expression method, an adaptive knowledge learning process and a knowledge-based topic decision algorithm. Topic rich domain mining is used to further improve crawling efficiency.

2. A topic requirement closure process based on a stable topic term set is proposed to capture users' open and dynamic topic requirements. An iterative expanding-filtering framework is further provided to construct the stable topic term set automatically from the core topic term. Frequent item mining and LDA analysis are used to expand the term set; a knowledge base is used for filtering. The experimental results show that the stable topic term set represents the topic well.

3. An ontology-based topic decision algorithm is proposed for topical crawlers to address the hierarchical topic relatedness problem. Entities and the relations between entities are used to reduce the dimensionality of web pages. The experimental results show that the synthesized ontology weighting method can improve the accuracy of topic decisions and thus the harvest rate of the topical crawler.

4. A Topic Rich Domain First (TRDF) crawling strategy is proposed, based on the aggregation feature of topical information on the Internet. The TRDF algorithm divides the topic domain into three sets, which follow different fetch rules. The experimental results show that the TRDF algorithm outperforms current algorithms in both precision and recall.
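
The iterative expanding-filtering idea in contribution 2 can be illustrated with a minimal sketch. This is not the dissertation's implementation: plain document co-occurrence counting stands in for frequent item mining and LDA, the knowledge base is assumed to be a simple set of accepted terms, and the names `expand`, `filter_terms`, `stable_term_set`, `min_support` and `max_rounds` are hypothetical. The loop stops when a round no longer changes the term set, which is what "stable" is taken to mean here.

```python
from collections import Counter


def expand(terms, documents, min_support):
    """Add words that co-occur with any current term in at least
    min_support documents (a stand-in for frequent item mining / LDA)."""
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        if words & terms:                 # document mentions a known term
            counts.update(words - terms)  # count each new word once per doc
    return terms | {w for w, c in counts.items() if c >= min_support}


def filter_terms(terms, knowledge_base, core_term):
    """Keep only terms present in the (assumed) knowledge base,
    and always keep the core topic term."""
    return {t for t in terms if t in knowledge_base} | {core_term}


def stable_term_set(core_term, documents, knowledge_base,
                    min_support=2, max_rounds=10):
    """Alternate expansion and filtering until the term set stops changing."""
    terms = {core_term}
    for _ in range(max_rounds):
        candidate = filter_terms(
            expand(terms, documents, min_support), knowledge_base, core_term
        )
        if candidate == terms:            # fixed point reached: set is stable
            break
        terms = candidate
    return terms
```

For example, called with the core term "crawler", a handful of short documents and a small set of accepted terms, the function grows the set with co-occurring words, discards those the knowledge base does not contain, and returns once a round adds nothing new.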
Keywords/Search Tags: Web Crawler, Topical Crawler, Topic Requirement, Ontology, Topic Rich Domain