Research On Topical Web Crawling

Posted on:2015-05-10

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Yang

GTID:1108330470467810

Subject:Computer Science and Technology

Abstract/Summary:

The World Wide Web (WWW) has become an important source of knowledge as its data volume grows rapidly. Many systems, such as search engines and enterprise intelligence systems, collect information from the Web, and research on web crawlers dates back to the early days of the Internet. As Web 2.0, social networks and the mobile Web have become ubiquitous, user-generated content has grown more common, which makes the data volume grow even faster. Web crawling has therefore remained an active research topic for decades.

Web crawling technologies are widely used in general search engines (e.g. Google, Baidu and Bing), which operate the most mature and powerful web crawlers. Most small and medium-sized enterprises, however, do not have the computing resources of these large companies and cannot crawl the whole Web, while general-purpose crawling lacks deep understanding of content and does not deliver specialized results. At the other end of the spectrum, most users only want to collect web pages on a specific topic, which led to the invention of the focused crawler (or topical crawler). Collecting as many pages related to a specific topic as possible with the least computing resources has become a major challenge.

In this dissertation, web crawling technologies are studied in depth, key problems are identified and solutions are provided. The major contributions are summarized below:

1. A web crawler framework centered on a topic knowledge base is proposed to address the three core problems of topical crawlers: topic requirement expression, hierarchical topic relatedness, and topical page cluster discovery. The framework provides a synthesized topic expression method, an adaptive knowledge learning process and a knowledge-based topic decision algorithm. Topic-rich domain mining is used to further improve crawling efficiency.

2. A topic requirement closure process based on a stable topic term set is proposed to capture users' open and dynamic topic requirements. An iterative expanding-filtering framework is further provided to construct the stable topic term set automatically from the core topic term. Frequent item mining and LDA analysis are used to expand the term set, and a knowledge base is used for filtering. The experimental results show that the stable topic term set represents the topic well.

3. An ontology-based topic decision algorithm is proposed for topical crawlers to address the hierarchical topic relatedness problem. Entities and the relations between them are used to reduce the dimensionality of web pages. The experimental results show that the synthesized ontology weighting method improves the accuracy of topic decisions and thus the harvest rate of the topical crawler.

4. A Topic Rich Domain First (TRDF) crawling strategy is proposed, based on the tendency of topical information on the Internet to aggregate. The TRDF algorithm divides the topic domain into three sets, each of which follows different fetch rules. The experimental results show that the TRDF algorithm outperforms existing algorithms in both precision and recall.
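
The iterative expanding-filtering idea in contribution 2 can be illustrated with a minimal sketch. This is not the dissertation's implementation: simple document co-occurrence counting stands in for the frequent item mining and LDA analysis mentioned in the abstract, the knowledge base is assumed to be a plain set of accepted terms, and every name here (`expand`, `stable_term_set`, `min_support`, `max_rounds`, the sample data) is hypothetical.

```python
from collections import Counter


def expand(terms, documents, min_support):
    """Propose new terms that co-occur with the current term set.

    Assumption: co-occurrence counting is a stand-in for the frequent
    item mining and LDA analysis named in the abstract.
    """
    counts = Counter()
    for doc in documents:
        words = set(doc)
        if words & terms:  # document mentions at least one current term
            counts.update(words - terms)
    return {w for w, c in counts.items() if c >= min_support}


def stable_term_set(core_term, documents, knowledge_base,
                    min_support=2, max_rounds=10):
    """Grow a term set from one core term until no new term survives filtering."""
    terms = {core_term}
    for _ in range(max_rounds):
        candidates = expand(terms, documents, min_support)
        # Filtering step: keep only candidates known to the (hypothetical)
        # knowledge base, modeled here as a set of accepted terms.
        accepted = candidates & knowledge_base
        if not accepted:  # nothing new was accepted, so the set is stable
            break
        terms |= accepted
    return terms


# Toy corpus: each document is a list of tokens.
docs = [
    ["crawler", "frontier", "url"],
    ["crawler", "frontier", "the"],
    ["frontier", "politeness", "the"],
    ["frontier", "politeness", "url"],
    ["crawler", "the", "url"],
    ["recipe", "the", "oven"],
]
kb = {"crawler", "frontier", "url", "politeness", "oven"}

result = stable_term_set("crawler", docs, kb)
print(sorted(result))  # ['crawler', 'frontier', 'politeness', 'url']
```

In this toy run the common word "the" is proposed by the expansion step in every round but is rejected by the knowledge-base filter, while "politeness" is only reached in the second round, after "frontier" has joined the set. The loop stops once a round adds no accepted term, which is one simple way to read "stable".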

Keywords/Search Tags:

Web Crawler, Topical Crawler, Topic Requirement, Ontology, Topic Rich Domain

Related items

1	Research And Implementation Of Topic Crawler Based On Domain Ontology
2	Research On Algorithms Of Real Estate-Ontology Topical Crawler
3	Design And Implementation Of Multithreading Web Crawler Oriented Topic
4	Research And Implementation On Algorithms Of Topical Crawler
5	The Topical Web Crawler Research In Vertical Search Engine
6	Research On Topic Focused Web Crawler And Related Technologies
7	Research And Design Of Topic Crawler Through Tunnels Algorithm
8	Research On The Key Technology Of Focused Crawler
9	Research And Implementation Of Scientific Topic Search Engine Crawler Based On Nutch
10	Research On The Topical Crawler For The Cultural Fields