Webpage Content Extraction Techniques For Specific Topic

Posted on:2016-06-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Li

GTID:2348330536467738

Subject:Computer Science and Technology

Abstract/Summary:

As an important supporting technology of information science and network engineering, web crawlers play an increasingly prominent role in the information age, and the development of related technology has attracted worldwide attention. Web crawling is also a primary means of retrieving sensitive information from networks, and a large number of domestic and foreign researchers are studying the subject. However, as research deepens, web crawler technology faces challenges from the complexity and diversity of information on networks. To address this problem, this paper studies topical web crawling and sensitive information retrieval, and achieves the following results:

1. To effectively extract sensitive content from the network, this paper proposes a model of webpage content extraction techniques for a specific topic, which provides a basic framework for sensitive information retrieval based on web crawler techniques.
2. To effectively combine the performance advantage of topical web crawlers based on link analysis with the accuracy advantage of web crawlers based on content analysis, this paper designs and implements PageRank-based link analysis techniques.
3. To improve the accuracy of the topical web crawler in sensitive information retrieval and to maximize coverage of webpage content, this paper proposes a comprehensive webpage content analysis method based on the DOM tree model and keyword correlation analysis.
4. This paper designs and implements webpage sensitive content extraction techniques based on Scrapy, an open source crawler framework, and tests them on the Internet by retrieving sensitive information. The test results verify the effectiveness of this study.
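The PageRank-based link analysis mentioned in the abstract can be illustrated with a minimal sketch. This is the standard power-iteration form of PageRank, not the thesis's actual implementation, which the abstract does not describe. The function name `pagerank`, the damping factor of 0.85, and the toy link graph `example_links` are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=200):
    """Standard PageRank by power iteration (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping page -> rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes an equal share of its rank to its out-links.
        for src, targets in links.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share

        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page link graph, used only for illustration.
example_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(example_links)
```

In a topical crawler, scores like these could be used to order the queue of URLs waiting to be fetched, so that well-linked pages are visited first. How the thesis combines them with content analysis is not stated in the abstract.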

Keywords/Search Tags:

Topical Web Crawler, Web Content Extraction, Keyword Analysis, Link Analysis, Scrapy Crawler Framework

Related items

1	Analysis Of Dangdang Information Based On Scrapy Framework Crawler And Data Mining
2	Research And Realization Of Topical Crawler Based On Content And Hyperlink
3	Design And Development Of Distributed Crawler Based On Scrapy Framework
4	Research On Topical Crawler Combining Web Page Content And Hyperlink
5	Design And Implementation Of Distributed Web Crawler System Based On Scrapy
6	Scrapy Framework-based Web Crawler Achieved Data Capture And Analysis
7	Design And Implementation Of Web Crawler System Based On Scrapy Framework
8	Content Resource Evaluation Base On Web Crawler
9	Design And Implementation Of Customizable Crawler Engine In Content Convergent Subsystem
10	Design And Implementation Of Web Crawler For Given Page