Webpage Content Extraction Techniques For Specific Topic

Posted on: 2016-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Li
Full Text: PDF
GTID: 2348330536467738
Subject: Computer Science and Technology
Abstract/Summary:
As an important supporting technology of information science and network engineering, web crawlers play an increasingly prominent role in the information age, and the development of related technology has attracted worldwide attention. Web crawler technology is also a primary means of retrieving sensitive information from networks, and a large number of domestic and foreign researchers are studying the subject. As research advances, however, web crawler technology is challenged by the complexity and diversity of information on networks. To address this problem, this thesis studies topical web crawling and sensitive information retrieval, and achieves the following results:
1. To extract sensitive content from the network effectively, this thesis proposes a model of webpage content extraction for specific topics, which provides a basic framework for sensitive information retrieval based on web crawler techniques.
2. To combine the performance advantage of topical web crawlers based on link analysis with the accuracy advantage of web crawlers based on content analysis, this thesis designs and implements PageRank-based link analysis techniques.
3. To improve the accuracy of the topical web crawler in sensitive information retrieval and to maximize coverage of webpage content, this thesis proposes a comprehensive webpage content analysis method based on the DOM tree model and keyword correlation analysis.
4. This thesis designs and implements webpage sensitive content extraction techniques based on Scrapy, an open source crawler framework, and tests them on the Internet by retrieving sensitive information. The test results verify the effectiveness of this study.
Keywords/Search Tags: Topical Web Crawler, Web Content Extraction, Keyword Analysis, Link Analysis, Scrapy Crawler Framework
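
The abstract names PageRank-based link analysis but gives no implementation details. As a rough illustration only, the sketch below shows the textbook PageRank power iteration on a small link graph; it is not the thesis's implementation. The function name `pagerank`, the damping factor 0.85, the iteration count, and the sample graph are all assumptions introduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch).

    links: dict mapping each page to a list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outbound links): spread its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: "c" is linked from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In a topical crawler, scores like these could be used to prioritize which discovered links to fetch first. How the thesis combines them with its content analysis is not stated in the abstract.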