Research On Web Information Extraction Techniques For Multi-channel Crawler

Posted on:2017-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Ma

Full Text:PDF

GTID:2348330503986817

Subject:Computer Science and Technology

Abstract/Summary:

With the explosive growth of the Internet, the Web has become a major platform for publishing and consuming information. Because information spreads rapidly and reaches a wide audience, effective monitoring of online public opinion is essential. Web pages are inherently semi-structured and contain a large share of topic-irrelevant noise, so extracting the main content and filtering out this noise is both necessary and challenging. In a multi-channel crawler system that covers news, blogs, and forums, all representative information channels, we face the following challenges: 1) an enormous number of websites must be monitored; 2) websites differ in structure and layout; 3) websites change occasionally. These challenges motivate us to propose highly automated Web information extraction techniques that reduce the cost of system expansion and maintenance.

For information-intensive websites such as Web news and blogs, we propose a template-independent content extraction approach based on valid characters (CEVC). To validate the approach, we conduct experiments on online news and blog pages arbitrarily crawled from well-known Chinese news and blog websites. The experimental results show that our method achieves an average F1-measure of 95.8% and outperforms the previous methods CETR and CEPR. Although CEVC's extraction performance is almost equivalent to that of CETD, CEVC depends less on the pre-processing stage and is therefore more widely applicable.

For typical forum websites, we exploit the date information that is ubiquitous in forum posts and propose a forum post extraction method (PEAN). To compare its effectiveness with MiBAT, which also uses date information to extract posts, we conduct experiments on a variety of Chinese forums. The experimental results show that our method achieves much higher recall than MiBAT, and its F1-measure of 94.7% also exceeds MiBAT's.

To verify the practicality of the methods, we design a framework for a parallel multi-channel crawler system and implement a Web news crawler based on it. With the template-independent content extraction approach, the crawler can crawl new websites with little human effort. The results show that template-independent content extraction methods have practical value in multi-channel crawler systems.
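
The F1-measure reported for the extraction methods is the harmonic mean of precision and recall. The abstract does not say how the extracted text is matched against the reference text, so the following is only a minimal sketch that assumes token-level multiset overlap for a single page. The function name `extraction_f1` and the sample sentences are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def extraction_f1(extracted_tokens, gold_tokens):
    """Return (precision, recall, f1) for one page, using token-level overlap."""
    extracted = Counter(extracted_tokens)
    gold = Counter(gold_tokens)
    # Multiset intersection: a token counts only as often as it occurs in both.
    overlap = sum((extracted & gold).values())
    precision = overlap / sum(extracted.values()) if extracted else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: the extractor kept one noise token ("home")
# and missed one content token ("today").
gold = "the mayor opened the bridge today".split()
extracted = "home the mayor opened the bridge".split()
p, r, f1 = extraction_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

An average such as the 95.8% quoted above would then be obtained by averaging per-page scores over the test set, although the thesis may aggregate differently.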

Keywords/Search Tags:

multi-channel, crawler, information extraction, template independent

Related items

1	Template Independent Web Information Extraction Research
2	Based On Templated Web Crawler Technology Of Web Page Information Extraction
3	Study And Realization Of Template-based Web Crawler And Editing System
4	Study Of Web Crawler And Web Information Extraction
5	Design And Implementation Of The Crawler Log Data Information Extraction And Statistical System
6	The Design And Development Of Deep-Customizable Crawler Tool System
7	Design And Implementation Of Building Materials Information Oriented Web Crawler System
8	Research On Web Page Classification And Information Collection
9	Information Extraction Algorithm Based On The Template Matching In Traffic Standards
10	Research And Application Of Automatic Data Extraction From Template-generated Web Pages