
Research On Web Information Extraction Techniques For Multi-channel Crawler

Posted on: 2017-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Ma
Full Text: PDF
GTID: 2348330503986817
Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of the Internet, the Web has become a large platform for publishing and consuming information. Because information spreads rapidly and reaches a wide audience, effective monitoring of online public opinion is essential. Web pages are inherently semi-structured and contain a large share of topic-irrelevant noise, so extracting the main content and filtering out this noise is both necessary and challenging. In a multi-channel crawler system that focuses on news, blogs, and forums, all of which are representative information channels, we face the following challenges: 1) an enormous number of websites must be monitored; 2) websites have different structures and varied layouts; 3) websites change occasionally. These challenges motivated us to propose highly automated Web information extraction techniques that reduce the cost of system expansion and maintenance.

For information-intensive websites such as Web news and blogs, we propose a template-independent content extraction approach based on valid characters (CEVC). To validate the approach, we conduct experiments on online news and blog pages arbitrarily crawled from well-known Chinese news and blog websites. Experimental results show that our method achieves an average F1-measure of 95.8% and outperforms the previous methods CETR and CEPR. Although CEVC's extraction performance is almost equivalent to that of CETD, CEVC depends less on the pre-processing stage and is therefore more widely applicable.

For typical forum websites, we exploit the date information that is ubiquitous in forum posts and propose a forum post extraction method (PEAN). To compare its effectiveness with that of MiBAT, which also uses date information to extract posts, we conduct experiments on various Chinese forums. Experimental results show that our method achieves much higher recall than MiBAT, and its F1-measure of 94.7% also exceeds MiBAT's.

To verify the practicality of the methods, we design a framework for a parallel multi-channel crawler system and implement a Web news crawler based on it. With the template-independent content extraction approach, the crawler can crawl new websites with little human effort. The results show that template-independent content extraction methods have practical value in a multi-channel crawler system.
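
The idea of using date information as an anchor for forum posts can be illustrated with a very small sketch. This is not the PEAN method, whose details the abstract does not give; it only shows, under simplifying assumptions, how a date stamp could mark the start of each post. The function name `split_posts_by_date`, the date pattern `DATE_RE`, and the assumption that the page has already been reduced to plain text lines are all hypothetical choices made for this illustration.

```python
import re

# Assumed date format for illustration only: YYYY-MM-DD with an optional time.
# Real forums use many formats; this pattern is not taken from the thesis.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")


def split_posts_by_date(lines):
    """Group text lines into posts, opening a new post at each line that contains a date.

    Lines that appear before the first date (navigation, headers) are discarded.
    """
    posts = []
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        match = DATE_RE.search(text)
        if match:
            # A date stamp is treated as the start of a new post.
            current = {"date": match.group(0), "body": []}
            posts.append(current)
        elif current is not None:
            current["body"].append(text)
    return posts


if __name__ == "__main__":
    page_lines = [
        "Forum home",
        "alice 2016-03-01 10:15",
        "First post text.",
        "bob 2016-03-01 10:20",
        "A reply.",
        "More reply.",
    ]
    for post in split_posts_by_date(page_lines):
        print(post["date"], post["body"])
```

A real extractor would work on the DOM tree rather than on flat text lines and would need to handle dates that appear inside post bodies; the sketch ignores both concerns.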
Keywords/Search Tags: multi-channel, crawler, information extraction, template independent