The Research On Focused Web Information Extraction

Posted on: 2016-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Dai
Full Text: PDF
GTID: 2308330479995434
Subject: Computer application technology
Abstract/Summary:
The Internet has become one of the largest carriers of information and holds enormous value. Search engines such as Google and Baidu, for example, provide precise and efficient services built on Internet information. How to use web information effectively has therefore become an important research topic. The massive scale, dynamic nature, and heterogeneity of cross-domain web information pose challenges for web information extraction. To improve extensibility, this thesis studies methods for web information acquisition and information extraction. The main contributions are as follows:

(1) We propose an effective unsupervised focused crawler based on URL structure filtering (UURLSF). It guides the crawl by analyzing URLs and is more efficient than other approaches. Its unsupervised weighting mechanism also improves the portability of the focused crawler.

(2) We propose a visual unit-based extraction method that extracts news content according to visual units. The visual units are identified top-down from visual features and text features. Because a visual unit is independent of the HTML markup, the method is more portable while still extracting content effectively.

(3) We propose a modeless approach, web information extraction based on incremental clustering. It is a modeless, data-driven reasoning mechanism, and it introduces global and local stability-based methods for evaluating the clustering. Experimental results show that the approach adapts well to the rapid growth of Internet data.
Keywords/Search Tags: Information extraction, Focused crawler, Visual unit, Incremental clustering
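
The abstract does not spell out how URL structure filtering works, so the following is only a minimal sketch of the general idea behind contribution (1), not the thesis's UURLSF algorithm. It assumes that URLs can be reduced to a coarse structural pattern by replacing digit runs with a placeholder, and that pattern weights can be learned without labels as relative frequencies over a set of seed URLs. The function names (`url_pattern`, `learn_weights`, `should_crawl`), the `<num>` placeholder, the threshold value, and the example.com URLs are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to a coarse structural pattern (hypothetical scheme)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Collapse digit runs so that dates and article IDs share one pattern.
    generalized = [re.sub(r"\d+", "<num>", s) for s in segments]
    return parsed.netloc + "/" + "/".join(generalized)


def learn_weights(seed_urls):
    """Assumed unsupervised weighting: relative frequency of each pattern."""
    counts = Counter(url_pattern(u) for u in seed_urls)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


def should_crawl(url, weights, threshold=0.5):
    """Follow a link only if its pattern weight reaches the threshold."""
    return weights.get(url_pattern(url), 0.0) >= threshold


seeds = [
    "http://news.example.com/2016/07/24/article_1001.html",
    "http://news.example.com/2016/07/25/article_1002.html",
    "http://news.example.com/about.html",
]
weights = learn_weights(seeds)
```

With these seeds, a previously unseen article URL that shares the dominant date-and-ID structure passes the filter, while the rarer `about.html` pattern and any unknown pattern are skipped. A real crawler would likely combine such a filter with content-based relevance checks.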