Page Events Information Extraction

Posted on: 2011-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y M He
Full Text: PDF
GTID: 2208360305997821
Subject: Computer application technology
Abstract/Summary:
The Internet provides a large amount of information in many formats and for many uses, which makes it difficult to extract information consistently from pages on different websites. As a result, many automatic web information extractors exist, with varying degrees of robustness and flexibility. For the specific task of event extraction, this paper proposes an unsupervised web event extraction framework that can extract many kinds of hot events from multiple source websites. The framework covers a wide range of real-world events with high precision and can be applied to a large number of websites.

This paper first analyzes existing web information extraction systems and compares them with one another in every aspect. For the event extraction system itself, it observes that events are presented in two kinds of pages, table pages and detail pages, and proposes a different method for each: a DOM extraction method for table pages, which have a parallel structure, and a pattern extraction method for detail pages, which share some common word segmentation. The disadvantages of both methods are also discussed.

The experimental dataset comes from 15 websites well known for publishing events, and the evaluation uses recall and precision, the same measures used for information systems. The extraction results on these 15 websites, compared with those of a common wrapper-generation algorithm, indicate that the method is feasible and performs better than the wrapper-generation algorithm in the quality of detail-page extraction. The results also show that the method is effective for most of these websites.
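
As a rough illustration of two ideas mentioned in the abstract, the sketch below reads the parallel rows of a table page and scores the extracted records with precision and recall. It is not the thesis's implementation: the class `TableRowParser`, the function `precision_recall`, the sample HTML, the (title, date) record shape, and the exact-match scoring rule are all assumptions made for this example, and Python's streaming `HTMLParser` stands in for a full DOM tree.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of <td> cells, one tuple per <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a tuple of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, such as header rows
                self.rows.append(tuple(self._row))
            self._row = None


def precision_recall(extracted, gold):
    """Exact-match precision and recall of extracted records against a gold set."""
    extracted_set, gold_set = set(extracted), set(gold)
    correct = len(extracted_set & gold_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up table page: each row describes one event in the same column layout.
SAMPLE_HTML = """
<table>
  <tr><th>Event</th><th>Date</th></tr>
  <tr><td>Jazz Night</td><td>2011-07-21</td></tr>
  <tr><td>Book Fair</td><td>2011-08-01</td></tr>
  <tr><td>Tech Talk</td><td>2011-08-05</td></tr>
</table>
"""

# Made-up gold annotations for the same page.
GOLD = [
    ("Jazz Night", "2011-07-21"),
    ("Book Fair", "2011-08-01"),
    ("Film Festival", "2011-09-10"),
    ("Art Expo", "2011-09-12"),
]

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
precision, recall = precision_recall(parser.rows, GOLD)
```

Exact tuple matching is the strictest possible scoring rule; a real evaluation might match individual fields or tolerate small differences in formatting.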
Keywords/Search Tags: Information extraction (IE), Web event extraction, Table webpage, Detail webpage