The Research Of Focused Web Information Extraction Technology With Sparse Sample

Posted on: 2015-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: L H Luo
GTID: 2298330422989408
Subject: Computer application technology
Abstract/Summary:
With the information explosion of recent years, simple information retrieval alone is no longer enough to obtain useful information from massive data, and Web information extraction has therefore become a central issue. Researchers in China and abroad have proposed a variety of solutions to cope with the massive volume and heterogeneous characteristics of Web data, but there is still no effective Web information extraction system that can adapt to a variety of different extraction tasks.

Based on this situation, this thesis proposes a Web information extraction method for sparse sample sets. With this method, users can extract specific information from a large number of pages that are structurally similar to the sparse samples they have provided, usually a set containing a single sample. The main content of this thesis is as follows:

(1) Computing the structural similarity between two pages usually requires downloading the full content of both pages, which makes it difficult to capture a very large number of similar pages. In this thesis, the features of page URLs are used to compute an approximate similarity between pages. Based on this fast structural capturing algorithm, a focused crawler was designed to provide a practical data source for the subsequent Web information extraction.

(2) A feature model for DOM tree nodes and a feature selection method for Web information extraction under a sparse sample set are proposed. The main idea is that a DOM tree node can be described by its neighbors. The atomic features of a node, together with the structural features determined by its parent, child and sibling nodes, are selected by an optimized user-oriented method based on statistical measures and users' interests. Information is then extracted by feature comparison.

(3) A Web information extraction system for sparse sample sets is implemented. The user provides only a sample page, and the system finds structurally similar pages and automatically extracts the corresponding content. The system can be applied to a variety of information extraction tasks and can provide cross-domain (cross-website) sorting, searching and other structured information services.
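
The abstract does not give the concrete algorithms, so the following Python sketch only illustrates the two ideas in items (1) and (2) under stated assumptions; it is not the thesis's implementation. It assumes that URL similarity can be approximated by comparing coarse URL "shapes", and that a DOM node can be matched by comparing a set of features built from its own tag and class and from its parent, children and siblings. The function names (`url_shape`, `url_similarity`, `node_features`, `extract_like`), the Jaccard measure, the `example.com` URLs and the sample markup are all hypothetical choices made for illustration. The standard-library XML parser is used, so the pages are assumed to be well-formed markup.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def url_shape(url):
    """Reduce a URL to a coarse pattern: host, generalized path segments, query keys."""
    parts = urlparse(url)
    shape = [parts.netloc.lower()]
    for seg in parts.path.split("/"):
        if seg:
            # Collapse digit runs so /product/123 and /product/456 share one shape.
            shape.append(re.sub(r"\d+", "<num>", seg.lower()))
    keys = sorted(p.split("=")[0] for p in parts.query.split("&") if p)
    shape.append("?" + "&".join(keys))
    return tuple(shape)


def url_similarity(a, b):
    """Fraction of aligned shape components that match (1.0 means same shape)."""
    sa, sb = url_shape(a), url_shape(b)
    return sum(x == y for x, y in zip(sa, sb)) / max(len(sa), len(sb))


def node_features(node, parent_of):
    """Describe a node by its own tag/class plus its parent, siblings and children."""
    feats = {"tag:" + node.tag, "class:" + node.get("class", "")}
    parent = parent_of.get(node)
    if parent is not None:
        feats.add("parent:" + parent.tag + "." + parent.get("class", ""))
        feats.update("sibling:" + s.tag for s in parent if s is not node)
    feats.update("child:" + c.tag for c in node)
    return feats


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for the thesis's feature comparison."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def extract_like(sample_root, sample_node, target_root):
    """Return the node of the target page whose features best match the sample node."""
    sample_parents = {c: p for p in sample_root.iter() for c in p}
    target_parents = {c: p for p in target_root.iter() for c in p}
    wanted = node_features(sample_node, sample_parents)
    return max(
        target_root.iter(),
        key=lambda n: jaccard(wanted, node_features(n, target_parents)),
    )


# Hypothetical single-sample scenario: the user marks the price on one page.
sample_page = ET.fromstring(
    '<div class="item"><h1 class="title">Red Lamp</h1>'
    '<span class="price">19.90</span><p class="desc">A lamp.</p></div>'
)
other_page = ET.fromstring(
    '<div class="item"><h1 class="title">Blue Chair</h1>'
    '<span class="price">45.00</span><p class="desc">A chair.</p></div>'
)
match = extract_like(sample_page, sample_page.find("span"), other_page)
print(match.text)  # prints 45.00
```

In this sketch a crawler would keep only candidate links whose `url_similarity` to the sample page's URL is high, which avoids downloading pages that are unlikely to share the sample's structure. The feature sets are unweighted here; the statistical, user-oriented feature selection described in item (2) would replace that simplification.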
Keywords/Search Tags: Web information extraction, sparse example, feature model, feature selection